
This is amazing, I wonder how I can do this offline, using open source tools.

Are there any really good open source speech to text programs? I imagine it's going to involve a pre-trained neural net.

[update] Following a thread https://news.ycombinator.com/item?id=20097542

It looks like I might be able to do this (speech recognition) in less than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech
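To get audio into DeepSpeech you need 16 kHz, 16-bit mono PCM. Here is a minimal sketch of that preparation using only the standard library; the actual model call (commented out at the end) assumes you have run `pip install deepspeech` and downloaded the released model files, so treat those names as illustrative.

```python
import array
import math
import os
import tempfile
import wave

SAMPLE_RATE = 16000  # sample rate DeepSpeech's released models expect

def make_test_wav(path, seconds=1.0, freq=440.0):
    """Write a mono 16-bit 16 kHz sine-wave WAV, just for testing."""
    n = int(SAMPLE_RATE * seconds)
    samples = array.array("h", (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n)
    ))
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(samples.tobytes())

def load_pcm16(path):
    """Read a WAV file back into a buffer of 16-bit samples."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == SAMPLE_RATE, "resample to 16 kHz first"
        assert w.getnchannels() == 1, "convert to mono first"
        return array.array("h", w.readframes(w.getnframes()))

path = os.path.join(tempfile.gettempdir(), "deepspeech_test.wav")
make_test_wav(path)
audio = load_pcm16(path)
print(len(audio))  # 16000 samples = 1 second of audio

# With the deepspeech package and model files in place (hypothetical paths):
#   import deepspeech, numpy as np
#   model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
#   text = model.stt(np.frombuffer(audio.tobytes(), dtype=np.int16))
```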



> Are there any really good open source speech to text programs?

I've looked into the field this year (exploring building a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).

There are no good long-form (document-level) speech recognition programs in general. This is in contrast to sentence-level speech recognition, which is decent.

Once you go beyond a single sentence you run into a lot more problems which are generally under-researched (or at a minimum under-productized), like sentence boundary detection, punctuation, etc.
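To make the problem concrete, here is a deliberately naive sketch of one piece of it: segmenting an unpunctuated transcript into sentence-like chunks using inter-word pauses. The `(word, start, end)` timestamps are invented for illustration; real systems use trained punctuation/segmentation models rather than a fixed pause threshold.

```python
def segment_by_pauses(words, pause_threshold=0.6):
    """Split (word, start_sec, end_sec) tuples into chunks at long pauses."""
    chunks, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end >= pause_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
        prev_end = end
    if current:
        chunks.append(" ".join(current))
    return chunks

# Made-up timestamps with a 0.8 s pause after "tools":
words = [
    ("open", 0.0, 0.3), ("source", 0.35, 0.7), ("tools", 0.75, 1.1),
    ("are", 1.9, 2.1), ("great", 2.15, 2.5),
]
print(segment_by_pauses(words))  # ['open source tools', 'are great']
```

This heuristic fails on fast speakers and hesitations, which is exactly why the sub-problem remains hard.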


Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).

Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.

Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.

ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.

RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial use only (or you need a commercial licence). (Disclaimer: I work at the university chair which develops these frameworks.)

Wav2Letter (https://github.com/facebookresearch/wav2letter), the tool by Facebook.

These are probably just the most well known; there are many others as well. DeepSpeech is inferior to all of the above, but maybe simpler.

The data to train these also matters, but you will find quite a few resources online for English, e.g. TED-LIUM, LibriSpeech, etc.

You will find lots of resources actually for ASR. Some random links:

https://github.com/gooofy/zamia-speech

https://commonvoice.mozilla.org/en/datasets

https://www.openslr.org/resources.php

To add: if you want to build something like Descript, you are mostly also interested in accurate timestamps for the recognized text (start and end time of each spoken word). End-to-end models are usually not so good at this (the goal is mostly a good word-error-rate (WER)). A conventional hybrid NN-HMM is maybe actually the better choice for this task.
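For reference, the WER metric mentioned above is just the word-level Levenshtein edit distance divided by the reference length. A minimal self-contained implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words:
print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```

Note that a low WER says nothing about timestamp accuracy, which is why it can be a misleading metric for a Descript-style product.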



