> Are there any really good open source speech to text programs?
I've looked into the field this year (exploring whether to build a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).
There are no good speech recognition programs for full text(!) in general. This is in contrast to single-sentence speech recognition, which is decent.
Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at a minimum under-productized), like sentence boundary detection, punctuation, etc.
Yes, there are really good open source speech-to-text tools (the common name for this is automatic speech recognition, ASR).
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google's speech recognition toolkit, with support mostly for end-to-end models.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are free for non-commercial applications only (otherwise you need a commercial licence). Disclaimer: I work at the university chair which develops these frameworks.
To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps of the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
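For anyone unfamiliar with WER: it is usually computed as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognized text, divided by the number of reference words. Note that it says nothing about timing, which is why a model can have a good WER and still give poor time-stamps. A minimal Python sketch, assuming plain whitespace tokenization and no text normalization; the function name `word_error_rate` is made up for illustration and does not come from any of the toolkits above:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluations normally lowercase the text and strip punctuation before scoring, so numbers from this sketch are only comparable if both transcripts are normalized the same way.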
Are there any really good open source speech to text programs? I imagine it's going to involve a pre-trained neural net.
[update] Following a thread https://news.ycombinator.com/item?id=20097542
It looks like I might be able to do this (speech recognition) in less than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech