Audio boards based on the ESP32 are quite under the radar and have lovely features for just a few bucks. Running LibreASR on a RPi should also be feasible soon.
IBM kept their US and UK models apart. May have been historical reasons or dataset size.
As an FYI, I was told "the money" was in specific "dictionaries" for medical professionals and so forth. Apparently, doctors liked to dictate straight into text. Might be worth trying that $$$€€€£££?
Do you get good results when adding scraped YouTube audio? My model performance on LibriSpeech dev drops a bit when adding YouTube audio to the training dataset (my guess: likely due to poor alignment from auto-generated captions).
I haven't trained on LibriSpeech exclusively, but yes, the perf on LibriSpeech dev is quite bad, around 60 WER. If the poor alignment of YT captions is the issue, maybe concatenating multiple samples helps a bit.
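One cheap way to dilute caption misalignment at clip boundaries is to stitch several short clips into one longer training sample, so a fixed timing error becomes a smaller fraction of the utterance. A minimal sketch (hypothetical helper, not part of LibreASR), with audio as plain lists of float samples:

```python
def concat_samples(samples, sr=16000, gap_s=0.1):
    """Concatenate (audio, transcript) pairs into one longer sample.

    `samples` is a list of (audio, text) pairs, where `audio` is a
    list of float samples at rate `sr`. A short silence of `gap_s`
    seconds is inserted between utterances.
    """
    gap = [0.0] * int(sr * gap_s)  # silence padding between clips
    audio, texts = [], []
    for i, (a, t) in enumerate(samples):
        if i:
            audio.extend(gap)
        audio.extend(a)
        texts.append(t)
    return audio, " ".join(texts)
```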
Would it be possible to train on any of the more recent text-to-speech engines out there? Some of them are very realistic.
This would give you absolutely perfect sync down to the word, I assume... I don't know about the cost if you paid rate card, though; perhaps you can do some partnership with them, since yours is a symmetrical product.
I've been working on this for a while now.
While there are other on-premise solutions using older models such as DeepSpeech [0], I haven't found a
deployable project supporting multiple languages using the recent RNN-T Architecture [1].
Please note that this does not achieve SotA performance.
Also, I've only trained it on one GPU so there might be room for improvement.
Edit: Don't expect good performance :D this is still in early-stage development. I am looking for contributors :)
You can also check out https://github.com/TensorSpeech/TensorFlowASR for inspiration (not my project, not involved). It implements streaming transformers and conformer RNN-T (but in TF2). Deployment on device as TFLite. So far, there aren't many usable pretrained models available (just LibriSpeech), but with some work it could turn out quite nicely.
Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).
Mozilla Common Voice data is already used for training.
Thanks for sharing this project. What do you think of the data quality in Mozilla Common Voice? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays before the speaking begins, etc.
I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.
Depending on your objective, noisy data might be useful.
I'd like LibreASR to also work in noisy environments, so training on data that is noisy should already help a bit with that.
But yeah - stammering and delays are present not only in Common Voice but also in Tatoeba and YouTube.
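For reference, deliberate noise augmentation can be as simple as mixing scaled white noise into clean clips at a target SNR; training on inputs like these tends to help in noisy rooms. A minimal stdlib-only sketch (not LibreASR's actual pipeline):

```python
import math
import random

def mix_noise(clean, snr_db):
    """Mix white noise into a clean signal at a target SNR in dB.

    `clean` is a list of float samples. The noise is scaled so that
    10 * log10(signal_power / noise_power) equals `snr_db`.
    """
    sig_power = sum(x * x for x in clean) / len(clean)
    noise = [random.gauss(0.0, 1.0) for _ in clean]
    noise_power = sum(n * n for n in noise) / len(noise)
    # scale the noise to hit the requested signal-to-noise ratio
    scale = math.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(clean, noise)]
```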
There is also a lot of audiobook data in many languages that is easy to scrape and align using a basic model updated for each language. Alternatively, you can use YouTube videos with subtitles that are almost aligned to bootstrap a first version of the model, then realign.
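If you go the YouTube-subtitles route, the auto captions come down as .vtt files, and a rough cue parser is enough to slice audio for a first-pass model. A minimal sketch (real files may contain styling cues and positioning tags that this ignores):

```python
import re

# Matches WebVTT cue timings like "00:00:01.000 --> 00:00:03.500"
CUE = re.compile(r"(\d+):(\d+):(\d+)\.(\d+) --> (\d+):(\d+):(\d+)\.(\d+)")

def _to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_vtt(text):
    """Return (start_s, end_s, caption) tuples from a .vtt string."""
    segments = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        m = CUE.search(lines[i])
        if m:
            start = _to_seconds(*m.groups()[:4])
            end = _to_seconds(*m.groups()[4:])
            i += 1
            caption = []
            while i < len(lines) and lines[i].strip():
                caption.append(lines[i].strip())
                i += 1
            segments.append((start, end, " ".join(caption)))
        else:
            i += 1
    return segments
```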
For the compute problem: maybe you can use a GPU-powered cloud server such as https://www.paperspace.com/
I don't know the current prices, but I remember it being quite affordable.
I find that helps with the annoyance of downloading things off of YT. This is for music obviously, but there's an option to download subtitles as well.
EDIT: Typed this from memory, there may be errors in the alias.
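For what it's worth, with yt-dlp such an alias might look roughly like this (my own reconstruction, not the commenter's original):

```shell
# Hypothetical alias along the lines the comment above describes:
# -x extracts audio, --audio-format converts it, and
# --write-auto-subs grabs YouTube's auto-generated captions.
alias ytaudio='yt-dlp -x --audio-format wav --write-auto-subs --sub-langs en'
```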
That's called setting expectations. If you know your project might interest people but needs work, why pretend it's good when it's not? They seem to be courting contributors more than users anyway.
I found the video to be funny. It nicely highlights both the current limitations and the ambition of the project. Bold choice, certainly, but I think it works.
and it's maybe a dig at Macron's accent at the same time :D
Although the author is a student in Germany. Anyway, you should join the Discord; we discussed this there too...