
I'm using NixOS with i3 as my daily driver, can recommend.


ESP32-based audio boards are quite under the radar and have lovely features for just a few bucks. Running LibreASR on a Raspberry Pi should also be feasible soon.

Thank you for your kind words! :)


Yes, probably. The data I trained on mostly reflects UK and US accents.


IBM kept their US and UK models separate. That may have been for historical reasons or due to dataset size.

As an FYI, I was told “the money” was in specific “dictionaries” for medical professionals and so forth. Apparently, doctors liked to dictate straight into text. Might be worth chasing that $$$€€€£££?


Hey blackcat! Your project [0] helped me a lot! Pre-training the encoder sounds great, I'll maybe add it in the future.

[0] https://github.com/theblackcat102/Online-Speech-Recognition


I have not yet trained a French model. Also, the gif shows Macron speaking to Congress with his English accent [0].

[0] https://www.youtube.com/watch?v=RqUc1h7bZQ4


Right, fixed it, thank you :D


As I commented above, very poorly. It's still early days.


LibriSpeech, Tatoeba, Common Voice and scraped YouTube videos.


Do you get good results when adding scraped YouTube audio? My model's performance on LibriSpeech dev drops a bit when I add YouTube audio to the training dataset (my guess is that this is due to poor alignment from auto-generated captions).


I haven't trained on LibriSpeech exclusively, but yes, the performance on LibriSpeech dev is quite bad, around 60 WER. If the poor alignment of YouTube captions is the issue, maybe concatenating multiple samples helps a bit.
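
To illustrate what I mean by concatenation, here's a minimal numpy sketch (clips, texts, and the length cap are hypothetical placeholders; it assumes all clips share one sample rate):

    import numpy as np

    def concat_samples(clips, texts, max_seconds=20, sr=16000):
        # Merge consecutive (audio, transcript) pairs into longer samples;
        # misalignments at the clip boundaries get diluted over a longer utterance.
        merged, buf_audio, buf_text = [], [], []
        for clip, text in zip(clips, texts):
            buf_audio.append(clip)
            buf_text.append(text)
            if sum(len(a) for a in buf_audio) >= max_seconds * sr:
                merged.append((np.concatenate(buf_audio), " ".join(buf_text)))
                buf_audio, buf_text = [], []
        if buf_audio:
            merged.append((np.concatenate(buf_audio), " ".join(buf_text)))
        return merged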


You should consider realignment; maybe start with something like DSAlign or my wav2train project.


Would it be possible to train on any of the more recent text-to-speech engines out there? Some of them are very realistic.

This would give you absolutely perfect sync down to the word, I assume... I don't know about the cost if you paid rate card though; perhaps you could do some partnership with them, since yours is a symmetrical product.
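
A minimal sketch of the idea, using the offline pyttsx3 engine purely as a stand-in (a commercial neural TTS would sound far more realistic; the sentence list is a placeholder corpus):

    import pyttsx3

    # Generate perfectly aligned (transcript, audio) training pairs from TTS.
    engine = pyttsx3.init()
    sentences = ["hello world", "speech recognition is hard"]
    for i, text in enumerate(sentences):
        engine.save_to_file(text, f"tts_{i:06d}.wav")
    engine.runAndWait()
    # Each wav now matches its transcript exactly, so no realignment is needed.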


The upper transcript is YouTube's automatic transcription. Below is the web app transcribing live. And yes, it is actually missing a few words.


Hey HN!

I've been working on this for a while now. While there are other on-premise solutions using older models such as DeepSpeech [0], I haven't found a deployable project supporting multiple languages that uses the recent RNN-T architecture [1] (see the sketch after the links below).

Please note that this does not achieve SotA performance. Also, I've only trained it on one GPU so there might be room for improvement.

Edit: Don't expect good performance :D this is still in early-stage development. I am looking for contributors :)

[0] https://github.com/mozilla/DeepSpeech

[1] https://arxiv.org/abs/1811.06621
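
For anyone curious what RNN-T means in practice, here's a rough PyTorch sketch of the joint network at the heart of [1] (dimensions and naming are illustrative, not the actual LibreASR code):

    import torch
    import torch.nn as nn

    class RNNTJoint(nn.Module):
        # Combines the audio encoder output (one vector per frame) with the
        # prediction network output (one vector per emitted label) and yields
        # logits over the vocabulary plus a blank token.
        def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
            super().__init__()
            self.enc_proj = nn.Linear(enc_dim, joint_dim)
            self.pred_proj = nn.Linear(pred_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for blank

        def forward(self, enc, pred):
            # enc: (B, T, enc_dim), pred: (B, U, pred_dim)
            # Broadcast-add over the whole (T, U) lattice.
            joint = self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
            return self.out(torch.tanh(joint))  # (B, T, U, vocab_size + 1)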


You can also check out https://github.com/TensorSpeech/TensorFlowASR for inspiration (not my project, not involved). It implements streaming transformers and conformer RNN-T (but in TF2), with on-device deployment via TFLite. So far there aren't many usable pretrained models available (just LibriSpeech), but with some work it could turn out quite nicely.
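
The TFLite route is roughly as follows (a sketch, assuming you already have a SavedModel export; the SELECT_TF_OPS fallback is often needed for recurrent layers):

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("asr_saved_model")
    # Fall back to select TF ops for layers without native TFLite kernels.
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,
    ]
    tflite_model = converter.convert()
    with open("asr_model.tflite", "wb") as f:
        f.write(tflite_model)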


Hi! What would you need to implement another language, e.g. Italian or French? I mean: is the problem a lack of data, or something else?

Another question: could you use, for example, Mozilla's voice data to train/test?


Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).

Mozilla Common Voice data is already used for training.


Thanks for sharing this project. What do you think of the Mozilla Common Voice data? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays before the speaking begins, etc.

I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.


Depending on your objective, noisy data might be useful. I'd like LibreASR to also work in noisy environments, so training on noisy data should already help a bit with that. But yeah - stammering and delays are present not only in Common Voice but also in Tatoeba and YouTube.
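
If your data is too clean, you can also synthesize noise robustness. A minimal numpy sketch of mixing background noise into a clip at a target SNR (the speech/noise arrays are placeholders):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Loop or trim the noise to match the speech length.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[: len(speech)]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-10
        # Scale noise so 10 * log10(speech_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise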


There is also a lot of audiobook data in many languages that is easy to scrape and align using a basic model adapted for each language. Alternatively, you can use YouTube videos with subtitles that are almost aligned to bootstrap a first version of the model, then realign.


For the compute problem: maybe you could use a GPU-powered cloud server such as https://www.paperspace.com/. I don't know the current prices, but I remember it was quite affordable.


> I remember it was quite affordable.

Relative to what? Paperspace is one of the costlier GPU providers.


Okay, you are right, but it's also really performant, so imho you can get a lot of work done in less time.

For something cheaper, I read this post on Reddit:

https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_clo...


Performant? It's the same GPU..?


Why does it take a lot of data? Afaik you can select a lower quality in youtube-dl, but you don't even need the video, do you?


> Why does it take a lot of data? Afaik you can select a lower quality in youtube-dl, but you don't even need the video, do you?

But you need supervised data too.


I know you can scrape only the audio from YouTube with youtube-dl, but it's somewhat annoying


I use something akin to

    alias downloadmusic='youtube-dl --extract-audio --audio-quality 0 --add-metadata'
in my .bashrc

I find that helps with the annoyance of downloading things off of YT. This is for music, obviously, but there's an option to download subtitles as well.

EDIT: Typed this from memory, there may be errors in the alias.
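
If you'd rather script it, youtube-dl is also importable from Python. A minimal sketch for grabbing audio plus auto-generated captions (the URL is a placeholder; the audio extraction needs ffmpeg installed):

    import youtube_dl

    opts = {
        "format": "bestaudio/best",
        "writeautomaticsub": True,   # auto-generated captions
        "subtitleslangs": ["en"],
        "outtmpl": "%(id)s.%(ext)s",
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])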


    youtube-dl -f bestaudio $URL
Dunno when that went in but it works now.


So do you scrape videos from YouTube with subtitles to collect data?


Vosk supports both Italian and French. The French model was trained by the Linto project; it's a pretty good one.
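
For reference, running a Vosk model from Python only takes a few lines (a sketch, assuming a downloaded model directory and a 16 kHz, 16-bit mono WAV file):

    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-fr")         # path to the downloaded model
    wf = wave.open("utterance.wav", "rb")  # 16 kHz, 16-bit, mono PCM
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    print(rec.FinalResult())               # JSON containing the transcript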


Awesome project - I'm also working on a similar idea for an on-premise ASR server! Any reason you decided to go with RNN-T?


[flagged]


That's called setting expectations. If you know your project might interest people but needs work, why pretend it's good when it's not? They seem to be courting contributors more than users anyway.

I found the video funny. It nicely highlights both the current limitations and the ambition of the project. A bold choice, certainly, but I think it works.


And it's maybe a dig at Macron's accent at the same time :D although the author is a student in Germany. Anyway, you should join the Discord; we discussed this there too...

https://discord.gg/pqTMeP5D3g

