
I'm using NixOS with i3 as my daily driver, can recommend.


ESP32-based audio boards are quite under the radar and have lovely features for just a few bucks. Running LibreASR on a Raspberry Pi should also be feasible soon.

Thank you for your kind words! :)


Yes, probably. The data I trained on mostly reflects UK and US accents.


IBM kept their US and UK models separate. That may have been for historical reasons or due to dataset size.

As an FYI, I was told “the money” was in specific “dictionaries” for medical professionals and so forth. Apparently, doctors liked to dictate straight into text. Might be worth chasing that $$$€€€£££?


Hey blackcat! Your project [0] helped me a lot! Pre-training the encoder sounds great, I'll maybe add it in the future.

[0] https://github.com/theblackcat102/Online-Speech-Recognition


I have not yet trained a French model. Also, the gif shows Macron speaking to Congress with his English accent [0].

[0] https://www.youtube.com/watch?v=RqUc1h7bZQ4


Right, fixed it, thank you :D


As I commented above, very poorly. It's still early days.


LibriSpeech, Tatoeba, Common Voice and scraped YouTube videos.


Do you get good results when adding scraped YouTube audio? My model's performance on LibriSpeech dev drops a bit when I add YouTube audio to the training dataset (my guess is that this is due to poor alignment from auto-generated captions).


I haven't trained on LibriSpeech exclusively, but yes, the performance on LibriSpeech dev is quite bad, around 60 WER. If the poor alignment of YouTube captions is the issue, maybe concatenating multiple samples helps a bit.
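
To illustrate what I mean by concatenation, here's a minimal numpy sketch (clips, texts, and the length cap are hypothetical placeholders; it assumes all clips share one sample rate):

    import numpy as np

    def concat_samples(clips, texts, max_seconds=20, sr=16000):
        # Merge consecutive (audio, transcript) pairs into longer samples;
        # misalignments at the clip boundaries get diluted over a longer utterance.
        merged, buf_audio, buf_text = [], [], []
        for clip, text in zip(clips, texts):
            buf_audio.append(clip)
            buf_text.append(text)
            if sum(len(a) for a in buf_audio) >= max_seconds * sr:
                merged.append((np.concatenate(buf_audio), " ".join(buf_text)))
                buf_audio, buf_text = [], []
        if buf_audio:
            merged.append((np.concatenate(buf_audio), " ".join(buf_text)))
        return merged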


You should consider realignment; maybe start with something like DSAlign or my wav2train project.


Would it be possible to train on any of the more recent text-to-speech engines out there? Some of them are very realistic.

This would give you absolutely perfect sync down to the word, I assume... I don't know about the cost if you paid rate card though; perhaps you could do some partnership with them, since yours is a symmetrical product.
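
A minimal sketch of the idea, using the offline pyttsx3 engine purely as a stand-in (a commercial neural TTS would sound far more realistic; the sentence list is a placeholder corpus):

    import pyttsx3

    # Generate perfectly aligned (transcript, audio) training pairs from TTS.
    engine = pyttsx3.init()
    sentences = ["hello world", "speech recognition is hard"]
    for i, text in enumerate(sentences):
        engine.save_to_file(text, f"tts_{i:06d}.wav")
    engine.runAndWait()
    # Each wav now matches its transcript exactly, so no realignment is needed.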


The upper transcript is YouTube's automatic transcription. Below is the web app transcribing live. And yes, it is actually missing a few words.


Hey HN!

I've been working on this for a while now. While there are other on-premise solutions using older models such as DeepSpeech [0], I haven't found a deployable project supporting multiple languages that uses the recent RNN-T architecture [1] (see the sketch after the links below).

Please note that this does not achieve SotA performance. Also, I've only trained it on one GPU so there might be room for improvement.

Edit: Don't expect good performance :D this is still in early-stage development. I am looking for contributors :)

[0] https://github.com/mozilla/DeepSpeech

[1] https://arxiv.org/abs/1811.06621
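
For anyone curious what RNN-T means in practice, here's a rough PyTorch sketch of the joint network at the heart of [1] (dimensions and naming are illustrative, not the actual LibreASR code):

    import torch
    import torch.nn as nn

    class RNNTJoint(nn.Module):
        # Combines the audio encoder output (one vector per frame) with the
        # prediction network output (one vector per emitted label) and yields
        # logits over the vocabulary plus a blank token.
        def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
            super().__init__()
            self.enc_proj = nn.Linear(enc_dim, joint_dim)
            self.pred_proj = nn.Linear(pred_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for blank

        def forward(self, enc, pred):
            # enc: (B, T, enc_dim), pred: (B, U, pred_dim)
            # Broadcast-add over the whole (T, U) lattice.
            joint = self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
            return self.out(torch.tanh(joint))  # (B, T, U, vocab_size + 1)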


You can also check out https://github.com/TensorSpeech/TensorFlowASR for inspiration (not my project, not involved). It implements streaming transformers and conformer RNN-T (but in TF2), with on-device deployment via TFLite. So far there aren't many usable pretrained models available (just LibriSpeech), but with some work it could turn out quite nicely.
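
The TFLite route is roughly as follows (a sketch, assuming you already have a SavedModel export; the SELECT_TF_OPS fallback is often needed for recurrent layers):

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("asr_saved_model")
    # Fall back to select TF ops for layers without native TFLite kernels.
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,
    ]
    tflite_model = converter.convert()
    with open("asr_model.tflite", "wb") as f:
        f.write(tflite_model)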


Hi! What would you need to implement another language, e.g. Italian or French? I mean: is the problem a lack of data, or something else?

Another question: could you use, for example, Mozilla's voice data to train/test?


Data and compute are the largest hurdles. I only have one GPU and training one model takes 3+ days, so I am limited by that. Also, scraping from YouTube takes time and a lot of storage (multiple TBs).

Mozilla Common Voice data is already used for training.


Thanks for sharing this project. What do you think of the Mozilla Common Voice data? The random sampling I looked at a while back seemed pretty poor -- background noise, stammering, delays before the speaking begins, etc.

I was hoping to use it as a good training base, but the issues I encountered made me wary that the data quality would adversely affect any outcomes.


Depending on your objective, noisy data might be useful. I'd like LibreASR to also work in noisy environments, so training on noisy data should already help a bit with that. But yeah - stammering and delays are present not only in Common Voice but also in Tatoeba and YouTube.
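
If your data is too clean, you can also synthesize noise robustness. A minimal numpy sketch of mixing background noise into a clip at a target SNR (the speech/noise arrays are placeholders):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        # Loop or trim the noise to match the speech length.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)[: len(speech)]
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-10
        # Scale noise so 10 * log10(speech_power / scaled_noise_power) == snr_db.
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise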


There is also a lot of audiobook data in many languages that is easy to scrape and align using a basic model adapted for each language. Alternatively, you can use YouTube videos with subtitles that are almost aligned to bootstrap a first version of the model, then realign.


For the compute problem: maybe you could use a GPU-powered cloud server such as https://www.paperspace.com/. I don't know the current prices, but I remember it was quite affordable.


> I remember it was quite affordable.

Relative to what? Paperspace is one of the costlier GPU providers.


Okay, you are right, but it's also really performant, so imho you can get a lot of work done in less time.

For something cheaper, I read this post on Reddit:

https://amp.reddit.com/r/devops/comments/dqh09n/cheapest_clo...


Performant? It's the same GPU..?


Why does it take a lot of data? Afaik you can select a lower quality in youtube-dl, but you don't even need the video, do you?


> Why does it take a lot of data? Afaik you can select a lower quality in youtube-dl, but you don't even need the video, do you?

But you need supervised data too.


I know you can scrape only the audio from YouTube with youtube-dl, but it's somewhat annoying


I use something akin to

    alias downloadmusic='youtube-dl --extract-audio --audio-quality 0 --add-metadata'
in my .bashrc

I find that helps with the annoyance of downloading things off of YT. This is for music, obviously, but there's an option to download subtitles as well.

EDIT: Typed this from memory, there may be errors in the alias.
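
If you'd rather script it, youtube-dl is also importable from Python. A minimal sketch for grabbing audio plus auto-generated captions (the URL is a placeholder; the audio extraction needs ffmpeg installed):

    import youtube_dl

    opts = {
        "format": "bestaudio/best",
        "writeautomaticsub": True,   # auto-generated captions
        "subtitleslangs": ["en"],
        "outtmpl": "%(id)s.%(ext)s",
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])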


    youtube-dl -f bestaudio $URL
Dunno when that went in but it works now.


So do you scrape videos from YouTube with subtitles to collect data?


Vosk supports both Italian and French. The French model was trained by the Linto project; it's a pretty good one.
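
For reference, running a Vosk model from Python only takes a few lines (a sketch, assuming a downloaded model directory and a 16 kHz, 16-bit mono WAV file):

    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-fr")         # path to the downloaded model
    wf = wave.open("utterance.wav", "rb")  # 16 kHz, 16-bit, mono PCM
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    print(rec.FinalResult())               # JSON containing the transcript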


Awesome project - I'm also working on a similar idea for an on-premise ASR server! Any reason you decided to go with RNN-T?


[flagged]


That's called setting expectations. If you know your project might interest people but needs work, why pretend it's good when it's not? They seem to be courting contributors more than users anyway.

I found the video funny. It nicely highlights both the current limitations and the ambition of the project. A bold choice, certainly, but I think it works.


And it's maybe a dig at Macron's accent at the same time :D although the author is a student in Germany. Anyway, you should join the Discord; we discussed this there too...

https://discord.gg/pqTMeP5D3g

