It really would be amazing to be able to get voice recognition software that recognizes even a small but useful fraction of our language without having to reach the cloud. It's definitely a dream I hope we achieve one day. Thanks for the article; I'll test it on my day off and play with it a bit.
PocketSphinx/Sphinx with a small, use-case-specific dictionary showed much better accuracy for my accent and speech defects than any of these cloud-based recognition systems. I used a standard acoustic model, but it probably would have been even more accurate had I trained a custom acoustic model.
For simple use cases like home automation or desktop automation, I think it's a more practical approach than depending on a cloud API.
I haven't tried out PocketSphinx myself... Could you describe the training process? E.g., how long did it take, how much audio did you have to record, and how easy was it to iterate to improve accuracy?
PocketSphinx/Sphinx use three models - an acoustic model, a language model and a phonetic dictionary.
I'm no expert, but as I understand them, the acoustic model converts audio samples into phonemes(?),
the language model contains probabilities of sequences of words, and the phonetic dictionary is a mapping of words to phonemes.
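To make that concrete, wiring the three models together with the Python bindings looks roughly like this (a sketch; the import path and the model/file paths vary by version and install location, so treat them as placeholders):

    from pocketsphinx.pocketsphinx import Decoder   # some versions: from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')                 # acoustic model directory
    config.set_string('-lm', 'model/en-us.lm.bin')           # language model
    config.set_string('-dict', 'model/cmudict-en-us.dict')   # phonetic dictionary
    decoder = Decoder(config)

    # Feed it 16 kHz, 16-bit mono PCM audio and print the best hypothesis.
    decoder.start_utt()
    with open('test.raw', 'rb') as f:
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()
    if decoder.hyp() is not None:
        print(decoder.hyp().hypstr)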
Initially, I just used the standard en-us acoustic model, the generic US English language model, and its associated phonetic dictionary.
This was the baseline for judging accuracy. It was ok, but neither fast nor very accurate (likely due to my accent and speech defects).
I'd say it was about 70% accurate.
Simply reducing the size of the vocabulary boosts accuracy because there is that much less chance of a mistake. It also improves recognition speed.
For each of my use cases (home and desktop automation), I created a plain text file with the relevant command words.
Then I used their online tool [1] to generate a language model and phonetic dictionary from it.
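The corpus file is nothing fancy: just one command phrase per line, something like this (these particular phrases are only an illustration, not my actual list):

    turn on the lights
    turn off the lights
    open browser
    open editor
    shutdown
    suspend

The tool hands back a numbered language model and dictionary (something like 1234.lm and 1234.dic) that can be passed straight to the decoder above, or to pocketsphinx_continuous via its -lm and -dict options.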
For the acoustic model, there are two approaches - "adapting" and "training".
Training is from scratch, while adapting tweaks a standard acoustic model to better match a personal accent, dialect, or speech defects.
I found training as described in [2] rather intimidating, and never tried it out. It is likely to take a lot of time (at least a couple of days, I think, based on my adaptation experience).
Instead I "adapted" the en-us acoustic model [3].
It took about an hour to come up with some grammatically correct text that included all the command words and phrases I wanted.
Then I read it aloud while recording with Audacity. I attempted this multiple times, fiddling around with microphone volume and gain,
trying to block ambient noise (I live in a rather noisy environment), redoing takes until the final one. That took around 8 hours altogether, with breaks.
Finally, generating the adapted acoustic model took about an hour.
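For reference, besides the recordings themselves, the adaptation step wants a list of the recorded files plus a matching transcription file, roughly in this format (hypothetical names and sentences; this is the layout the CMUSphinx adaptation tutorial uses):

    adapt.fileids:
        sentence_0001
        sentence_0002

    adapt.transcription:
        <s> open the editor and check my email </s> (sentence_0001)
        <s> turn off the lights in the living room </s> (sentence_0002)

Each entry corresponds to one 16 kHz, 16-bit mono recording (sentence_0001.wav and so on), which the sphinxtrain tools (sphinx_fe to extract features, then bw and map_adapt) chew through to produce the adapted model.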
The end result: about 95% of the time it understands what I say. About 5% of the time I have to repeat myself, especially with phrases.
I did this on both a desktop and a Raspberry Pi; the Pi is the one managing home automation. I'm happy with it :)
If not confidential, can you describe what kinds of automation you used this for, particularly the desktop automation?
I was interested in automating the transcription of my own voice reminders and other such audio files, recorded on the PC or on a portable voice recorder; hence the earlier trials I did. But at the time nothing worked out well enough, IIRC.
Nothing confidential at all :). I was playing with them because I personally don't like using a keyboard and mouse, and I also have some ideas for making computing easier for handicapped people.
My current desktop automation does command recognition. Commands like "open editor / email / browser", "shutdown", "suspend"... about 20 commands in all. 'pocketsphinx_continuous' is started as a daemon at startup and keeps listening in the background (I'm on Ubuntu); a Python sketch of the same idea is below.
I think from a speech recognition internals point of view transcription is more complex than recognizing these short command phrases. The training or adaptation corpus would have to be much larger than what I used.
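In case it's useful, here is roughly how the same always-listening command loop could look from Python instead of the pocketsphinx_continuous binary (a rough sketch, assuming the pocketsphinx Python package with PyAudio installed; the commands.lm / commands.dic files and the command table are placeholders):

    import subprocess
    from pocketsphinx import LiveSpeech

    # Map recognized phrases (matching the language model) to shell commands.
    COMMANDS = {
        'open browser': ['firefox'],
        'open editor':  ['gedit'],
        'suspend':      ['systemctl', 'suspend'],
    }

    # LiveSpeech reads from the default microphone and yields one
    # hypothesis per detected utterance.
    speech = LiveSpeech(lm='commands.lm', dic='commands.dic')

    for phrase in speech:
        heard = str(phrase).strip().lower()
        action = COMMANDS.get(heard)
        if action:
            subprocess.Popen(action)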
He he, the voice "shutdown" command you mention reminds me of a small assembly language routine that I used to use to reboot MSDOS PCs; it was just a single instruction to jump to the start of the BIOS (cold?) boot entry point, IIRC (JMP F000:FFF0 or something like that). Used to enter it into DOS's DEBUG.COM utility with the A command (for Assemble) and then write it out to disk as a tiny .COM file. (IOW, you did not even need an assembler to create it.)
Then you could reboot the PC just by typing:
REBOOT
at the DOS prompt.
Did all kinds of tricks of the trade like that (and many other kinds) in the earlier DOS and, even more, UNIX days ... Good fun, and useful to customers many a time too, including saving their bacon (aka data) multiple times (with, of course, no backups on their part).
My impression is that the super-accurate stuff like Google's voice recognition and Siri is all fueled by massive amounts of data. You build up these recognition networks from a bunch of data sources and they get better over time, but the recognition is driven more by the data than by the code.
It's the whole "Memory is a process, not a hard drive" thing: voice recognition as it is today is a slowly evolving graph built from input data. You could in theory compress the graph and have it available offline, but it would be hard to chop it up in a way that doesn't completely bust the recognition.
There's actually some research on compressing ANNs to a size where they could be embedded in all sorts of devices. I think I saw something about it on HN a few months back?
> It really would be amazing to be able to get voice recognition software that recognizes even a small but useful fraction of our language without having to reach the cloud.
Well, I guess at some point this functionality will become part of the OS. Once OS X and Windows offer it, Linux can't stay behind, and we will see open source speech recognition libraries.
There are plenty of those. Voice recognition is nothing new—I remember playing around with "speakable items" back in Mac OS 7. It did well enough to recognize certain key words and phrases.
Every Mac with OS X 10.9 or later comes with speech recognition software that works without internet access and is really, really good. You can dictate entire documents and emails, and even have it type commands in the terminal, etc. In OS X 10.11 you can even drive the UI by voice alone.
> It really would be amazing to be able to get voice recognition software that recognizes even a small but useful fraction of our language without having to reach the cloud.
Are there any academic groups working on this topic, and do they have prototype implementations?
Julius [1] is a pretty good offline speech recognition engine. In my tests it seems to get about 95% accuracy with grammar-based models, and it supports continuous dictation. There is also a decent Python module, which supports Python 2, and Python 3 with a few tweaks (a minimal client sketch is below).
HOWEVER:
The only continuous dictation models available for Julius are Japanese, as it is a Japanese project. This is mainly an issue of training data. The VoxForge project is working towards releasing an English model once they get 140 hours of training data (last time I checked they were around 130); but even then the quality is likely to be far below that of commercial speech recognition products, which generally have thousands of hours of training data.
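On the Python side, if you'd rather not pull in an extra dependency, talking to Julius directly also only takes a few lines: start it with the -module option and it listens on TCP port 10500 by default, streaming XML-ish result blocks each terminated by a line containing a single ".". A rough sketch (the parsing is deliberately crude, and the output encoding depends on how Julius is configured):

    import re
    import socket

    # Connect to a Julius instance started with "julius ... -module"
    sock = socket.create_connection(('localhost', 10500))

    buf = ''
    while True:
        data = sock.recv(4096).decode('utf-8', errors='replace')
        if not data:
            break
        buf += data
        # Each result block ends with a line containing only "."
        while '\n.\n' in buf:
            block, buf = buf.split('\n.\n', 1)
            if '<RECOGOUT>' in block:
                # Pull the recognized words out of the WHYPO elements
                # (the list may include sentence markers such as <s> and </s>).
                words = re.findall(r'WORD="([^"]*)"', block)
                print(' '.join(words))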
Julius is my preferred speech recognition engine. I've built an application [0] that enables users to control their Linux desktops with their voice, using Julius to do the heavy lifting.
After a quick look, it seems Julius doesn't use the new deep-learning stuff?
In terms of data, http://www.openslr.org/12/ says it has 300+ hours of speech and text from LibriVox audiobooks. Using LibriVox recordings seems like a great idea for building a freely available large dataset.