FWIW, IBM has a wonderful speech to text API...I've put together a repo of examp...

pbw · on Feb 25, 2016

It took me a while to understand what you did here. I was waiting for some kind of subtitles showing the recognition ability.

But you are saying you performed speech recognition on the full video, then edited it according to where the words you targeted were found. I liked the bomb/terrorist one; the others didn't seem to be "saying" anything.

danso · on Feb 25, 2016

Yeah, I was a bit lazy...I could have used moviepy (which I currently use, but merely as a wrapper around ffmpeg) to add subtitles showing which word was identified...I'm hoping to make this into a command-line tool for myself to quickly transcribe things...though making supercuts is just a fun way to demonstrate the concepts.

The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they're optimized for user commands) and tokenizes it...what you get is a timestamp for when each recognized word appears, as well as a confidence level and alternatives if you so specify. Other speech-transcription options don't always provide this...I don't think PocketSphinx does, for example. Nor does sending your audio to an mTurk-based transcription service.
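To make that concrete, here's a rough sketch of pulling per-word timestamps out of a response and finding every occurrence of a target word. The sample data is made up, and the JSON shape (a `results` list whose `alternatives` carry parallel `timestamps` and `word_confidence` lists) is an assumption about how such a response is laid out, not a copy of real output. The helper names `iter_words` and `find_word` are hypothetical.

```python
import json

# Made-up sample; the field layout is an assumption, not real API output.
SAMPLE = json.loads("""
{
  "results": [
    {
      "final": true,
      "alternatives": [
        {
          "transcript": "the bomb was found ",
          "confidence": 0.91,
          "timestamps": [["the", 0.0, 0.2], ["bomb", 0.2, 0.7],
                         ["was", 0.7, 0.9], ["found", 0.9, 1.4]],
          "word_confidence": [["the", 0.99], ["bomb", 0.85],
                              ["was", 0.97], ["found", 0.93]]
        }
      ]
    }
  ]
}
""")


def iter_words(response):
    """Yield one dict per recognized word from the top alternative of each result."""
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]
        stamps = best.get("timestamps", [])
        confs = best.get("word_confidence", [])
        for i, (word, start, end) in enumerate(stamps):
            # Confidence is matched by position; it is None when not requested.
            conf = confs[i][1] if i < len(confs) else None
            yield {"word": word, "start": start, "end": end, "confidence": conf}


def find_word(response, target, min_confidence=0.0):
    """Return every occurrence of `target` at or above `min_confidence`."""
    target = target.lower()
    return [
        w for w in iter_words(response)
        if w["word"].lower() == target
        and (w["confidence"] is None or w["confidence"] >= min_confidence)
    ]


hits = find_word(SAMPLE, "bomb", min_confidence=0.5)
for hit in hits:
    print(hit["word"], hit["start"], hit["end"], hit["confidence"])
```

The start/end pairs are what you'd hand to a video tool to cut out each clip.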

Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:

https://github.com/dannguyen/watson-word-watcher/tree/master...
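For anyone wondering what "simplified CSV" could look like, here's a minimal sketch of flattening word-level results into rows. The column names and the input shape are my own assumptions for illustration, not necessarily the layout used in the repo, and `words_to_csv` is a hypothetical helper.

```python
import csv
import io


def words_to_csv(response):
    """Flatten the top alternative of each result into word,start,end,confidence rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "start", "end", "confidence"])
    for result in response.get("results", []):
        # Only the first (highest-ranked) alternative is kept.
        for alt in result.get("alternatives", [])[:1]:
            confs = alt.get("word_confidence", [])
            for i, (word, start, end) in enumerate(alt.get("timestamps", [])):
                # Leave the confidence cell empty when it was not returned.
                conf = confs[i][1] if i < len(confs) else ""
                writer.writerow([word, start, end, conf])
    return buf.getvalue()


# Made-up input in the assumed shape.
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "timestamps": [["omar", 1.5, 1.9], ["coming", 1.9, 2.4]],
                    "word_confidence": [["omar", 0.8], ["coming", 0.95]],
                }
            ]
        }
    ]
}

csv_text = words_to_csv(sample)
print(csv_text)
```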

th-ai · on Feb 25, 2016

YouTube speech recognition is getting quite good, at least for talking heads in English. Are there additional top-tier APIs other than IBM's?

danso · on Feb 25, 2016

AT&T has its own "Watson"...but it requires signing up for a premium account, which I think involves an upfront cost:

http://developer.att.com/apis/speech

Twilio has one that also requires payment:

https://www.twilio.com/docs/api/rest/transcription

It limits input audio to 2 minutes. And I would have to guess that its model is specifically tuned to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.