
FWIW, IBM has a wonderful speech-to-text API...I've put together a repo of examples and Python code:

https://github.com/dannguyen/watson-word-watcher

One of the great things about it is the word-level timestamp and confidence data it returns...here are a few supercuts I've made from the presidential primary debates:

https://www.youtube.com/watch?v=VbXUUSFat9w&list=PLLrlUAN-Lo...

It's not perfect by any means, but the granular results give you a place to start from...here's a supercut of cuss words from a well-known episode of The Wire...Watson heard only 59 such words, even though one scene alone contains 30+ F-bombs:

https://www.youtube.com/watch?v=muP5aH1aWUw&feature=youtu.be

The service is free for the first 1000 minutes each month.



It took me a while to understand what you did here. I was waiting for some kind of subtitles showing the recognition ability.

But you are saying you performed speech recognition on the full video then edited it according to where the words you targeted were found. I liked the bomb/terrorist one, the others didn't seem to be "saying" anything.
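To make that process concrete, here is a minimal sketch of the "find the words, then cut around them" step. It is not the code from the repo: `find_clips`, `pad`, and `merge_gap` are hypothetical names, and it assumes the recognizer output has already been reduced to `(word, start, end)` tuples sorted by start time, in seconds.

```python
def find_clips(words, targets, pad=0.25, merge_gap=0.5):
    """Return (start, end) clip ranges, in seconds, around each target word.

    words: iterable of (text, start_sec, end_sec), assumed sorted by start.
    targets: words to look for (matched case-insensitively).
    pad: seconds of context kept on each side of a hit (an assumption).
    merge_gap: hits closer together than this are merged into one clip.
    """
    wanted = {t.lower() for t in targets}
    clips = []
    for text, start, end in words:
        if text.lower() not in wanted:
            continue
        lo, hi = max(0.0, start - pad), end + pad
        if clips and lo - clips[-1][1] <= merge_gap:
            # Overlaps or nearly touches the previous clip: extend it.
            clips[-1] = (clips[-1][0], max(clips[-1][1], hi))
        else:
            clips.append((lo, hi))
    return clips


words = [
    ("the", 0.0, 0.2),
    ("bomb", 1.0, 1.4),
    ("bomb", 1.6, 2.0),
    ("terrorist", 10.0, 10.5),
]
for clip_start, clip_end in find_clips(words, {"bomb", "terrorist"}):
    print(f"{clip_start:.2f}s to {clip_end:.2f}s")
```

Each resulting range can then be handed to whatever does the cutting (ffmpeg, moviepy, etc.) and the pieces concatenated into the supercut.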


Yeah, I was a bit lazy...I could have used moviepy (which I currently use, but merely as a wrapper around ffmpeg) to add subtitles showing which word was identified in each clip...I'm hoping to make this into a command-line tool for myself to quickly transcribe things...though making supercuts is just a fun way to demonstrate the concepts.

The important takeaway is that the Watson API parses a stream of spoken audio (other services, such as Microsoft's Oxford, work only on 10-second chunks, i.e. they are optimized for user commands) and tokenizes it...what you get is a timestamp for when each recognized word appears, as well as a confidence level and alternatives if you so specify. Other speech-transcription options don't always provide this...I don't think PocketSphinx does, for example, and neither would sending your audio to an mTurk-based transcription service.
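As a rough illustration of what "a timestamp and a confidence level per word" looks like in practice, here is a minimal sketch that walks a response and yields one record per word. The JSON shape (`results`, `alternatives`, `timestamps`, `word_confidence`) is an assumption based on how the Watson response is commonly laid out, the sample data is made up, and `iter_words` is a hypothetical helper, so check it against real output before relying on it.

```python
import json

# Made-up sample in the assumed response shape; not real Watson output.
SAMPLE = """
{"results": [{"final": true, "alternatives": [{
    "transcript": "hello world ",
    "confidence": 0.94,
    "timestamps": [["hello", 0.1, 0.5], ["world", 0.6, 1.0]],
    "word_confidence": [["hello", 0.99], ["world", 0.87]]
}]}]}
"""


def iter_words(response):
    """Yield one dict per recognized word from the top alternative of each result."""
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = alternatives[0]  # assumed: alternatives are ordered best-first
        timestamps = best.get("timestamps", [])
        confidences = best.get("word_confidence", [])
        for i, (word, start, end) in enumerate(timestamps):
            # Confidence is optional in this sketch; fall back to None.
            conf = confidences[i][1] if i < len(confidences) else None
            yield {"word": word, "start": start, "end": end, "confidence": conf}


for row in iter_words(json.loads(SAMPLE)):
    print(row)
```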

Here's a little more detail about The Wire transcription, along with the JSON that Watson returns, and a simplified CSV version of it:

https://github.com/dannguyen/watson-word-watcher/tree/master...
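For anyone who wants to produce a similar flat file themselves, here is a minimal sketch using only the standard library. The column names (`word`, `start`, `end`, `confidence`) and the `words_to_csv` helper are assumptions for illustration; the CSV in the repo may be laid out differently.

```python
import csv
import io

# Hypothetical column layout; adjust to match whatever you actually need.
FIELDS = ["word", "start", "end", "confidence"]


def words_to_csv(rows):
    """Serialize per-word dicts (keys matching FIELDS) to a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(words_to_csv([
    {"word": "hello", "start": 0.1, "end": 0.5, "confidence": 0.99},
    {"word": "world", "start": 0.6, "end": 1.0, "confidence": 0.87},
]))
```

A flat file like this is easier to grep or load into a spreadsheet than the nested JSON when you are hunting for specific words.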


YouTube speech recognition is getting quite good, at least for talking heads in English. Are there additional top-tier APIs other than IBM's?


AT&T has its own "Watson"...but it requires signing up for a premium account, which I think involves an upfront cost:

http://developer.att.com/apis/speech

Twilio has one that also requires payment:

https://www.twilio.com/docs/api/rest/transcription

It limits input audio to 2 minutes. And I would have to guess that its model is specifically tuned to phone messages, i.e. one speaker, relatively clear and focused audio, and certain probabilities of phrases.



