
We recently did a comparative analysis of cloud speech-to-text providers for a project. We looked at:

1. Google Cloud Speech API

2. Microsoft Bing Speech API

3. IBM Watson Speech to Text

The ranking was as listed above, but we had real challenges working with call-center audio recordings. The audio quality was less than ideal but still very clear. We saw a huge reduction in accuracy compared to in-browser testing. Additionally, Australian English in particular is far from solved.
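For anyone wanting to put a number on "reduction in accuracy": comparisons like this are usually expressed as word error rate (WER), the word-level edit distance between a reference transcript and the API output, divided by the reference length. Below is a minimal sketch, assuming lowercase whitespace tokenization with no punctuation handling. It is not necessarily how the comparison above was scored, and `word_error_rate` is just an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are lowercase whitespace-separated words; real
    scoring tools usually also normalize punctuation and numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Running the same reference transcripts against each provider's output gives one comparable figure per provider and per audio source (browser microphone vs. call-center recording).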

Because Google's API doesn't currently do speaker detection, we looked at using Watson's speaker detection as a secondary step but found it too complex and error-prone. There is definitely room for a startup in this area, and it also needs continued investment from the bigger cloud providers.
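To illustrate what such a secondary step involves: one service returns words with timestamps, the other returns speaker segments with timestamps, and the two have to be merged by time overlap. The sketch below assumes hypothetical tuple shapes, `(word, start, end)` and `(speaker, start, end)` in seconds, rather than either API's actual response format, and `assign_speakers` is an illustrative name.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]      # (text, start_sec, end_sec), hypothetical shape
Segment = Tuple[str, float, float]   # (speaker, start_sec, end_sec), hypothetical shape


def assign_speakers(
    words: List[Word], segments: List[Segment]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose segment overlaps it most.

    Words that overlap no segment are labelled None.
    """
    labelled = []
    for text, w_start, w_end in words:
        best_speaker = None
        best_overlap = 0.0
        for speaker, s_start, s_end in segments:
            # Length of the time interval shared by the word and the segment.
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_overlap = overlap
                best_speaker = speaker
        labelled.append((text, best_speaker))
    return labelled


if __name__ == "__main__":
    words = [("hello", 0.0, 0.4), ("hi", 0.5, 0.8), ("there", 2.0, 2.3)]
    segments = [("agent", 0.0, 0.45), ("caller", 0.45, 1.0)]
    print(assign_speakers(words, segments))
```

Even in this simplified form the weak points are visible: the two services segment the audio differently, their timestamps drift relative to each other, and overlapping speech has no clean answer, which fits the "complex and error-prone" experience described above.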



For Watson Speech to Text: did you choose the model that matches your source audio quality? It defaults to a "Broadband" model intended for high-quality audio sources, but you can also select "Narrowband" for things like phone-quality audio. I'm not guaranteeing a difference, but in my experience, matching the model to the source quality makes some difference.
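A quick way to check which model fits is to read the sample rate from the recording itself. The sketch below uses only Python's standard `wave` module. The 16 kHz cutoff and the `en-US_BroadbandModel` / `en-US_NarrowbandModel` identifiers are my recollection of Watson's conventions and should be treated as assumptions to verify against the service's current model list; `pick_watson_model` and `wav_sample_rate` are illustrative helper names.

```python
import wave

# Assumption: broadband models expect audio sampled at 16 kHz or higher,
# narrowband models expect telephone-grade audio (typically 8 kHz).
BROADBAND_MIN_HZ = 16000


def wav_sample_rate(source) -> int:
    """Return the sample rate in Hz of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def pick_watson_model(sample_rate_hz: int, language: str = "en-US") -> str:
    """Choose a model identifier from the sample rate.

    Assumption: identifiers follow the '<language>_BroadbandModel' and
    '<language>_NarrowbandModel' pattern.
    """
    if sample_rate_hz >= BROADBAND_MIN_HZ:
        return f"{language}_BroadbandModel"
    return f"{language}_NarrowbandModel"


if __name__ == "__main__":
    # Typical call-center recording: 8 kHz telephone audio.
    print(pick_watson_model(8000))
```

The returned string would then be passed as the model parameter of the recognition request. Note that a call-center recording that was upsampled to 16 kHz before storage still carries only telephone-band content, so the header alone is not always a reliable guide.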

I've not compared them extensively, but for real-time streaming I found that Watson beat the Google API for a specific use case. Your mileage may vary!

They also provide a handy Mic / File reader interface for browsers: https://github.com/watson-developer-cloud/speech-javascript-...


Here are some benchmarks on telephone speech, including both APIs and human transcription services:

https://remeeting.com/app/benchmarks

Google actually did pretty badly for us on extended telephone speech. Not sure why.



