We recently did a comparative analysis of cloud speech-to-text providers for a project. We looked at:
1. Google Cloud Speech API
2. Microsoft Bing Speech API
3. IBM Watson Speech to Text
The ranking was as listed above, but we had real challenges working with call-center audio recordings. The audio quality was less than ideal but still very clear. We saw a huge reduction in accuracy compared to in-browser testing. Additionally, Australian English in particular is far from solved.
Because Google's API doesn't currently do speaker detection, we looked at using Watson's speaker detection as a secondary step but found it too complex and error-prone. There is definitely room for a startup in this area, and it also needs continued investment from the bigger cloud providers.
For Watson Speech to Text: did you choose the correct model to match your source audio quality? They default to a "Broadband" model intended for high-quality audio sources, but you can also select "Narrowband" for things like phone-quality audio. I can't guarantee a difference, but in my experience, matching the model to the source quality makes some difference.
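As a rough sketch of what "matching the model to the source" could look like: the snippet below reads the sample rate from a WAV file and picks a model name from it. The helper names (`wav_sample_rate`, `pick_watson_model`) are made up for illustration, and the 16 kHz cutoff and the `en-US_NarrowbandModel` / `en-US_BroadbandModel` naming are my understanding of Watson's conventions, so check them against the current model list before relying on this.

```python
import io
import wave


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate (Hz) from the header of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.getframerate()


def pick_watson_model(sample_rate_hz: int, language: str = "en-US") -> str:
    """Hypothetical helper: choose a model name from the sample rate.

    Assumption: audio sampled at 16 kHz or above suits the Broadband model,
    and anything below that (e.g. 8 kHz telephone audio) suits Narrowband.
    """
    kind = "Broadband" if sample_rate_hz >= 16000 else "Narrowband"
    return f"{language}_{kind}Model"
```

The resulting string would then go in the `model` parameter of the recognition request. Call-center recordings are typically 8 kHz, so they would land on the Narrowband model here.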
I've not compared them extensively, but for real-time streaming I found that Watson beat the Google API for a specific use case. Your mileage may vary!