That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.
Totally free.
This needs to be much better to make sense, and their own graphs show only marginal improvements in specific scenarios.
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview with a Malawian president. From there, I created a vocabulary composed only of the president's words, which I used almost exclusively when writing his speech.
The results from Whisper are incredible, with very few mistakes, though it did get Nelson Mandela's first name wrong (transcribed as "Nesson"). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (an NVIDIA T1000 G8 GPU). Broadly, here are the steps I used:
* Download and install podman.
* Download and install git.
* Download and install curl.
* Open a command prompt.
* Run the following commands to containerize Whisper:
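In place of the exact container commands, here's a minimal sketch of the transcription step itself, assuming the open-source openai-whisper Python package rather than any particular container image (the filename and model size are placeholders, not the exact setup described above):

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper). "interview.mp3" is a placeholder.
import whisper

# "medium" trades speed for accuracy; "tiny" or "base" run fine without a GPU.
model = whisper.load_model("medium")

# fp16=False avoids a warning when no CUDA device is available.
result = model.transcribe("interview.mp3", fp16=False)

print(result["text"])           # full transcript
for seg in result["segments"]:  # per-segment timestamps
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```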
Whisper is great. You can get faster results running the tiny model. I used it for podcast transcription, and it is much faster while the quality is no worse than the medium model - for some podcast episodes the transcription is identical.
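A quick way to sanity-check that trade-off on your own audio (a sketch, again assuming the openai-whisper package; the filename is a placeholder):

```python
import time
import whisper

# Compare wall-clock time and output length for two model sizes.
for size in ("tiny", "medium"):
    model = whisper.load_model(size)
    start = time.time()
    result = model.transcribe("episode.mp3", fp16=False)
    print(f'{size}: {time.time() - start:.0f}s, {len(result["text"])} characters')
```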
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.
For instance, this was added to the transcription of a silent section of audio:
> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel
It makes me wonder how much of Whisper is trained on audio from YouTube, which was transcribed by this model.
Similar experience: mine would turn background noise, when no one was talking, into random repeated Japanese words. I was using the large model. I ended up fixing it by using the medium.en model and setting condition_on_previous_text to false.
Now if only the timestamp timings were correct for noisy audio... I've tried stable whisper and another fork whose name I forget, but I need to run the audio through RTX Voice if I want consistent timestamps...
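condition_on_previous_text is a parameter of the open-source openai-whisper package's transcribe() call; here's a minimal sketch of that fix, plus one way to drop likely-silent segments (the no_speech_prob filtering and its 0.6 threshold are my addition, not something from the comments above; the filename is a placeholder):

```python
import whisper

model = whisper.load_model("medium.en")

# condition_on_previous_text=False stops earlier (possibly hallucinated)
# output from steering the decoder on later segments.
result = model.transcribe(
    "noisy_audio.wav",
    condition_on_previous_text=False,
    fp16=False,
)

# Each segment carries a no_speech_prob; discarding high-probability-silence
# segments is one way to filter hallucinated text from quiet sections.
for seg in result["segments"]:
    if seg["no_speech_prob"] < 0.6:
        print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```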
I wonder if that's specifically a result of training it on YouTube videos where the audio and subtitles don't actually match (recipe videos with no speech that have additional text added in the subtitles).
I'm imagining the near future will see a portable, fast streaming model for real-time voice translation, piped into text-to-speech. Hook it up to an earpiece and you've got a real-life Babelfish.
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
Whisper is incredible, and it's bananas that they give it away for free. It is able to decipher English with the thickest of accents, often better than me.
It's not even possible for me to verify the accuracy claims of this closed model with limited API access.
I really don't care about closed API models of anything that has a good/usable open source version. Whisper works well enough, I'm never going to follow up on this USM research or use it. The only reason to pay for the API access would be for some super niche language. And if Google is paywalling this only for the few customers who need it for use in terribly under-represented communities ... that's a kind of douchebaggery all its own.
The only reason people are paying for OpenAI's GPT-4 is because there's literally no usable open-source LLM. The instant a "good enough" one exists, OpenAI's revenues will drop by >95%.
Hopefully Google will at least use this in Google Home, because its current speech recognition is still bad enough to notice.
I mean, some people--like myself--are already paying for Google's multi-language speech recognition API, and have been for years, so the idea that there is a new, even-better model for it sounds cool to me. My primary annoyance is that this is Google, so of course they aren't going to put in even the minimal effort to make this a new backend for their existing ridiculously-insane API, which I had to build a miserable, slightly-custom HTTP/2 stack to access :/.
Regardless, I don't want to use the API, but I'm working with public information anyway; and so, while I have considered moving to Whisper now that that's an option, it hasn't been a priority and it isn't clear to me that Whisper is good at random non-English languages anyway.
On the YouTube captions dataset, it says the English (US) WER is nearly 15%. On those 18 languages, it's nearly 20%. How does that match up with 32% relative being one per thousand or so?
In the paper itself[0], they do have a couple of comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% word error rate and USM has a 10.5% word error rate. That's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
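For concreteness, the back-of-the-envelope on those two numbers (just restating the English-only figures quoted above):

```python
whisper_wer, usm_wer = 0.115, 0.105  # English-only figures quoted above

absolute_gap = whisper_wer - usm_wer       # 0.010 -> ~10 words per 1,000
relative_gap = absolute_gap / whisper_wer  # ~8.7% relative reduction

print(f"absolute: {absolute_gap:.3f} ({absolute_gap * 1000:.0f} errors per 1,000 words)")
print(f"relative: {relative_gap:.1%}")
```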
I know that's not the point of this model (the point is that, for a lot of languages, it's the only model available). But paywalling it seems greedy; you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit, and let the positive externalities pay us back down the road.