That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.
Totally free.
This needs to be much better to make sense, and their own graphs show only marginal improvements in specific scenarios.
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview with a Malawian president. From there, I created a vocabulary composed only of the president's words, which I used almost exclusively when writing his speech.
The results from Whisper are incredible, with very few mistakes, though it did get Nelson Mandela's first name wrong (transcribed as "Nesson"). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (an NVIDIA T1000 G8 GPU). Broadly, here are the steps I used:
* Download and install podman.
* Download and install git.
* Download and install curl.
* Open a command prompt.
* Run the following commands to containerize Whisper:
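In place of the exact container commands, here's a minimal sketch of the transcription step itself, assuming the open-source openai-whisper Python package rather than any particular container image (the filename and model size are placeholders, not the exact setup described above):

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper). "interview.mp3" is a placeholder.
import whisper

# "medium" trades speed for accuracy; "tiny" or "base" run fine without a GPU.
model = whisper.load_model("medium")

# fp16=False avoids a warning when no CUDA device is available.
result = model.transcribe("interview.mp3", fp16=False)

print(result["text"])           # full transcript
for seg in result["segments"]:  # per-segment timestamps
    print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```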
Whisper is great. You can get faster results running the tiny model. I used it for podcast transcription, and it is much faster while the quality is no worse than the medium model - for some podcast episodes the transcription is identical.
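A quick way to sanity-check that trade-off on your own audio (a sketch, again assuming the openai-whisper package; the filename is a placeholder):

```python
import time
import whisper

# Compare wall-clock time and output length for two model sizes.
for size in ("tiny", "medium"):
    model = whisper.load_model(size)
    start = time.time()
    result = model.transcribe("episode.mp3", fp16=False)
    print(f'{size}: {time.time() - start:.0f}s, {len(result["text"])} characters')
```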
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.
For instance, this was added to the transcription of a silent section of audio:
> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel
It makes me wonder how much of Whisper is trained on audio from YouTube, which was transcribed by this model.
Similar experience: mine would turn background noise, when no one was talking, into random repeated Japanese words. I was using the large model. I ended up fixing it by using the medium.en model and setting condition_on_previous_text to false.
Now if only the timestamp timings were correct for noisy audio... I've tried stable whisper and another fork whose name I forget, but I need to run the audio through RTX Voice if I want consistent timestamps...
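condition_on_previous_text is a parameter of the open-source openai-whisper package's transcribe() call; here's a minimal sketch of that fix, plus one way to drop likely-silent segments (the no_speech_prob filtering and its 0.6 threshold are my addition, not something from the comments above; the filename is a placeholder):

```python
import whisper

model = whisper.load_model("medium.en")

# condition_on_previous_text=False stops earlier (possibly hallucinated)
# output from steering the decoder on later segments.
result = model.transcribe(
    "noisy_audio.wav",
    condition_on_previous_text=False,
    fp16=False,
)

# Each segment carries a no_speech_prob; discarding high-probability-silence
# segments is one way to filter hallucinated text from quiet sections.
for seg in result["segments"]:
    if seg["no_speech_prob"] < 0.6:
        print(f'[{seg["start"]:.1f}s -> {seg["end"]:.1f}s] {seg["text"]}')
```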
I wonder if that's specifically a result of training it on YouTube videos where the audio and subtitles don't actually match (recipe videos with no speech that have additional text added in the subtitles).
I'm imagining the near future will see a portable, fast streaming model for real-time voice translation, piped into text-to-speech. Hook it up to an earpiece and you've got a real-life Babelfish.
Whisper doesn’t do streaming transcription or translation. It’s one of its architectural disadvantages over more commonly used ASR approaches, such as transducers (see e.g. Microsoft Research’s system trained on ~350k hours that does streaming transcription and translation: https://arxiv.org/abs/2211.02499).
Whisper is incredible, and it's bananas that they give it away for free. It is able to decipher English with the thickest of accents, often better than me.
It's not even possible for me to verify the accuracy claims of this closed model with limited API access.
I really don't care about closed API models of anything that has a good/usable open source version. Whisper works well enough, I'm never going to follow up on this USM research or use it. The only reason to pay for the API access would be for some super niche language. And if Google is paywalling this only for the few customers who need it for use in terribly under-represented communities ... that's a kind of douchebaggery all its own.
The only reason people are paying for OpenAI's GPT-4 is because there's literally no usable open-source LLM. The instant a "good enough" one exists, OpenAI's revenues will drop by >95%.
Hopefully Google will at least use this in Google Home, because its current speech recognition is still bad enough to notice.
I mean, some people--like myself--are already paying for Google's multi-language speech recognition API, and have been for years, so the idea that there is a new, even-better model for it sounds cool to me. My primary annoyance is that this is Google, so of course they aren't going to put in even the minimal effort to make this a new backend for their existing ridiculously-insane API, which I had to build a miserable, slightly-custom HTTP/2 stack to access :/.
Regardless, I don't want to use the API, but I'm working with public information anyway; and so, while I have considered moving to Whisper now that that's an option, it hasn't been a priority and it isn't clear to me that Whisper is good at random non-English languages anyway.
On the YouTube captions dataset, it says the English (US) WER is nearly 15%. On those 18 languages, it's nearly 20%. How does that match up with 32% relative being one per thousand or so?
In the paper itself[0], they do have a couple of comparisons for English-only (Fig. 2, Table 3), and it looks like Whisper has an 11.5% word error rate and USM has a 10.5% word error rate. That's a truly negligible difference. There's no way I'd ever pay for an API for this if I knew I only cared about English.
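For concreteness, the back-of-the-envelope on those two numbers (just restating the English-only figures quoted above):

```python
whisper_wer, usm_wer = 0.115, 0.105  # English-only figures quoted above

absolute_gap = whisper_wer - usm_wer       # 0.010 -> ~10 words per 1,000
relative_gap = absolute_gap / whisper_wer  # ~8.7% relative reduction

print(f"absolute: {absolute_gap:.3f} ({absolute_gap * 1000:.0f} errors per 1,000 words)")
print(f"relative: {relative_gap:.1%}")
```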
I know that's not the point of this model (the point is that, for a lot of languages, it's the only model available). But paywalling it seems greedy; you'll only extract money from those under-represented communities. On the other hand, maybe this never would have been built without the profit motive. Idk. I wish we could fund these things as "basic science research" without a need for direct profit, and let the positive externalities pay us back down the road.