
Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit.


The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

    queue
    
         The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"


so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
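The finalization rule described above can be sketched in a few lines. This is a toy illustration of the idea (commit a word once all N hypotheses agree on it), not any particular service's implementation:

```python
# Toy sketch of "best of N" finalization: keep N running hypotheses and
# commit words from the front as long as all N hypotheses agree on them.

def finalized_prefix(hypotheses: list[list[str]]) -> list[str]:
    """Return the longest word prefix shared by all N hypotheses."""
    committed = []
    for words in zip(*hypotheses):
        if all(w == words[0] for w in words):
            committed.append(words[0])
        else:
            break  # first disagreement: everything after stays tentative
    return committed

hyps = [
    ["ice", "cream", "is", "the", "best"],
    ["ice", "cream", "is", "a", "treat"],
    ["ice", "cream", "is", "the", "best"],
]
print(finalized_prefix(hyps))  # ['ice', 'cream', 'is']
```

Everything after the first disagreement remains editable, which is exactly why these systems can revise earlier output as more audio arrives.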


The right way to do this would be to use longer, overlapping chunks.

E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).

This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
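The scheduling part of the overlapping-window scheme above is simple to sketch. The window arithmetic below is my own illustration of the parent comment's proposal (3s step, 15s lookback); a real implementation would feed each span to Whisper:

```python
# Every `step` seconds, transcribe the most recent `window` seconds of
# audio. Early in the recording the window is clipped to what exists.

def window_spans(total_s: float, step: float = 3.0, window: float = 15.0):
    """Return (start, end) spans: one transcription pass per `step` seconds."""
    spans = []
    t = step
    while t <= total_s:
        spans.append((max(0.0, t - window), t))
        t += step
    return spans

print(window_spans(18.0))
# [(0.0, 3.0), (0.0, 6.0), (0.0, 9.0), (0.0, 12.0), (0.0, 15.0), (3.0, 18.0)]
```

Each 3-second step re-transcribes up to 15 seconds of audio, so the same audio is processed roughly five times; that is the "increase processing requirements significantly" cost mentioned above.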


I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription.

https://tomwh.uk/git/whisper-chunk.git/

I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.


If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get real-time being the ultimate goal, but we're not there yet. So, working within the current limitations: is piss-poor transcription in real time really more desirable than better transcription with a 2-3 second delay?


I don't know an LLM that does context based rewriting of interpreted text.

That said, I haven't run into the icecream problem with Whisper. Plenty of other systems fail but Whisper just seems to get lucky and guess the right words more than anything else.

The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.


Attention is all you need, as the transformative paper (pun definitely intended) put it.

Unfortunately, you're only getting attention in 3 second chunks.


Which other streaming transcription services are you referring to?


Google's speech-to-text API: https://cloud.google.com/speech-to-text/docs/speech-to-text-...

The "alternatives" and "confidence" field is the result of the N-best decodings described elsewhere in the thread.


That’s because at the end of the day this technology doesn’t “think”. It simply holds context until the next thing, without regard for the previous information.


Whisper is excellent, but not perfect.

I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."


Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.


Thanks John Key Many!


That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.


When she told me her name, I didn't ask her to repeat it, and I got it right through the rest of the call. Whisper didn't, so how is this "at least as good as a human?"


I wouldn't expect any transcriber to know that the correct spelling in your case used a G rather than a J - the J is far more common in my experience. "Jim" would be an aberration that could be improved, but substituting "Jem" for "Gem" without any context to suggest the latter would be just fine IMO.


So, yes, and also no.


That’s been done, to see if it could extrapolate and predict the future. Can’t find the link to the paper right now.


This one? "Mind the Gap: Assessing Temporal Generalization in Neural Language Models" https://arxiv.org/abs/2102.01951


The idea matches, but 2019 is a far cry from, say, 1930.


In 1930 there was not enough information in the world for consciousness to develop.


You mean information in digestible form.


I think this is a meta-allusion to the theory that human consciousness developed recently, i.e. that people who lived before [written] language did not have language because they actually did not think. It's a potentially useful thought experiment, because we've all grown up not only knowing highly performant languages, but also knowing how to read / write.

However, primitive languages were... primitive. Were they primitive because people didn't know / understand the nuances their languages lacked? Or were those things that simply didn't get communicated (effectively)?

Of course, spoken language predates writing, which is part of the point. We know an individual can have a "conscious" conception of an idea if they communicate it, but that consciousness was limited to the individual. Once we have written language, we can perceive a level of communal consciousness of certain ideas. You could say that the community itself had a level of shared consciousness.

With GPTs regurgitating digestible writings, we've come full circle in terms of proving consciousness, and some are wondering... "Gee, this communicated the idea expertly, with nuance and clarity... but is the machine actually conscious? Does it think independently of the world, or is it merely a kaleidoscopic reflection of its inputs? Is consciousness real, or an illusion of complexity?"


I’m not sure why it’s so mind-boggling that people in the year 1225 (Thomas Aquinas) or 1756 (Mozart) were just as creative and intelligent as we modern people are. They simply had different opportunities than we have now, and what some of them did with those opportunities is beyond anything a “modern” person can imagine doing in the same circumstances. _A lot_ of free time over winter in the 1200s for certain people. Not nearly as many distractions either.


Saying early humans weren’t conscious because they lacked complex language is like saying they couldn’t see blue because they didn’t have a word for it.


Well, Oscar Wilde argues in “The Decay of Lying” that there were no stars before an artist could describe them and draw people’s attention to the night sky.

The basic assumption he attacks is that “there is a world we discover” vs “there is a world we create”.

It is a hard paradigm shift, but there is certainly reality in a “shared picture of the world”, and convincing people of a new point of view has real implications for how the world appears in our minds and what we consider “reality”.


It should be almost obligatory to state which definition of consciousness one is talking about whenever they talk about consciousness, because I, for example, don't see what language has to do with our ability to experience qualia.

Is it self-awareness? There are animals that can recognize themselves in a mirror, and I don't think all of them have a form of proto-language.


Llama are not conscious


BCAAs have been debunked countless times. The only people recommending BCAAs are selling them.


Yes! I did something similar with daily exercises at https://app.fluentsubs.com/exercises/daily


Congrats on the success! Aren't you afraid that MS ships a wiki upgrade at some point?


Given the state of the typical Microsoft PM, he will be safe. They'll always prefer more features over a fast UX. Even if there is a fast enough Teams wiki one day, the next PM will butcher it to death again.


It's more likely that they just acquire Perfect Wiki and integrate it directly.


MS already has OneNote and Loop, so another new product is unlikely to come, let alone compete.


Crazy that a reputation without a product can be worth so much.


Yes, but the industry is so entrenched and vendor-locked that it is extremely hard. People pay for Autodesk, Ansys, Comsol etc. because it is proven and engineers are trained to use it. I would not be eager to use something new if I were a construction firm or a car manufacturer.


Sure, a new startup will never get any market share in large, stable businesses like those. They would have to sell to other startups. New auto parts manufacturers pop up all the time.


Keep going! I love Cursor. Don’t let the haters get to you


Fantastic idea. Curious if this works in my town as well


You can try it.


Love it!

