
Agree on this. I wonder how threads like this even get off the ground; anyone who watches about ten minutes of auto-captioned video can see how awful it is.


That’s funny; I watch a lot of YouTube (mostly in English, mind you, but sometimes in other languages with English auto-translated captions), and I find the quality of YouTube’s auto-captions better than most “professionally” produced captions, to the point that I would sometimes switch the explicit captions off and revert to the auto-captions if I could.

I think this says more about the generally low quality of the captioning services YouTube content creators use than about anything else, though.

My biggest complaint is that human captioners often don’t recognize domain-specific words, like the names of ethnic foods. The word might literally be showing on the screen while the presenter is talking, and the “professional” captioner will just put [indistinct] there, as if they aren’t even watching the video they’re captioning. YouTube’s auto-captions get these words right every time.

My impression is that USM is uniquely good at code-switching from word to word within a sentence, which makes sense given its “universality.” I think, if they allowed it, it could even embed clauses and quotations generated in one language and alphabet into a sentence generated in an entirely different language and alphabet, keeping the syntax and grammatical structure correct for each language within its clause.


I actually feel like the autocaptioning is one of those "I'm living in the future" moments. It's amazing to play a video in Swedish and just have it autotranslate. I love it. In my opinion it's not as good as a good human translation, but I agree that I've seen many human translations that were much, much worse than the autocaptioning.

I have had some channels where the autocaptioning made me laugh out loud (probably more the translation than the transcription), but I did get a laugh out of it, and I generally knew what they were saying.

I've also noticed that, at least with the videos I watch, the autocaptioning errors usually seem "phonemically correct": the system substitutes a word that sounds the same, so I can easily figure out what was meant. I've mostly noticed these problems with non-American English (British or Australian, for example), especially when multiple people are all speaking English but with different accents. It does seem to me the English speech recognition is tuned to some West Coast or Midwest US accent.
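
To illustrate what I mean by "phonemically correct", here's a toy Python sketch using a simplified Soundex code (my own throwaway implementation, nothing to do with whatever YouTube actually runs). Words that sound alike collapse to the same code, which is roughly the class of substitution error I keep seeing:

    def soundex(word: str) -> str:
        """Simplified Soundex: like-sounding words get the same code."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        out, prev = word[0].upper(), codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                out += code
            if ch not in "hw":  # h and w don't separate repeated codes
                prev = code
        return (out + "000")[:4]

    # A typical "sounds right, reads wrong" ASR substitution:
    print(soundex("their"), soundex("there"))  # T600 T600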

I am surprised contextual cues aren't used more, but I'm very happy with YouTube's speech recognition.


They aren't hiding that, though? Literally the first graphic on the page shows their claimed word error rate on test data: around 14% in the best case, compared to the state of the art at about 15% (for en-US content).
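
For anyone unfamiliar with the metric: word error rate is just word-level edit distance divided by the number of reference words. A quick toy implementation in Python (mine, not their evaluation code):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: word-level Levenshtein distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard edit-distance dynamic program over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167

On that scale, 14% vs. 15% is roughly one fewer wrong word per hundred words spoken.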

That's only about one percentage point better than the current state of the art, but it's still noteworthy. From the end of the abstract:

> We demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of our model and fine-tuning on a smaller set of labeled data enables us to recognize these under-represented languages. Moreover, our model training process is effective for adapting to new languages and data.

It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.
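
The recipe from the abstract is the now-familiar pattern: pre-train a big encoder on unlabeled audio, then freeze it and fine-tune a small head on the new language's labeled data. A minimal PyTorch sketch of the fine-tuning half; every module name and shape here is invented for illustration and has nothing to do with USM's actual architecture:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in for a large multilingual speech encoder.
    encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
    # Pretend these weights came out of self-supervised pre-training:
    # encoder.load_state_dict(torch.load("pretrained_encoder.pt"))

    # Freeze the encoder; only a small per-language head gets trained.
    for p in encoder.parameters():
        p.requires_grad_(False)

    vocab_size = 100  # tokens of the new, under-represented language
    head = nn.Linear(256, vocab_size)
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # One fine-tuning step on a fake labeled batch: 32 frames of
    # 80-dim log-mel features, each paired with a token label.
    feats = torch.randn(32, 80)
    labels = torch.randint(0, vocab_size, (32,))
    opt.zero_grad()
    loss = loss_fn(head(encoder(feats)), labels)
    loss.backward()
    opt.step()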

For now, they've got this handling audio transcription, but there are hints that the approach could work well for translation too. Perhaps we'll eventually be able to use these improved models to help decipher Linear A [0] or other undeciphered languages. It sounds like "magic", but it's the kind that could maybe exist.

[0] https://en.wikipedia.org/wiki/Linear_A


>> It's amazing to me that the chaotic process of "machine learning" can end up with an internal state for languages that is readily adapted to entirely new languages.

Yeah, and interestingly, that was roughly Chomsky's breakthrough claim about how humans learn language as children: that we are born with an innate language acquisition device.


Chomsky's Universal Grammar is just a theory, though. Just as his formal grammars approximate real languages but never capture them fully, the UG model will never explain all of language, or even most of it. Brain biology just doesn't lend itself to formal rules and grammars.

Here's an alternative theory. What if natural languages are just the way the device starts working once the number of neurons grows large enough? That is, the properties of natural language emerge from the low-level details of how the brain works. Neurons are simple, but the brain is not: complex brain properties emerge from trivial parts, the same way our whole bodies emerge from a simple DNA/RNA system. Any regularities in such a system would be too statistical to be captured by a small set of rules.

Obviously, a powerful enough ML system can infer such a system's properties; in fact, it can approximate any function. The thing is, this doesn't mean there's some simpler model explaining the details of how the emergent system works.

What is surprising is the way LLMs imitate a stateful function (our brain, with its memory, fluid biology, etc.) using a stateless inferred function (the model). I suspect this statefulness might be the answer to the "poverty of the stimulus" problem.
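
To make that concrete with a toy sketch (plain Python, nothing like a real LLM): the "model" below is a pure, stateless function, and all of the apparent memory lives in the growing context the caller feeds back in on every step.

    def model(context: str) -> str:
        """A stateless 'LLM': same input always yields the same output."""
        return "B" if context.endswith("A") else "A"

    # Statefulness is simulated by the caller, not by the function:
    history = ""
    for _ in range(4):
        token = model(history)  # the only "memory" is the growing context
        history += token
    print(history)  # ABAB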


The business model is the interesting part: it's proving out OpenAI's approach of selling access to a model, rather than selling the insights Google derived from the model.



