Wake words are different from "listen to everything until name is called". A wake word is needed for both privacy and technical reasons: you can't just have Alexa beaming everything it hears to Amazon. So instead it uses a local, lightweight "dumb" system that listens only for specific words.
That's exactly why there's massive latencies between command recognition, processing, and execution.
Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"
Sure OK, maybe it's a beneficial side effect then. However you look at it, trying to get the computer to decide when you are addressing it, without using a name of some sort, could be a very challenging problem to solve, one that even humans struggle with. Surely you've been in a situation where you say something to a room and multiple people think you're talking to them? To borrow an example from elsewhere in the thread, if you say "turn on the lights", are you talking to the computer controlling the room lights, or the human standing next to the Christmas tree?
> Imagine if it had sub-ms response to "assistant, add uuh eggs and milk to the shopping list... actually no just eggs sorry"
Could you elaborate on that? What if that were true?
This is one advantage of a system with a constrained set of commands/grammars, as opposed to the Alexa/Siri model of trying to process all arbitrary text while in active mode. It can simply ignore/discard any invocations which don't match those specific grammars (and no need to wait to confirm that the device is awake).
"Computer, turn lights to 50%" -> "turn lights to fifty percent" -> {action: "lights", value: 50}
"My new computer has a really beefy graphics card" -> "has a really beefy graphics card" -> {action: null}
I have a coworker who set up an Alexa a year or so ago. I don't know what the issue was, but it would jump into Teams meetings after every noise in his house.
Sure, if the system is set up to only respond to very specific commands that humans would not respond to, I guess that could work. I was thinking more about the other way around, where a person might speak to someone else in the room and be overheard and acted upon - "turn on the lights!" could be a command for the computer controlling the room, or the human standing next to the Christmas tree, for example.
I’ve never had Alexa control a device via a TV show’s audio, but playing back a video of me testing my home automation (“Alexa, do X”) triggered my lights.
I’d love a no-wake-word world where something local was always chewing on what you said, but I’m not sure how well it would work in practice.
I think it would only take 1-2 instances of it hearing “Hey, who turned off the lights?” in a show and turning off my lights for real (and scaring the crap out of me). Doctor Who isn’t particularly scary, but if I was watching Silence in the Library and that line turned off my lights I’d be spooked, and it would take me a hot minute to realize what happened.
Even humans struggle with this one - that's what names are for!