(disclaimer: I work on speech recognition for mobile devices)
The experiment setup is a little weird here. The participants were given a set of pre-created phrases to type/speak. At least for English, the phrase set [1] contains utterances like
"circumstances are unacceptable"
…which contains rare but in-vocabulary words. That makes it hard for keyboard input (hard for humans to spell long words, hard for touch input to predict unlikely words) but very easy for speech recognition (no other word sounds like "circumstances"). And that test set is so old that it's very likely that the speech recognizer from the experiment (or any state of the art speech recognizer) has already been trained on those sentences.
The utterances being pre-selected is also unfortunate. When users are given a sentence to speak ahead of time, they tend not to hesitate or stutter. They also speak faster than when they're trying to think of something to say on the fly, which is more typical of text input on mobile devices.
All that being said, it's certainly true that you can often input text very quickly with speech recognition, and it's getting better every day. :)
(similar disclaimer: I've done a bunch of text input work for mobile)
That phrase set was explicitly designed for text entry (typing) experiments. While not optimal, it does allow for more direct comparison to a large body of previous studies using the same phrase set (and similar procedures).
Having said that, keyboard input methods that provide suggestions/corrections at the multi-character or word level probably also benefit from longer words, since the recognizer has more signal to work with. Revisiting the assumptions that went into that initial phrase set (character-at-a-time input) in light of modern text input techniques might be a good thing.
> That phrase set was explicitly designed for text entry (typing) experiments.
I had assumed that set was designed for typing on full-sized, physical keyboards, not touch-screen mobile devices, but apparently it wasn't. (Thanks for the correction!)
Also, even though speech is faster than typing on a touch-screen mobile device, the errors that inevitably happen are a lot easier to correct by typing.
Sometimes it's impossible to verbally correct errors or enter unrecognized words (or names!).
It does predate capacitive touchscreen keyboards. However, it wasn't made explicitly for full-size physical keyboards; they had soft keyboards in mind. Here is the paper that introduced the phrase set [1], and it refers to a soft keyboard they had made previously [2].
When I use speech to set reminders, I usually have to prepare mentally really carefully, so that I use the exact date/time phrasing it recognizes without hesitating.
This. If you don't prepare the correct phrasing you will be saying it at least twice and probably fumbling with the keyboard to get it into the right mode to accept voice again -- waiting for the acknowledgement that it is listening. Voice interfaces are currently incredibly cumbersome. There is no way to reconcile "3x faster" with the actually excruciating experience that is voice input.
Edit: one time when it was faster is when I had a string of 4-5 alarms that needed to be set. Of course I used the touch interface for prompting the voice input.
I couldn't make a habit of using speech recognition. It somehow feels weird in public. You need to speak like a news anchor, hold the phone in an unnatural way (never liked video calls due to that too) and keep an eye on the recognition so it's more effort than just instinctively typing something.
I'm still young and I find it annoying as well, mainly because typing is silent and doesn't strain your voice (good after a long day) and I often need to switch between languages and use unlisted jargon or proper nouns a lot.
As a foreigner with a strong accent, often typing in multiple languages, speech recognition has always been at best hit-and-miss for me.
Not sure how the technology has evolved recently, but it's already shameful enough when a cashier doesn't understand you; I wouldn't want my phone to do the same in public. My pride requires me to avoid speech recognition nowadays.
Bad news though - some countries like Spain have decided to offer friendlier automated phone services, replacing "press 0 for whatever" menus with speech recognition. That's a nightmare.
Yes, speech recognition and auto-correct have been more of a nuisance than an improvement for me.
Speech recognition works most of the time, but I often end up with one word that Siri just doesn't understand, and then you spend more time trying to figure out how the device wants you to pronounce the word than you save over just typing it out. There are also some funny videos of foreign speakers trying to use Siri when it first came out - that has been exactly my experience with it as well.
Auto-correct is the other annoying one, which can wreak havoc when you're texting in multiple languages. Even if you manage to set up your device properly with all the different languages, it can still cause some problems whenever you mix the languages within a single message.
However, the biggest problem I've experienced so far is when you send messages between parties where one of them doesn't have the foreign (Latin-based) language installed/configured yet. The replies I've received from people on their new phones often ranged from funny to cryptic, where you have no idea what they were trying to tell you. They just typed in the foreign text and hit send without checking, only to have auto-correct send you something unintelligible.
My best friend is a writer [1] and back in January we were driving to a convention. He was working on a book at the time and his thought was to use the drive to get some work done. I was expecting him to type, but instead he tried dictating the book to his computer.
"... and that earned him the name of War Thug ... erase previous word ... War Thug ... erase previous word ... WAR ... THUG ... WAR THUG! WAR THUG! WAR THUG! NOT WARTHOG YOU GODDAMNED PIECE [CENSORED]"
It was the most amusing thirty minutes of the drive.
He ended up not dictating the book on the drive.
It was only later that the thought came to me that he should have gone ahead and cleaned up the text later. I'm not sure how successful that approach would be.
Unfortunately speech recognition often produces results so unusable that it might be harder to clean it up later as you wouldn't be able to tell what you were even trying to say.
... or, for the sake of argument, the same thing by speech recognition:
Unfortunately speech recognition often produces insulin usable and it might be hard to clean up later can be just you and me I want to tell what you're trying to say.
Maybe my voice sucks, or maybe I should robot it up more, but this is about the level I usually get... it's okay if you're doing a quick search on Google maps and don't mind repeating it 3 times but anything more than that and you may as well start typing.
Anecdotal, but yes, sometimes it certainly sucks -- I find it useful when I'm in a hurry, so I'll use Siri and say "Remind me to buy milk, toothpaste, red wire." Most of the time it works fine. Sometimes it cuts short and creates a reminder "Buy," or completely botches the words. I've been left standing clueless in the aisle trying to decipher what Siri should've heard. Especially true in my native language, Swedish.
And that's another issue. "Well of course speech recognition works* (*mostly, kinda, in a quiet environment, and in about 10 major languages). What, you want to speak your language? Go away - it's your own damn fault for not speaking English or Chinese like normal people do!" For most languages, this is a non-problem with a keyboard.
Speaking from a Chinese context, this is very common. The irony to me is that spoken Mandarin is very fast and very colloquial. This, coupled with the limited number of phonemes, makes it very easy to transcribe speech incorrectly (i.e. not in line with the speaker's intention). The result is that most Mandarin speakers I've observed will slow down their speech and speak more clearly to try to ensure proper transcription.
This is amazing to me because most people I encounter don't do that for me, even though I am not a native speaker, and doing so would make our communication smoother and more accurate.
If we're looking at the rest of these anecdotal comments as a reflection of the larger English-speaking (and perhaps American) public, the conclusion is that speech-to-text isn't very helpful, but in a completely different linguistic and cultural environment (i.e. Chinese) we see that it is indeed quite successful.
One interesting thing to me is that, with successful enough voice-to-text transcription, the need for typing (and writing) becomes moot (in that context). I wonder if we'll see typing become a skill necessary only for specific trades/occupations.
I'm old, and I'm with you. The Amazon Echo has been way more useful to me than I would've thought because of its speech interface, so I'm not surprised at the speed or accuracy of speech recognition. It just feels weird. If I wanted to talk to someone, I would call them. Of course, there are many chat situations (the recipient can't talk on the phone, or you're group chatting) that can't be handled via phone call. But I like the tactile experience of typing in a message in those situations. Feels more deliberate. Also, a lot easier (I would think) to insert multimedia and emojis.
However, I'm not so old that I enjoy the tactile experience of handwriting a letter despite the tradeoff in speed, though that's because my handwriting is so terrible.
This must be partly cultural. I'm currently in China, and a ton of people who use WeChat speak into their phones to send voice messages, rather than type. They hold the phone so that the phone's mic is close to their mouth. When they listen to voice messages, you can see them holding their phones up to their ears to hear their phone's speakerphone clearly. And yet there I am, mostly typing my WeChat text messages instead of sending voice messages. Though I sometimes hold my phone's speakerphone up to my ear to hear someone else's voice message....
The problem I have with speech recognition is it doesn't allow for contemplation and correction (easily). I can stop mid sentence, go back, change something, and finish the sentence when I'm writing. Much harder to do in speech recognition. Particularly with phone stuff, it's better with Dragon, etc., but for composition, writing is still the fastest way for me to get something completed.
Dictation is a bit of a performance. It's not composition. Although it can make you a better thinker, I believe, if you practice it.
Charles Krauthammer is, in my opinion, a pretty good speaker. Not many pauses, rarely any filler like "uh" "um" etc. I was googling around about him, and it turns out he was asked about this. He's a medical doctor, and back in the day, they would dictate their notes over a phone in the hospital, to a recording that would later be transcribed.
After doing this long enough, he became skilled at composing his thoughts without writing them down.
Listening to yourself makes a HUGE difference, in my experience. I've gotten used to recording myself for things like singing lessons, and it's been perhaps even more beneficial to replay the casual conversational exchanges with my teacher than the singing itself.
In my case, I learned that I tend to speak in very quick bursts - get a thought out at a rapid clip, followed by a brief pause. Actively trying to speak in a more measured way comes across as far more intelligent, I think, and gives me time to think through what I'll say next.
Really hearing how many pauses, contractions, etc. you include in your regular speech helps you improve dramatically.
It works really well for a first draft if you just ignore what was already there. You can add notes / metadata, but just keep talking as you go. Mentally, I think speaking just sits in a different area: people who have trouble writing can still act out a story with hand puppets for their kids in real time.
In the end you have something that takes minimal effort to get into a readable first-draft state. And after that it's just editing.
When the speech recognition experience becomes good enough, the awkwardness (both individual and social) will dissipate. For example: at the moment, the entry and exit points for speech recognition are kludgy at best. Once your many devices operate as a single entity that recognizes your (and only your) voice, simply performs on command, and offers a reliable and comfortable/natural exit point, much of the weirdness will just disappear.
With a lot of text entry that I tend to do in public - work-related email/Slack, texts to my partner, etc - I get a reasonable amount of privacy by typing vs. speaking. Even if I had access to a super-human speech recognizer, I would still feel really uncomfortable dictating my texts to friends.
I came to the conclusion months ago that speech recognition is more efficient for input on my cellphone, but could not bring myself to make it a habit either.
I only do it when I want to send a message in a hands-free manner, such as a text message saying that I am stuck in traffic.
Recently (past 2 years, even less), the speech recognition tech has gotten much better in my experience. I can ask it things like I would ask a human, and it works.
I'll try using it again, but this time not concentrating as much on the speech, and assuming success.
Disclaimer: This works especially well for things like setting timers and reminders, but I don't really use it for replying to texts.
I'm getting comfortable with talking to my watch. It's pretty discreet if you do it correctly. The mics on these things are very sensitive, so you don't have to yell. In fact, whispering works just fine. The problem is that no one tells people this, and as we have seen from the popularity of Siri and other tech, people just automatically start yelling commands like they're using a computer from a 1950s sci-fi movie.
Even in places with loud background noise, it's a non-issue considering everything nowadays has two mics for noise reduction. I'm very, very surprised at how well Google speech-to-text works. When it fails it's almost always because I have a poor signal from T-Mobile or am saying something that's just too difficult for machines to parse correctly.
This has been my problem with talking to the watch or phone. It is usually the internet in the area that is the problem.
Also, one thing that is relatively hard for voice recognition is switching between two different languages. I'm always cooking, and words like dashi, kombu, and gnocchi are hard for it to parse. There are plenty of uses for words from other languages that don't involve saying "translate x in English".
By far my favorite thing about my smart watch is the ability to dictate texts.
I find, like some commenters above, that for longer-form composition, I often want to skip around. But for a text-length message the time to compose is drastically reduced by dictating it to my wrist rather than removing phone, unlocking phone, typing reply, etc. And it can be done largely hands-free.
I understand feeling weird speaking to your phone in public but I'm not sure I follow "hold the phone in an unnatural way". When I use speech recognition, I don't move my phone from the usual position.
I have a Moto X 2016 and it has a cool feature: simply lift the phone to the ear, as you would talk, and it starts the mic+recognition.
Then I just speak my question as I would in a conversation with a human, and I get the answer spoken into the earpiece. Of course, sometimes I need to look up the info on the screen, since it doesn't always answer directly.
I'm in my mid-50s, and I use speech recognition all the time on my phone. Multiple times per day on average. I do have to speak like a news anchor, I don't hold the phone in a natural way, and I find it invaluable for productivity reasons. I don't give a shit if people think I'm holding my phone funny, or if they think it's weird to talk quietly into my phone. Life's too short for that. This answer was dictated into my iPad, by the way.
It's the exact reason for me why I'll probably never use speech recognition for anything more than trying it out. I neither want to be the weird person talking to their computer/phone, nor do I want to annoy people, nor do I have any interest in sharing everything I do with everyone in hearing range.
In public I hold the mic close to my mouth and low-talk into it; no need to bring out the transatlantic-accent radio voice.
I can hold a coffee with one hand, see where I'm going with my eyes, and still enter text. It became reliable enough for me to use habitually a couple years ago, and it keeps getting better.
This is awkward for me in public as well, but I found that the only way to get any (light) work done on days when I'm also taking care of my baby at home, is by dictating. Suddenly I can reply to emails and quick asks from coworkers, and still be present.
I'm old, and I feel the same way, but: I'm trying to take advantage of the technology more and more, despite my social quirks of using it, since I can no longer use my phone without excruciating thumb pain.
That must be a local thing. The only time I ever see people talking on bluetooth headsets in public, it's the older guys who think they're cool because they're using expensive tech. Everyone else just holds the phone up to their ear.
I think it just means you have a bit of empathy, and awareness of how other people might view you. Many, if not most people seem to lack those qualities in general, never mind in relation to their phones.
On the topic, something which reads your subvocalizations is really needed, and could even increase the speed!
Based on what they show in the video alone, I feel this study is unfair. They are comparing "one of the best commercial speech recognizers out there", in absolutely ideal conditions (no external noise, no echo, etc.), against a normal on-screen keyboard with no prediction enabled. I'm no pro and I can type much faster than that with SwiftKey.
I'm not saying the study doesn't have merit, but "Speech _Is_ 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices" sounds a bit of a stretch.
Agreed. I type 3-4x faster than most using Swype on Samsung phones. I have become so used to it I can also type without looking at the screen in most cases.
To provide a counterexample, I definitely type less than 50 (their median) wpm on a mobile device. More like 10 - especially if there are symbols involved that are not on the first two screens. On blackberry maybe 50.
On the other hand, speech is a lot worse. A lot of times you can't afford to have 10%, 5% or even 1% error rate since messages are usually short and you cannot infer intended meaning easily. So my WPM accounting for correcting speech with swype is <10.
It's silly they only compared it to a normal on-screen keyboard and not any other input methods, predictive or not (Swype, etc.).
It should also be noted that their speech tests were done in a controlled, silent environment. I'd expect the error rate and time to complete a phrase would dramatically increase in a noisy room.
What this says is "software keyboards suck." That's the elephant in the room. "Better than utter crap" does not mean "wonderful." I do wonder how the SR test would stack up against a hardware keyboard - a device-sized one, and a full-scale one.
(Well of course on-screen keyboards suck: they're a skeuomorphic ugly hack that has been bolted onto a touchscreen. With a slideout hardware QWERTY keyboard on an Xperia Pro, I was typing slower than on a full-sized kb, but still several times faster than any onscreen input - predictive or not, swipe or not.)
There's zero tactile feedback on software keyboards and the way most people are using them is basically the good old "hunt and peck" style that's the least efficient way to use a hardware keyboard. I'm not sure the results would be any better or worse on a desktop on-screen keyboard with mouse input.
Speech-to-text seems like a technology that suffers from the 9x effect[1]. Creators overvalue the impact of voice transcription, and users overvalue their existing input options.
Even for the desktop, speech will be roughly 2X faster than typing, but I have no desire to buy a copy of Dragon because I like/overvalue my keyboard.
I think it could catch on.. if speech to text worked correctly every time.
If I hit a key on a keyboard, it works correctly every time. If I make a typo, it takes a fraction of a second to hit backspace and correct it. Mistakes are cheap to correct.
In comparison, if I talk to my phone, it makes a significant number of mistakes.. making me talk slower/louder/differently to use it. And when it does make a mistake, it's more costly to correct.. I either have to start over, or pick it up and hit backspace. So I disable it and never use it.
This is the issue I have with speech-to-text systems as well. None of them have an intuitive, reliable way of making corrections that I don't have to painstakingly learn. Also, Google's keyboard (like SwiftKey) is far, far faster than typing normally. Amusingly, it also suffers from poor correctability...
> It will not catch on unless a privacy device is made so that people in office environments can do it without disturbing their co-workers.
The privacy device is called "an individual office", and it's pretty important for people in "office environments" to be effective, independently of their use of voice recognition.
They've been around for quite a while, but their popularity goes up and down periodically.
I cannot find the article at the moment, but there was one discussed on hacker news recently that mentioned a privacy device for telephones before the 1950s that achieved the same effect as cupping your hands over the receiver so that others in the room could not hear what you were saying into it. If programming by voice takes off, I imagine such a thing would be a necessity to keep office environments sane. The same goes for regular text input by voice.
> If programming by voice takes off, I imagine such a thing would be a necessity to keep office environments sane.
Or maybe we'd all get private offices with decent sound-proofing.
Well, that person suffers from RSI, so there's a certain group of people who would benefit from it. Here are some resources if anyone is interested:
Never tried to program with voice, but I did set up Dragon and macros so I could dictate javadoc. That worked pretty well. Since a lot of the language in javadoc is fairly regular, I could work pretty quickly.
According to Wikipedia, a "comfortable" speed for audiobooks is 150-160 wpm, whereas the "average professional typist" achieves 50-80 wpm [0]. So 2x seems to be fairly accurate for the average user. Real-time transcriptions in court rooms etc. are usually done with stenotype machines [1].
Are you comparing the polished output (audiobooks) to raw input (typing)? That's about as unfair a comparison as you could possibly make: I'm positive (from experience) that the audiobooks undergo massive retakes and sound editing. What you hear is not a person narrating the whole thing in a single take; the raw input rate (number of words / time spent in front of the mic) is about 20 wpm, if you're really really lucky (you will throw away a lot of material). And halve that for edits, which are unavoidable. Plus the recording happens under ideal conditions, with full and perfect focus on pronunciation.
TL;DR: audiobooks are almost completely unlike voice recognition, you're comparing apples and oblique angles.
I'm with you on this one. I had to set a reminder to "Get the soy milk", and Google's speech recognition picked up "Get this white milk". I knew what it meant so I just saved the reminder instead of correcting it. Unfortunately I cannot rely on speech to send texts, dictate emails, or have any other interaction with other people.
It's surprising that they don't show you a screen with the transcription and let you easily select bits you want to re-transcribe.
Text selection on phones usually sucks, but that's only because you need to disambiguate between clicks and highlighting; if you had a dedicated screen that assumes a user wants to highlight a section it should be a lot more responsive.
I found speech recognition to be useful mostly for brain dumps. I've found I tend to think best when explaining or talking. I used to bring along a voice recorder on long drives, capture what I was thinking, then run it through voice recognition later. It was often a garbled mess, but usefully captured a lot of thinking. Modern recognition systems would do a lot better. Yeah, I should try that again.
This is a good point. I've been using the same strategy for taking notes in meetings where there is lots of talk: just write down stuff, not concentrating on format or content, then after the meeting go through the notes and clarify. Trouble is, writing even quick notes is difficult to do if you also need to talk and think at the same time.
Probably you could build some neat solution around this. Like a speakerphone-style device with an array of microphones to make it easier to identify who is speaking and pick up the words. Or maybe a regular smartphone is enough. The device/app would then just transcribe what is spoken and annotate it with names.
Compared to audio recording, the benefit would be that going through the written raw material is much faster, and if you were present, you can probably recall the content even if the transcription is not perfect. It might also be more socially acceptable to use this solution than to record the meetings.
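To make the idea concrete, here's a minimal Python sketch using the SpeechRecognition package for the listening/transcription step; the identify_speaker function is a hypothetical stub, since real speaker identification would need a mic array or per-speaker voice models:

    # Rough sketch of the meeting-notes idea: listen in chunks, transcribe
    # each chunk, and tag it with a speaker label for later cleanup.
    # pip install SpeechRecognition pyaudio
    import speech_recognition as sr

    def identify_speaker(audio):
        # Hypothetical stub: real diarization would use a microphone array
        # or trained voice models; here we just return a placeholder.
        return "unknown"

    recognizer = sr.Recognizer()
    notes = []
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        for _ in range(10):  # grab ten utterances, then stop
            audio = recognizer.listen(source, phrase_time_limit=15)
            try:
                text = recognizer.recognize_google(audio)
            except sr.UnknownValueError:
                continue  # skip chunks the recognizer couldn't parse
            notes.append("[%s] %s" % (identify_speaker(audio), text))
    print("\n".join(notes))  # raw annotated transcript, to clean up after the meeting

The raw output would still need the post-meeting cleanup pass described above, but at least it starts out as searchable text.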
Check out my project, Spreza. We're not targeting voice input, per se, but automated transcription. Nonetheless, the correction issue persists in both use cases. We solve it via a homonym-suggestion drop-down box: double-click an error and replace it with the correct word.
But is it 3x as valuable? Anyone can talk a mile a minute and let the words flow out of their face-hole as fast as their brain can string words together.
Slowing down to think about and review what you're communicating is a feature, not a bug, of text.
How is that even relevant? It's not audio vs. text, it's speech vs. typing. Just use speech-to-text, and you can see your text on the screen before you give it to Siri or your friend. Still better than typing.
While voice commands may be better than using the keyboard, they're still distracting and dangerous. It is best to avoid interacting with your phone at all while you drive.
I think a world of all-speech interfaces would be flawed (if that is the natural conclusion of this line of thinking). I may be able to speak 3x faster than I type, but I read 10x faster than I can listen to someone speak. Speech-to-text is a good replacement for typing, but that doesn't imply to me that it should supersede visuals.
Why would embracing one destroy the other? I think a combination of gestural interface, subvocalization voice recognition, and AR could be the big winner in our lifetimes. The text won't be bound to a screen, you don't give anything up, you just gain.
When you need to get into serious writing or bulk data entry, maybe it would be a keyboard.
Soldiers often use throat microphones, so you can speak so quietly it's effectively silent.
I wonder if a front-facing camera on a phone could capture your throat in sufficient detail to decipher what you say, even if you speak silently, while holding the phone in your palm as people do when browsing.
Here's something positive: This kind of research might indirectly make people healthier and save a lot of lives.
Calorie tracking is a proven way to lose weight, but all the tracking apps that I tried feel like they take too much effort. So I came up with an idea a few years ago: you should be able to just say what you had eaten out loud, and then use speech-to-text and NLP to search for each item and count up the calories.
It works amazingly well. Not only is speech 3x faster than typing, it's also much faster to have a free-form text field that is automatically parsed. And they integrate with Amazon Echo, which I look forward to trying out.
I've been thinking about many other ways to automate calorie tracking. For a while I thought the answer would be an AI that recognizes photos of food, but that doesn't feel important any more. I think speech-to-text takes approximately the same amount of time as opening the camera and snapping a photo. There have been a few minor errors with Apple's voice dictation, but so far I've seen 100% accuracy from the actual text searches.
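For what it's worth, the free-form parsing half of this is simple enough to sketch. A toy Python version, with a made-up calorie table and deliberately crude matching (a real app would need proper NLP for quantities and units):

    # Toy sketch: parse a dictated meal description and sum calories.
    # The table values and the matching are illustrative only.
    CALORIES = {"milk": 103, "egg": 78, "toast": 75, "apple": 95}  # kcal per serving

    def log_meal(dictated):
        total = 0
        for item in dictated.lower().replace(" and ", ",").split(","):
            for food, kcal in CALORIES.items():
                if food in item:  # crude matching; ignores quantities like "two"
                    total += kcal
        return total

    print(log_meal("a glass of milk, two eggs and toast"))  # 103 + 78 + 75 = 256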
So anyway, speech-to-text research. It's all important.
Although I kind of wish Apple hadn’t baked speech recognition into their parsing algorithm. Not everything I say will be in their dictionary, and there are times I’d rather just be able to type the query in myself than have Siri misunderstand it and then have to manually retype the parts that were misinterpreted.
One of the amazingly dumb things about my Nexus 6 is that when it's listening for my voice it still plays incoming SMS message tones that it then takes in as part of speech recognition and messes it up. Why? Really, just why?
I am a native English speaker and a second language Mandarin speaker who learned at early adult age.
I personally consider typing English and Mandarin to be very different. There are linguistic, psychological, and cultural issues at play.
Firstly, the dominant Chinese input system is phonetic (pinyin), which perhaps implies some kind of different mental state when typing.
Secondly, it is the case for adult learners like me, but reportedly also for many native speakers, that precise Chinese characters are easy to forget. People have visual memories of 3-10,000 characters, but can perhaps confidently write as few as half of them from memory. The phonetic input system presents a context-based shortlist of suggested characters, and the user is asked to select the one they intended.
Sometimes, in more extreme cases, particularly for native speakers with heavy accents or new second language speakers, users may be unsure which character to select or may even input an incorrect but close phoneme, scan for visual recognition of the correct character, fail to find it, then type a different phoneme.
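For readers who haven't used a pinyin IME, the select-from-candidates flow described above looks roughly like this toy Python sketch (the candidate table is a tiny hand-picked illustration; real IMEs rank candidates by frequency and sentence context):

    # Toy pinyin input flow: type a syllable, pick the intended character.
    CANDIDATES = {
        "ma": ["妈", "马", "吗", "骂"],   # one syllable, many characters
        "shi": ["是", "十", "时", "事"],
    }

    def suggest(pinyin):
        return CANDIDATES.get(pinyin, [])

    for i, char in enumerate(suggest("ma"), start=1):
        print("%d. %s" % (i, char))  # the user picks by number, as on a phone IME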
Frequently, typing Chinese is the only major creative interaction that Mandarin speakers have with modern Chinese text, since handwriting is becoming increasingly rare outside of a school or government-form context.
Yeah and it's even slower to listen to. I still have a 50+ second long recording from someone that I haven't bothered to listen to, because if they're too lazy to type it out, why should I be expected to wait nearly a minute to listen to it?
Whenever I try to use speech recognition on my Android phone, it's frustrating because when it recognizes the wrong word, there's no way (that I know of) to delete a word or move the cursor.
Also when I try to put a period in (we call them "full stops" in Australia), half the time it inserts a period, and the other half it literally writes "period".
I would like to see some documentation on how to use it better; I had a go at googling for some a while back but couldn't find anything useful. I ended up with the impression that such editing commands weren't implemented.
I still have a go every now and then to see if it's improved with updates, but it's basically unusable in its current state.
Unfortunately, speech recognition has been used for mass surveillance and is likely to be abused in the future. There needs to be a hardware control on microphones that allows the external user to control whether they are being listened to.
I agree 100%. Right now we don't even know the number of microphones our devices have, much less the contexts in which they may be active. Speech recognition protocols should also integrate low-level encryption standards to ensure access is only provided to trusted parties.
Is this surprising to anyone? I can speak way faster than I can type even under the best of conditions, let alone on a shitty mobile keyboard. I'm surprised it's only a factor of 3. I tend to avoid writing stuff on mobile and wait until I get home, because it just feels so painfully slow compared to a real keyboard.
I see a bunch of comments disputing the result. To those people, do you really think that typing on mobile is as fast or faster than speaking?
It's comparing the two as viable inputs on a smartphone. Speech recognition is far from perfect and doesn't yet allow you to talk like you would to a human. Did you read the article?
I can type anywhere, but there are a lot of places where speech is impossible/inconvenient. I wouldn't want to work in an office with everyone shouting commands at their computer. Audio feedback doesn't work if I'm listening to music from another device. The computer can't hear me if I'm playing music, etc.
I have to type a single number into Google Maps for it to auto-complete to my home address. That's not really a beatable speed.
And there is the tiny problem that I can't even pronounce some of the things I've typed into Google. Examples include: Japanese manga names, company names, and foreign names from news articles.
From the paper, the IST on the graph stands for "Initial Speech Transcription", and those data points are for the speech-input text before any corrections were made. The other "Speech" data-points include time to make corrections either by using the keyboard or using speech recognition.
Error correction is more intuitive for typing - typing errors are usually obvious transpositions or missing characters, which our brains easily correct. When a speech recognition engine makes a mistake, it's less obvious to the user and usually more jarring: the output is an actual word, just wrong in context. This makes it more difficult for the user to decipher. So even if speech is more accurate or faster, the cost of inaccuracy seems higher.
People will never do it; it's like saying you should control your phone with your genitalia in public. I don't think it'll break through social norms before better methods are available.
But at the end of the day typing speed doesn't matter. It's not what's limiting us (in English; not all languages).
But I think many people equate speech recognition with language parsing, which is why people seem to be obsessed with it.
As an aside, I use 9-key swipe input for Japanese and it's really great once you're used to it. I feel like there's still some way to input English we haven't found yet which really suits the language..
I used to like LibreOffice/OpenOffice.org predictive input, and I still sometimes wonder why Windows doesn't provide system-wide text prediction like we have on mobile devices.
I once saw someone using a "radial" keyboard on an Android device. Unfortunately at the time I saw it, the company had vanished.
I'm using SwiftKey (the non-online version), and I can type pretty quickly after the keyboard has been "trained". I think a combination of prediction and a better keyboard layout would help input speed quite a lot.
I just tried Google voice and was surprised hour fast it was. The only problem for me read with punctuation. It always seems to type the word rather than using the correct punctuation mark.
So far this has taken me 1.25 minutes using swipe input on Google b keyboard. Couple of errors and issues (probably because i use my fat thumbs for entry).
I I just trying to Google Voice and was surprised how fast it was. The only problem for me was with punctuation. It always seems to type the word, rather than using the correct punctuation mark. So far this has taken me 37 seconds using voice input from Google keyboard period couple of errors and issues open bracket probably because I use my fat from the entry for my voice close bracket period
I'll have to try when I get home, but I'm curious how speech recognition will do against the message chosen for the touchscreen record. I'm not 100% sure how to even pronounce a couple of the words:
http://www.guinnessworldrecords.com/news/2014/5/fastest-touc...
I wonder what it is with a normal keyboard? I think I can type faster than I can talk. I never learned the keyboard, but my fingers know where all the keys are. I can't tell you the order of letters of the qwerty keyboard, but my brain knows where all the keys are.
You can't type faster than you can talk. Trained stenotypists come close, but there's nobody who can consistently hit 130-150 wpm on a QWERTY or DVORAK keyboard.
You can - if 150 wpm is what you speak, with Plover and training you can reach 120-225 wpm on a qwerty keyboard with NKRO capability. But that's indeed steno (on a regular keyboard). With regular strokes and autocomplete you might reach 70 wpm or more.
https://www.youtube.com/watch?v=Wpv-Qb-dB6g
When I'm at home or in my car I use speech to write text messages and perform google searches all the time. It's so much more convenient now that the voice recognition is so reliable these days :)
It would be great if I could register a voice mapping for certain words, regardless of language.
e.g. pawned => pwned, snafu => snafu, and pronunciations from another language mapped to some foreign word
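One way to approximate this today would be a personal substitution pass over whatever the recognizer emits. A minimal Python sketch (the table entry is illustrative):

    # Sketch of a personal "registered words" pass applied after recognition:
    # map what the recognizer tends to hear to the spelling you actually want.
    import re

    REGISTERED = {"pawned": "pwned"}  # heard -> wanted; illustrative entry

    def fix_transcript(text):
        for heard, wanted in REGISTERED.items():
            # word boundaries so "pawned" doesn't match inside longer words
            text = re.sub(r"\b%s\b" % re.escape(heard), wanted, text)
        return text

    print(fix_transcript("i got pawned again"))  # -> "i got pwned again"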
Hmm...for me typing AND speech recognition on mobile devices are horrible. To know that one is slightly more horrid than the other is small consolation.
[1] http://www.yorku.ca/mack/PhraseSets.zip