(disclaimer: I work on speech recognition for mobile devices)
The experiment setup is a little weird here. The participants were given a set of pre-created phrases to type/speak. At least for English, the phrase set [1] contains utterances like
"circumstances are unacceptable"
…which contains rare but in-vocabulary words. That makes it hard for keyboard input (hard for humans to spell long words, hard for touch input to predict unlikely words) but very easy for speech recognition (no other word sounds like "circumstances"). And that test set is so old that it's very likely that the speech recognizer from the experiment (or any state of the art speech recognizer) has already been trained on those sentences.
The utterances being pre-selected is also unfortunate. When users are given a sentence to speak ahead of time, they tend not to hesitate or stutter. They also speak faster than when they're trying to think of something to say on the fly, which is more typical of text input on mobile devices.
All that being said, it's certainly true that you can often input text very quickly with speech recognition, and it's getting better every day. :)
(similar disclaimer: I've done a bunch of text input work for mobile)
That phrase set was explicitly designed for text entry (typing) experiments. While not optimal, it does allow for more direct comparison to a large body of previous studies using the same phrase set (and similar procedures).
Having said that, keyboard input methods that provide suggestions/corrections at the multi-character or word level probably also benefit from longer words, since the recognizer has more signal to work with. Revisiting the assumptions that went into that initial phrase set (character-at-a-time input) in light of modern text input techniques might be a good thing.
> That phrase set was explicitly designed for text entry (typing) experiments.
I had assumed that set was designed for typing on full-sized, physical keyboards, not touch-screen mobile devices, but apparently it wasn't. (Thanks for the correction!)
Also, even though speech is faster than typing on a touch-screen mobile device, the errors that inevitably happen are a lot easier to correct by typing.
Sometimes it's impossible to verbally correct errors or enter unrecognized words (or names!).
It does predate capacitive touchscreen keyboards. However, it wasn't made explicitly for full-size physical keyboards; they had soft keyboards in mind. Here is the paper that introduced the phrase set [1], and it refers to a soft keyboard they had made previously [2].
When I use speech to set reminders, I usually have to prepare mentally really carefully, so that I use the exact date/time phrasing it recognizes without hesitating.
This. If you don't prepare the correct phrasing you will be saying it at least twice and probably fumbling with the keyboard to get it into the right mode to accept voice again -- waiting for the acknowledgement that it is listening. Voice interfaces are currently incredibly cumbersome. There is no way to reconcile "3x faster" with the actually excruciating experience that is voice input.
Edit: one time when it was faster is when I had a string of 4-5 alarms that needed to be set. Of course I used the touch interface for prompting the voice input.
I couldn't make a habit of using speech recognition. It somehow feels weird in public. You need to speak like a news anchor, hold the phone in an unnatural way (never liked video calls due to that too) and keep an eye on the recognition so it's more effort than just instinctively typing something.
I'm still young and I find it annoying as well, mainly because typing is silent and doesn't strain your voice (good after a long day) and I often need to switch between languages and use unlisted jargon or proper nouns a lot.
As a foreigner with a strong accent, often typing in multiple languages, speech recognition has always been at best hit-and-miss for me.
Not sure how the technology has evolved recently, but it's already shameful enough when a cashier doesn't understand you; I wouldn't want my phone to do the same in public. My pride requires me to avoid speech recognition nowadays.
Bad news though - some countries like Spain have decided to offer friendlier automated phone services, replacing "press 0 for whatever" menus with speech recognition. That's a nightmare.
Yes, speech recognition and auto-correct have been more of a nuisance than an improvement for me.
Speech recognition works most of the time, but I often end up with one word that Siri just doesn't understand, and then you spend more time trying to figure out how the device wants you to pronounce the word than you save over just typing it out. There are also some funny videos of foreign speakers trying to use Siri when it first came out - that has been exactly my experience with it as well.
Auto-correct is the other annoying one, which can wreak havoc when you're texting in multiple languages. Even if you manage to set up your device properly with all the different languages, it can still cause some problems whenever you mix the languages within a single message.
However, the biggest problem I've experienced so far is when you send messages between parties where one of them doesn't have the foreign (Latin-based) language installed/configured yet. The replies I've received from people on their new phones often ranged from funny to cryptic, where you have no idea what they were trying to tell you. They just typed in the foreign text and hit send without checking, only to have auto-correct send you something unintelligible.
My best friend is a writer [1] and back in January we were driving to a convention. He was working on a book at the time and his thought was to use the drive to get some work done. I was expecting him to type, but instead he tried dictating the book to his computer.
"... and that earned him the name of War Thug ... erase previous word ... War Thug ... erase previous word ... WAR ... THUG ... WAR THUG! WAR THUG! WAR THUG! NOT WARTHOG YOU GODDAMNED PIECE [CENSORED]"
It was the most amusing thirty minutes of the drive.
He ended up not dictating the book on the drive.
It was only later that the thought came to me that he should have gone ahead and cleaned up the text later. I'm not sure how successful that approach would be.
Unfortunately speech recognition often produces results so unusable that it might be harder to clean it up later as you wouldn't be able to tell what you were even trying to say.
... or, for the sake of argument, the same thing by speech recognition:
Unfortunately speech recognition often produces insulin usable and it might be hard to clean up later can be just you and me I want to tell what you're trying to say.
Maybe my voice sucks, or maybe I should robot it up more, but this is about the level I usually get... it's okay if you're doing a quick search on Google maps and don't mind repeating it 3 times but anything more than that and you may as well start typing.
Anecdotal, but yes, sometimes it certainly sucks -- I find it useful when I'm in a hurry, so I'll use Siri and say "Remind me to buy milk, toothpaste, red wire." Most of the time it works fine. Sometimes it cuts short and creates a reminder "Buy," or completely botches the words. I've been left standing clueless in the aisle trying to decipher what Siri should've heard. Especially true in my native language, Swedish.
And that's another issue. "Well of course speech recognition works* (*mostly, kinda, in a quiet environment, and in about 10 major languages). What, you want to speak your language? Go away - it's your own damn fault for not speaking English or Chinese like normal people do!" For most languages, this is a non-problem with a keyboard.
Speaking from a Chinese context, this is very common. The irony to me is that spoken Mandarin is very fast and very colloquial. This, coupled with the limited number of phonemes, makes it very easy to transcribe speech incorrectly (i.e. not in line with the speaker's intention). The result is that most Mandarin speakers I've observed will slow down their speech and speak more clearly to try to ensure proper transcription.
This is amazing to me because most people I encounter don't do that for me, even though I am not a native speaker, and doing so would make our communication smoother and more accurate.
If we're looking at the rest of these anecdotal comments as a reflection of the larger English-speaking (and perhaps American) public, the conclusion is that speech-to-text isn't very helpful, but in a completely different linguistic and cultural environment (i.e. Chinese) we see that it is indeed quite successful.
One interesting thing to me is that, with successful enough voice-to-text transcription, the need for typing (and writing) becomes moot (in that context). I wonder if we'll see typing become a skill necessary only for specific trades/occupations.
I'm old, and I'm with you. The Amazon Echo has been way more useful to me than I would've thought because of its speech interface, so I'm not surprised at the speed or accuracy of speech recognition. It just feels weird. If I wanted to talk to someone, I would call them. Of course, there are many chat situations (the recipient can't talk on the phone, or you're group chatting) that can't be handled via phone call. But I like the tactile experience of typing in a message in those situations. Feels more deliberate. Also, a lot easier (I would think) to insert multimedia and emojis.
However, I'm not so old that I enjoy the tactile experience of handwriting a letter despite the tradeoff in speed, though that's because my handwriting is so terrible.
This must be partly cultural. I'm currently in China, and a ton of people who use WeChat speak into their phones to send voice messages, rather than type. They hold the phone so that the phone's mic is close to their mouth. When they listen to voice messages, you can see them holding their phones up to their ears to hear their phone's speakerphone clearly. And yet there I am, mostly typing my WeChat text messages instead of sending voice messages. Though I sometimes hold my phone's speakerphone up to my ear to hear someone else's voice message....
The problem I have with speech recognition is it doesn't allow for contemplation and correction (easily). I can stop mid sentence, go back, change something, and finish the sentence when I'm writing. Much harder to do in speech recognition. Particularly with phone stuff, it's better with Dragon, etc., but for composition, writing is still the fastest way for me to get something completed.
Dictation is a bit of a performance. It's not composition. Although it can make you a better thinker, I believe, if you practice it.
Charles Krauthammer is, in my opinion, a pretty good speaker. Not many pauses, rarely any filler like "uh" "um" etc. I was googling around about him, and it turns out he was asked about this. He's a medical doctor, and back in the day, they would dictate their notes over a phone in the hospital, to a recording that would later be transcribed.
After doing this long enough, he became skilled at composing his thoughts without writing them down.
Listening to yourself makes a HUGE difference, in my experience. I've gotten used to recording myself for things like singing lessons, and it's been perhaps even more beneficial to replay the casual conversational exchanges with my teacher than the singing itself.
In my case, I learned that I tend to speak in very quick bursts - get a thought out at a rapid clip, followed by a brief pause. Actively trying to speak in a more measured way comes across as far more intelligent, I think, and gives me time to think through what I'll say next.
Really hearing how many pauses, contractions, etc. you include in your regular speech helps you improve dramatically.
It works really well for a first draft if you just ignore what was already there. You can add notes / metadata, but just keep talking as you go. Mentally, I think speaking just sits in a different area: people who have trouble writing can still act out a story with hand puppets for their kids in real time.
In the end you have something that takes minimal effort to get into a readable first-draft state. And after that it's just editing.
When the speech recognition experience becomes good enough, the awkwardness (both individual and social) will dissipate. For example: at the moment, the entry and exit points for speech recognition are kludgy at best. Once your many devices operate as a single entity that recognizes your (and only your) voice, simply performs on command, and offers a reliable and comfortable/natural exit point, much of the weirdness will just disappear.
With a lot of text entry that I tend to do in public - work-related email/Slack, texts to my partner, etc - I get a reasonable amount of privacy by typing vs. speaking. Even if I had access to a super-human speech recognizer, I would still feel really uncomfortable dictating my texts to friends.
I came to the conclusion months ago that speech recognition is more efficient for input on my cellphone, but could not bring myself to make it a habit either.
I only do it when I want to send a message in a hands-free manner, such as a text message saying that I am stuck in traffic.
Recently (past 2 years, even less), the speech recognition tech has gotten much better in my experience. I can ask it things like I would ask a human, and it works.
I'll try using it again, but this time not concentrating as much on the speech, and assuming success.
Disclaimer: This works especially well for things like setting timers and reminders, but I don't really use it for replying to texts.
I'm getting comfortable with talking to my watch. It's pretty discreet if you do it correctly. The mics on these things are very sensitive, so you don't have to yell. In fact, whispering works just fine. The problem is that no one tells people this, and as we have seen from the popularity of Siri and other tech, people just automatically start yelling commands like they're using a computer from a 1950s sci-fi movie.
Even in places with loud background noise, it's a non-issue considering everything nowadays has two mics for noise reduction. I'm very, very surprised at how well Google speech-to-text works. When it fails it's almost always because I have a poor signal from T-Mobile or am saying something that's just too difficult for machines to parse correctly.
This has been my problem with talking to the watch or phone. It is usually the internet in the area that is the problem.
Also, one thing that is relatively hard for voice recognition is switching between two different languages. I'm always cooking, and words like dashi, kombu, and gnocchi are hard for it to parse. There are plenty of uses for words from other languages that don't involve saying "translate x in English".
By far my favorite thing about my smart watch is the ability to dictate texts.
I find, like some commenters above, that for longer-form composition, I often want to skip around. But for a text-length message the time to compose is drastically reduced by dictating it to my wrist rather than removing phone, unlocking phone, typing reply, etc. And it can be done largely hands-free.
I understand feeling weird speaking to your phone in public but I'm not sure I follow "hold the phone in an unnatural way". When I use speech recognition, I don't move my phone from the usual position.
I have a Moto X 2016 and it has a cool feature: simply lift the phone to the ear, as you would talk, and it starts the mic+recognition.
Then I just speak my question as I would in a conversation with a human, and I get the answer spoken into the earpiece. Of course, sometimes I need to look up the info on the screen, since it doesn't always answer directly.
I'm in my mid-50s, and I use speech recognition all the time on my phone. Multiple times per day on average. I do have to speak like a news anchor, I don't hold the phone in a natural way, and I find it invaluable for productivity reasons. I don't give a shit if people think I'm holding my phone funny, or if they think it's weird to talk quietly into my phone. Life's too short for that. This answer was dictated into my iPad, by the way.
It's the exact reason for me why I'll probably never use speech recognition for anything more than trying it out. I neither want to be the weird person talking to their computer/phone, nor do I want to annoy people, nor do I have any interest in sharing everything I do with everyone in hearing range.
In public I hold the mic close to my mouth and low-talk into it; no need to bring out the transatlantic-accent radio voice.
I can hold a coffee with one hand, see where I'm going with my eyes, and still enter text. It became reliable enough for me to use habitually a couple years ago, and it keeps getting better.
This is awkward for me in public as well, but I found that the only way to get any (light) work done on days when I'm also taking care of my baby at home, is by dictating. Suddenly I can reply to emails and quick asks from coworkers, and still be present.
I'm old, and I feel the same way, but: I'm trying to take advantage of the technology more and more, despite my social quirks of using it, since I can no longer use my phone without excruciating thumb pain.
That must be a local thing. The only time I ever see people talking on bluetooth headsets in public, it's the older guys who think they're cool because they're using expensive tech. Everyone else just holds the phone up to their ear.
I think it just means you have a bit of empathy, and awareness of how other people might view you. Many, if not most people seem to lack those qualities in general, never mind in relation to their phones.
On the topic, something which reads your subvocalizations is really needed, and could even increase the speed!
Based on what they show in the video alone, I feel this study is unfair. They are comparing "one of the best commercial speech recognizers out there", in absolutely ideal conditions (no external noise, no echo, etc.), against a normal on-screen keyboard with no prediction enabled. I'm no pro and I can type much faster than that with SwiftKey.
I'm not saying the study doesn't have merit, but "Speech _Is_ 3x Faster than Typing for English and Mandarin Text Entry on Mobile Devices" sounds a bit of a stretch.
Agreed. I type 3-4x faster than most using Swype on Samsung phones. I have become so used to it I can also type without looking at the screen in most cases.
To provide a counterexample, I definitely type less than 50 (their median) wpm on a mobile device. More like 10 - especially if there are symbols involved that are not on the first two screens. On blackberry maybe 50.
On the other hand, speech is a lot worse. A lot of times you can't afford to have 10%, 5% or even 1% error rate since messages are usually short and you cannot infer intended meaning easily. So my WPM accounting for correcting speech with swype is <10.
It's silly they only compared it to a normal on-screen keyboard and not any other input methods, predictive or not (Swype, etc.).
It should also be noted that their speech tests were done in a controlled, silent environment. I'd expect the error rate and time to complete a phrase would dramatically increase in a noisy room.
What this says is "software keyboards suck." That's the elephant in the room. "Better than utter crap" does not mean "wonderful." I do wonder how the SR test would stack up against a hardware keyboard - a device-sized one, and a full-scale one.
(Well of course on-screen keyboards suck: they're a skeuomorphic ugly hack that has been bolted onto a touchscreen. With a slideout hardware QWERTY keyboard on an Xperia Pro, I was typing slower than on a full-sized kb, but still several times faster than any onscreen input - predictive or not, swipe or not.)
There's zero tactile feedback on software keyboards and the way most people are using them is basically the good old "hunt and peck" style that's the least efficient way to use a hardware keyboard. I'm not sure the results would be any better or worse on a desktop on-screen keyboard with mouse input.
Speech-to-text seems like a technology that suffers from the 9x effect[1]. Creators overvalue the impact of voice transcription, and users overvalue their existing input options.
Even for the desktop, speech will be roughly 2X faster than typing, but I have no desire to buy a copy of Dragon because I like/overvalue my keyboard.
I think it could catch on.. if speech to text worked correctly every time.
If I hit a key on a keyboard, it works correctly every time. If I make a typo, it takes a fraction of a second to hit backspace and correct it. Mistakes are cheap to correct.
In comparison, if I talk to my phone, it makes a significant number of mistakes.. making me talk slower/louder/differently to use it. And when it does make a mistake, it's more costly to correct.. I either have to start over, or pick it up and hit backspace. So I disable it and never use it.
This is the issue I have with speech-to-text systems as well. None of them have an intuitive, reliable way of making corrections that I don't have to painstakingly learn. Also, Google's keyboard (like SwiftKey) is far, far faster than typing normally. Amusingly, it also suffers from poor correctability...
> It will not catch on unless a privacy device is made so that people in office environments can do it without disturbing their co-workers.
The privacy device is called "an individual office", and it's pretty important for people in "office environments" to be effective, independently of their use of voice recognition.
They've been around for quite a while, but their popularity goes up and down periodically.
I cannot find the article at the moment, but there was one discussed on hacker news recently that mentioned a privacy device for telephones before the 1950s that achieved the same effect as cupping your hands over the receiver so that others in the room could not hear what you were saying into it. If programming by voice takes off, I imagine such a thing would be a necessity to keep office environments sane. The same goes for regular text input by voice.
> If programming by voice takes off, I imagine such a thing would be a necessity to keep office environments sane.
Or maybe we'd all get private offices with decent sound-proofing.
Well, that person suffers from RSI, so there's a certain group of people who would benefit from it. Here are some resources if anyone is interested:
Never tried to program with voice, but I did set up Dragon and macros so I could dictate javadoc. That worked pretty well. Since a lot of the language in javadoc is fairly regular, I could work pretty quickly.
According to Wikipedia, a "comfortable" speed for audiobooks is 150-160 wpm, whereas the "average professional typist" achieves 50-80 wpm [0]. So 2x seems to be fairly accurate for the average user. Real-time transcriptions in court rooms etc. are usually done with stenotype machines [1].
Are you comparing the polished output (audiobooks) to raw input (typing)? That's about as unfair a comparison as you could possibly make: I'm positive (from experience) that the audiobooks undergo massive retakes and sound editing. What you hear is not a person narrating the whole thing in a single take; the raw input rate (number of words / time spent in front of the mic) is about 20 wpm, if you're really really lucky (you will throw away a lot of material). And halve that for edits, which are unavoidable. Plus the recording happens under ideal conditions, with full and perfect focus on pronunciation.
TL;DR: audiobooks are almost completely unlike voice recognition, you're comparing apples and oblique angles.
I'm with you on this one. I had to set a reminder to "Get the soy milk", and Google's speech recognition picked up "Get this white milk". I knew what it meant so I just saved the reminder instead of correcting it. Unfortunately I cannot rely on speech to send texts, dictate emails, or have any other interaction with other people.
It's surprising that they don't show you a screen with the transcription and let you easily select bits you want to re-transcribe.
Text selection on phones usually sucks, but that's only because you need to disambiguate between clicks and highlighting; if you had a dedicated screen that assumes a user wants to highlight a section it should be a lot more responsive.
I found speech recognition to be useful mostly for brain dumps. I've found I tend to think best when explaining or talking. I used to bring along a voice recorder on long drives, capture what I was thinking, then run it through voice recognition later. It was often a garbled mess, but usefully captured a lot of thinking. Modern recognition systems would do a lot better. Yeah, I should try that again.
This is a good point. I've been using the same strategy for taking notes in meetings where there is lots of talk: just write down stuff, not concentrating on format or content, then after the meeting go through the notes and clarify. Trouble is, writing even quick notes is difficult to do if you also need to talk and think at the same time.
Probably you could build some neat solution around this. Like a speakerphone-style device with an array of microphones to make it easier to identify who is speaking and pick up the words. Or maybe a regular smartphone is enough. The device/app would then just transcribe what is spoken and annotate it with names.
Compared to audio recording, the benefit would be that going through the written raw material is much faster, and if you were present, you can probably recall the content even if the transcription is not perfect. It might also be more socially acceptable to use this solution than to record the meetings.
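To make the idea concrete, here's a minimal Python sketch using the SpeechRecognition package for the listening/transcription step; the identify_speaker function is a hypothetical stub, since real speaker identification would need a mic array or per-speaker voice models:

    # Rough sketch of the meeting-notes idea: listen in chunks, transcribe
    # each chunk, and tag it with a speaker label for later cleanup.
    # pip install SpeechRecognition pyaudio
    import speech_recognition as sr

    def identify_speaker(audio):
        # Hypothetical stub: real diarization would use a microphone array
        # or trained voice models; here we just return a placeholder.
        return "unknown"

    recognizer = sr.Recognizer()
    notes = []
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        for _ in range(10):  # grab ten utterances, then stop
            audio = recognizer.listen(source, phrase_time_limit=15)
            try:
                text = recognizer.recognize_google(audio)
            except sr.UnknownValueError:
                continue  # skip chunks the recognizer couldn't parse
            notes.append("[%s] %s" % (identify_speaker(audio), text))
    print("\n".join(notes))  # raw annotated transcript, to clean up after the meeting

The raw output would still need the post-meeting cleanup pass described above, but at least it starts out as searchable text.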
Check out my project, Spreza. We're not targeting voice input, per se, but automated transcription. Nonetheless, the correction issue persists in both use cases. We solve it via a homonym-suggestion drop-down box: double-click an error and replace it with the correct word.
But is it 3x as valuable? Anyone can talk a mile a minute and let the words flow out of their face-hole as fast as their brain can string words together.
Slowing down to think about and review what you're communicating is a feature, not a bug, of text.
How is that even relevant? It's not audio vs. text, it's speech vs. typing. Just use speech-to-text, and you can see your text on the screen before you give it to Siri or your friend. Still better than typing.
While voice commands may be better than using the keyboard, they're still distracting and dangerous. It is best to avoid interacting with your phone at all while you drive.
I think a world of all-speech interfaces would be flawed (if that is the natural conclusion of this line of thinking). I may be able to speak 3x faster than I type, but I read 10x faster than I can listen to someone speak. Speech-to-text is a good replacement for typing, but that doesn't imply to me that it should supersede visuals.
Why would embracing one destroy the other? I think a combination of gestural interface, subvocalization voice recognition, and AR could be the big winner in our lifetimes. The text won't be bound to a screen, you don't give anything up, you just gain.
When you need to get into serious writing or bulk data entry, maybe it would be a keyboard.
Soldiers often use throat microphones, so you can speak so quietly it's effectively silent.
I wonder if a front-facing camera on a phone could capture your throat in sufficient detail to decipher what you say, even if you speak silently, while holding the phone in your palm as people do when browsing.
Here's something positive: This kind of research might indirectly make people healthier and save a lot of lives.
Calorie tracking is a proven way to lose weight, but all the tracking apps that I tried feel like they take too much effort. So I came up with an idea a few years ago: you should be able to just say what you had eaten out loud, and then use speech-to-text and NLP to search for each item and count up the calories.
It works amazingly well. Not only is speech 3x faster than typing, it's also much faster to have a free-form text field that is automatically parsed. And they integrate with Amazon Echo, which I look forward to trying out.
I've been thinking about many other ways to automate calorie tracking. For a while I thought the answer would be an AI that recognizes photos of food, but that doesn't feel important any more. I think speech-to-text takes approximately the same amount of time as opening the camera and snapping a photo. There have been a few minor errors with Apple's voice dictation, but so far I've seen 100% accuracy from the actual text searches.
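For what it's worth, the free-form parsing half of this is simple enough to sketch. A toy Python version, with a made-up calorie table and deliberately crude matching (a real app would need proper NLP for quantities and units):

    # Toy sketch: parse a dictated meal description and sum calories.
    # The table values and the matching are illustrative only.
    CALORIES = {"milk": 103, "egg": 78, "toast": 75, "apple": 95}  # kcal per serving

    def log_meal(dictated):
        total = 0
        for item in dictated.lower().replace(" and ", ",").split(","):
            for food, kcal in CALORIES.items():
                if food in item:  # crude matching; ignores quantities like "two"
                    total += kcal
        return total

    print(log_meal("a glass of milk, two eggs and toast"))  # 103 + 78 + 75 = 256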
So anyway, speech-to-text research. It's all important.
Although I kind of wish Apple hadn’t baked speech recognition into their parsing algorithm. Not everything I say will be in their dictionary, and there are times I’d rather just be able to type the query in myself than have Siri misunderstand it and then have to manually retype the parts that were misinterpreted.
One of the amazingly dumb things about my Nexus 6 is that when it's listening for my voice it still plays incoming SMS message tones that it then takes in as part of speech recognition and messes it up. Why? Really, just why?
I am a native English speaker and a second language Mandarin speaker who learned at early adult age.
I personally consider typing English and Mandarin to be very different. There are linguistic, psychological, and cultural issues at play.
Firstly, the dominant Chinese input system is phonetic (pinyin), which perhaps implies some kind of different mental state when typing.
Secondly, it is the case for adult learners like me, but reportedly also for many native speakers, that precise Chinese characters are easy to forget. People have visual memories of 3-10,000 characters, but can perhaps confidently write as few as half of them from memory. The phonetic input system presents a context-based shortlist of suggested characters, and the user is asked to select the one they intended.
Sometimes, in more extreme cases, particularly for native speakers with heavy accents or new second language speakers, users may be unsure which character to select or may even input an incorrect but close phoneme, scan for visual recognition of the correct character, fail to find it, then type a different phoneme.
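For readers who haven't used a pinyin IME, the select-from-candidates flow described above looks roughly like this toy Python sketch (the candidate table is a tiny hand-picked illustration; real IMEs rank candidates by frequency and sentence context):

    # Toy pinyin input flow: type a syllable, pick the intended character.
    CANDIDATES = {
        "ma": ["妈", "马", "吗", "骂"],   # one syllable, many characters
        "shi": ["是", "十", "时", "事"],
    }

    def suggest(pinyin):
        return CANDIDATES.get(pinyin, [])

    for i, char in enumerate(suggest("ma"), start=1):
        print("%d. %s" % (i, char))  # the user picks by number, as on a phone IME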
Frequently, typing Chinese is the only major creative interaction that Mandarin speakers have with modern Chinese text, since handwriting is becoming increasingly rare outside of a school or government-form context.
Yeah and it's even slower to listen to. I still have a 50+ second long recording from someone that I haven't bothered to listen to, because if they're too lazy to type it out, why should I be expected to wait nearly a minute to listen to it?
Whenever I try to use speech recognition on my Android phone, it's frustrating because when it recognizes the wrong word, there's no way (that I know of) to delete a word or move the cursor.
Also when I try to put a period in (we call them "full stops" in Australia), half the time it inserts a period, and the other half it literally writes "period".
I would like to see some documentation on how to use it better; I had a go at googling for some a while back but couldn't find anything useful. I ended up with the impression that such editing commands weren't implemented.
I still have a go every now and then to see if it's improved with updates, but it's basically unusable in its current state.
Unfortunately, speech recognition has been used for mass surveillance and is likely to be abused in the future. There needs to be a hardware control on microphones that allows the external user to control whether they are being listened to.
I agree 100%. Right now we don't even know the number of microphones our devices have, much less the contexts in which they may be active. Speech recognition protocols should also integrate low-level encryption standards to ensure access is only provided to trusted parties.
Is this surprising to anyone? I can speak way faster than I can type even under the best of conditions, let alone on a shitty mobile keyboard. I'm surprised it's only a factor of 3. I tend to avoid writing stuff on mobile and wait until I get home, because it just feels so painfully slow compared to a real keyboard.
I see a bunch of comments disputing the result. To those people, do you really think that typing on mobile is as fast or faster than speaking?
It's comparing the two as viable inputs on a smartphone. Speech recognition is far from perfect and doesn't yet allow you to talk like you would to a human. Did you read the article?
I can type anywhere, but there are a lot of places where speech is impossible/inconvenient. I wouldn't want to work in an office with everyone shouting commands at their computer. Audio feedback doesn't work if I'm listening to music from another device. The computer can't hear me if I'm playing music, etc.
I have to type a single number into Google Maps for it to auto-complete to my home address. That's not really a beatable speed.
And there is the tiny problem that I can't even pronounce some of the things I've typed into Google. Examples include: Japanese manga names, company names, and foreign names from news articles.
From the paper, the IST on the graph stands for "Initial Speech Transcription", and those data points are for the speech-input text before any corrections were made. The other "Speech" data-points include time to make corrections either by using the keyboard or using speech recognition.
Error correction is more intuitive for typing - typing errors are usually obvious transpositions or missing characters, which our brains easily correct. When a speech recognition engine makes a mistake, it's less obvious to the user and usually more jarring: the output is an actual word, just wrong in context. This makes it more difficult for the user to decipher. So even if speech is more accurate or faster, the cost of inaccuracy seems higher.
People will never do it; it's like saying you should control your phone with your genitalia in public. I don't think it'll break through social norms before better methods are available.
But at the end of the day typing speed doesn't matter. It's not what's limiting us (in English; not all languages).
But I think many people equate speech recognition with language parsing, which is why people seem to be obsessed with it.
As an aside, I use 9-key swipe input for Japanese and it's really great once you're used to it. I feel like there's still some way to input English we haven't found yet which really suits the language..
I used to like LibreOffice/OpenOffice.org predictive input, and I still sometimes wonder why Windows doesn't provide system-wide text prediction like we have on mobile devices.
I once saw someone using a "radial" keyboard on an Android device. Unfortunately at the time I saw it, the company had vanished.
I'm using SwiftKey (the non-online version), and I can type pretty quickly after the keyboard has been "trained". I think a combination of prediction and a better keyboard layout would help input speed quite a lot.
I just tried Google voice and was surprised hour fast it was. The only problem for me read with punctuation. It always seems to type the word rather than using the correct punctuation mark.
So far this has taken me 1.25 minutes using swipe input on Google b keyboard. Couple of errors and issues (probably because i use my fat thumbs for entry).
I I just trying to Google Voice and was surprised how fast it was. The only problem for me was with punctuation. It always seems to type the word, rather than using the correct punctuation mark. So far this has taken me 37 seconds using voice input from Google keyboard period couple of errors and issues open bracket probably because I use my fat from the entry for my voice close bracket period
I'll have to try when I get home, but I'm curious how speech recognition will do against the message chosen for the touchscreen record. I'm not 100% sure how to even pronounce a couple of the words:
http://www.guinnessworldrecords.com/news/2014/5/fastest-touc...
I wonder what it is with a normal keyboard? I think I can type faster than I can talk. I never learned the keyboard, but my fingers know where all the keys are. I can't tell you the order of letters of the qwerty keyboard, but my brain knows where all the keys are.
You can't type faster than you can talk. Trained stenotypists come close, but there's nobody who can consistently hit 130-150 wpm on a QWERTY or DVORAK keyboard.
You can - if 150 wpm is what you speak, with Plover and training you can reach 120-225 wpm on a qwerty keyboard with NKRO capability. But that's indeed steno (on a regular keyboard). With regular strokes and autocomplete you might reach 70 wpm or more.
https://www.youtube.com/watch?v=Wpv-Qb-dB6g
When I'm at home or in my car I use speech to write text messages and perform google searches all the time. It's so much more convenient now that the voice recognition is so reliable these days :)
It would be great if I could register a voice mapping for certain words, regardless of language.
e.g. pawned => pwned, snafu => snafu, and pronunciations from another language mapped to some foreign word
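One way to approximate this today would be a personal substitution pass over whatever the recognizer emits. A minimal Python sketch (the table entry is illustrative):

    # Sketch of a personal "registered words" pass applied after recognition:
    # map what the recognizer tends to hear to the spelling you actually want.
    import re

    REGISTERED = {"pawned": "pwned"}  # heard -> wanted; illustrative entry

    def fix_transcript(text):
        for heard, wanted in REGISTERED.items():
            # word boundaries so "pawned" doesn't match inside longer words
            text = re.sub(r"\b%s\b" % re.escape(heard), wanted, text)
        return text

    print(fix_transcript("i got pawned again"))  # -> "i got pwned again"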
Hmm...for me typing AND speech recognition on mobile devices are horrible. To know that one is slightly more horrid than the other is small consolation.
[1] http://www.yorku.ca/mack/PhraseSets.zip