Best to say "yes! but only some of the time". It's something we're working on right now. You can be 80% accurate, by some metric, but that's usually still not good enough to pass a human's sniff test. Good speaker-labeled audio in various settings is hard to find.
There are several ways to look at this problem too.
L1: exact speaker is known (voiceprint) and can be picked from all humans with accuracy, even when others are talking
L2: exact speaker is known from a subset of people, even while talking in a conversation with others
L3: speaker1,2,3,... are identified accurately
L4: speaker changes are identified accurately
L1 is a really hard problem. L2 is fine if you don't care about the time domain (knowing exactly when they spoke), but is harder if you have to accurately detect changes. L3 is about as hard as L2, but the big goal isn't who anymore, it's when. And L4 is easier, kinda like putting line breaks in when a human is transcribing a file. Not too bad. All of them need better data sources.
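To make the L4 case concrete, here's a toy sketch (nothing like a production diarizer; the embeddings and threshold are made up for illustration): compute a feature vector per fixed-length audio window with whatever speaker encoder you have, then flag a change wherever adjacent windows diverge.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def change_points(embeddings, threshold=0.5):
    """Flag window boundaries where adjacent embeddings diverge.

    `embeddings` is one feature vector per fixed-length audio window
    (from any speaker encoder); a large distance between neighbours
    suggests a speaker change at that boundary.
    """
    return [i for i in range(1, len(embeddings))
            if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold]
```

With toy 2-D "embeddings" where the first two windows point one way and the last two another, the only boundary flagged is between windows 1 and 2. Real systems smooth over many windows, but the "line breaks" framing really is this shape of problem.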
We do custom models (train the full DNN, not just tack on a new text language model) using transfer learning and it works for small numbers of examples too.
Glad to hear you asking about fuzzy search. That's something we do (it's actually what Deepgram started on!). It's not in the docs at the moment (tends to confuse people who are looking for transcription, we're working on how to present it in a better way). You can submit with queries and get back confidences and timestamps.
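Since the fuzzy search endpoint isn't in the docs yet, here's a rough sketch of the idea in plain Python (the function name and the transcript shape are illustrative, not our API): slide the query over a word-level transcript and return a confidence plus timestamps for each hit.

```python
from difflib import SequenceMatcher

def fuzzy_search(words, query, min_confidence=0.7):
    """Fuzzy-match `query` against a timestamped transcript.

    `words` is a list of (word, start_sec, end_sec) tuples, the shape a
    word-level transcript might give you.  Returns a list of
    (confidence, start_sec, end_sec) hits above `min_confidence`.
    """
    q_words = query.lower().split()
    n = len(q_words)
    hits = []
    for i in range(len(words) - n + 1):
        window = " ".join(w for w, _, _ in words[i:i + n]).lower()
        score = SequenceMatcher(None, window, " ".join(q_words)).ratio()
        if score >= min_confidence:
            hits.append((round(score, 2), words[i][1], words[i + n - 1][2]))
    return hits
```

The real thing matches in acoustic/phonetic space rather than on transcript text (that's the whole point: it survives transcription errors), but the request/response shape (query in, confidences and timestamps out) is the same idea.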
Many times the model doesn't need any training but it does increase accuracy if you do training and can get really good if it's focused (it's a lot like wake word detection -- we don't offer WWD as a real product yet either, just saying the challenges are similar). Best thing to do is search for phrases if you can, that really helps signal/noise.
It's a metric that's hard to nail down because there is so much parameter space that you are flattening into one number. Also it doesn't address the "I care about these five high value words (that are made up), can you recognize them?" like product names and company names.
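For reference, the number that all of that parameter space gets flattened into is just word-level edit distance. A minimal WER implementation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note how blind it is: getting a made-up product name wrong costs exactly as much as getting "um" wrong, which is exactly the "five high value words" problem above.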
There's ~4 types of audio:
Phone call
- close microphone
- conversational
- low bandwidth audio
- two way conversation
- more industry specific terminology
Meetings
- 2-5 people
- conversational
- far away mic
- better bandwidth audio
- more industry specific terminology
Broadcast
- usually good diction
- close mic
- good bandwidth audio
- more general terminology
Command&Control (saying to your phone: "go to <this address>")
- close mic or array of mics far away
- short audio chunks, 2-10 seconds
- spoken in a way that makes it easier to recognize (learned behavior)
- usually a lot of widely known named entities are said
In that full aggregated lineup I bet we'd be in the 22-24% WER pack. That'd mostly be because we focus only on phone calls and meetings; we don't try to improve the command & control/broadcast/podcast types yet. Broadcast because it's perceived as lower value, so customers tend not to pay for good recognition there (we do train models to make them better for specific customers/verticals, usually a 20-40% reduction in errors, but for now the buyer has to have a budget for it; there are ways to make it cheaper in the long term). Command and control because you have to have a fleet of devices out in the field collecting data and driving use cases, and we don't have customers there yet.
I guess maybe a better way to ask is which acoustic environments do you excel in?
In terms of gathering data, I'm curious how you plan to get the 15K audio hours it takes to train each of these models. The more you want to segment it (like by acoustic environment or gender), the more data you need. Do you have a cheap way of generating high quality data?
I didn't answer "Do you have a cheap way of generating high quality data?". We have good ways to do it. They're not that cheap though. It's expensive (organizationally and real $$$) to label large amounts of data no matter what.
But we do utilize our capabilities to better tackle the wild data gathering and labeling. For instance, "is every labeled minute just as valuable as any other?". Definitely not. So if you can find and select only the data you want to label, rather than indiscriminately labeling a bunch, then you can increase your overall efficacy.
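A toy sketch of that "not every labeled minute is equally valuable" idea (the clip fields and budget logic are made up for illustration; this is plain uncertainty sampling, not our actual pipeline): spend the labeling budget on the audio the current model is least confident about, since that's where a new label teaches it the most.

```python
def select_for_labeling(clips, budget_minutes):
    """Pick which unlabeled clips a fixed labeling budget should go to.

    `clips` is a list of dicts with 'id', 'minutes', and 'confidence'
    (the current model's average confidence on that clip).  Lowest
    confidence gets labeled first, until the budget runs out.
    """
    chosen, spent = [], 0.0
    for clip in sorted(clips, key=lambda c: c["confidence"]):
        if spent + clip["minutes"] <= budget_minutes:
            chosen.append(clip["id"])
            spent += clip["minutes"]
    return chosen
```

Indiscriminate labeling spends most of its budget on minutes the model already handles; this kind of selection is how the same dollars buy more model improvement.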
If you're training from scratch, around 10k hours is needed to get a good model, but when you're transfer learning you don't need nearly that much (100 hours gets you a lot).
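As a cartoon of why the data requirement drops (nothing like a real DNN, just the freezing idea in one dimension): keep the pretrained "encoder" weight fixed and fit only the small head on top using the new, tiny dataset.

```python
def fine_tune_head(data, encoder_w, head_w=0.0, lr=0.01, epochs=200):
    """Transfer learning in miniature.

    The 'encoder' weight is frozen (imagine it was trained on 10k hours);
    only the head is trained on the new data.  `data` is a list of (x, y)
    pairs and the model is y_hat = head_w * (encoder_w * x).
    """
    for _ in range(epochs):
        grad = 0.0
        for x, y in data:
            feat = encoder_w * x              # frozen featurizer
            grad += 2 * (head_w * feat - y) * feat
        head_w -= lr * grad / len(data)       # update only the head
    return head_w

def mse(data, encoder_w, head_w):
    return sum((head_w * encoder_w * x - y) ** 2 for x, y in data) / len(data)
```

Because only the head's few parameters move, a handful of examples pins them down; that's the intuition behind 100 hours going a long way when the full-from-scratch recipe wants 10k.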
We excel in phone call and meetings settings. I.e. the typical sales/office/support environment.
Baidu trained their DeepSpeech model with 6000 hours of English to get a model similarly accurate to Google/Microsoft; it may just be the type of quick model you're using that needs 10k hours to achieve good results.
Mozilla's DeepSpeech is quite interesting, languages like Turkish can get a decently usable (~20% WER) model with just 80hrs of training data (no transfer learning, starting from a clean slate).
Yep, all good points. One thing to consider is that generalization is a big problem. It's easy to get good on a specific dataset nowadays (like 5-10% word error rate level on academic datasets), but that same model might do 40% WER on data in the wild.
This is a seriously fertile area where you get to "define the new interface".
It's a big problem though, since few buyers know they want those things. Around 95% of customers come into it with "give me the transcripts" and discover over time they want these other things too (some graphical, some technical). They just didn't know it was available.
New GUIs and data representations is a big part of it. Getting accuracy and scale in place is a big part. Building awareness and distribution of what's possible now is another big part.
Re: JSON Error; We fixed that doc link error you saw (it was pointing in the wrong place since we _just_ updated it).
Tableau (and the general business analytics space) have done a good job at reframing the problem as: "don't think about what you want as a leader at a company; instead, democratize data access so your team can decide what it wants, and pay for democratization not for your own features." See for instance: https://www.forbes.com/sites/briansolomon/2016/05/04/how-tab...
Arguably Elastic is a success story about bridging the worlds of an API-first technical stack with a democratized non-technical analytics framework. And they started by just powering excellent search, and building value-add layers over time. But they built into a then-vacuum of API offerings, whereas there are many other (potentially inferior, but well-funded) speech-to-text APIs. I'll be avidly following you guys as you navigate the space, and hopefully you're able to find some good "hooks" or uniquely-easy-to-roll-out integration stories that strike a balance between focus on technical excellence and driving awareness in a super-linear way.
Price starts at $1/hr, billed in 1-second increments. Frequently we charge less than that, since the price drops with volume, and typically businesses have a steady amount running through (a few thousand hours). Medium usage scale would be $0.25-0.75/hr (e.g. 10,000 to 100,000 hours a month). Large usage is around 10,000 hrs+ per day, and the price per transcribed hour can be much lower (like $0.15/hr).
That's the ballpark for cloud + batch mode. If it's cloud + realtime it's a little more. If you need it on premise it's a little more (we work with integration partners to do parts of it).
Pricing for speech is interesting since there's more than just how many $/hr in the equation. Usually businesses care about turnaround time, throughput, failover/availability, and a collection of features. So we usually want to talk about those goals and price accordingly to support 'em. I definitely wish I had a better way to frame it than "it's complicated"!
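If it helps to ballpark things, the volume tiers above can be sketched as a function. The breakpoints and the mid-tier rate here are my rough illustration (real pricing is negotiated per contract, and turnaround/throughput/features move it around), not a rate card:

```python
def monthly_rate_per_hour(hours_per_month):
    """Illustrative per-hour rate from the volume tiers described above."""
    if hours_per_month >= 300_000:    # roughly 10k+ hours/day
        return 0.15
    if hours_per_month >= 10_000:     # medium usage; actual band is $0.25-0.75
        return 0.50
    return 1.00                       # starting price

def monthly_bill(hours_per_month):
    """Total monthly cost at the illustrative rate."""
    return hours_per_month * monthly_rate_per_hour(hours_per_month)
```

So a shop pushing 50k hours a month would land around $25k/month under these made-up breakpoints; the real number depends on the conversation about goals.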
So basically a $10k monthly commit is required to train a custom model? Would it be possible to pay for the training itself if you are a lower volume user?
Pretrained models are very nice to spin up a system and start using it. You need that because training the model is so hard. But, the pretrained models are by no means a good general model. They are trained with narrow params (like #/size of datasets), so the ability to generalize is very low. It's not uncommon to train a system like that to be 5% WER on Switchboard (think MS/IBM), then have that same system perform at 40% WER on other audio.
You're doing awesome (arduous) work. The text normalization especially is a total bear. I feel your pain. Limiting your text to one file is good in many ways because it allows you to scope down the amount of work needed to do a comparison (it's a big systematic risk, but hey, there are only so many hours in the day).
Your previous blog post helps in understanding how much work needs to go into comparing speech services. It's super common to undervalue just how much processing a human is doing innately while listening to audio; hearing words, feeling out ideas, resolving ambiguities, etc. So, it's awesome to see deep work into it (besides the speech teams working on these problems like at Google, Baidu, Microsoft, Deepgram [btws I'm a founder of Deepgram]).
I wouldn't be so quick to say the differences in WER should be attributed to how 'modern' the system is. It's more about the areas they play in; what audio type they care about, what training datasets they use, what post processing they do, and language models they choose to apply. (Speed/TurnAroundTime gives you a much better indication of how modern a system is.)
For many speech transcription systems, they focus on specific types of audio as their target market. There are ~4 main types: phone (customer support/sales), broadcast (news/podcast/videos), command and control (siri/google assistant), and ambient (meetings/lectures/security).
Google's video model is perfect for what you are doing (broadcast/podcast, 2 dudes talking into probably pretty good mics).
In other instances the results will be very different (if you compared phone calls, for example). It won't be different just in accuracy, but also speed (throughput and latency), price, and reliability.
It's awesome to see an in depth comparison being discussed broadly. Speech interfacing and understanding is just getting started. We're still at the tip of the Intelligence Revolution and there's still a long way to go. The scale of compute and data is huge, even to bring just one language up to snuff.
Aside: It's a dirty little secret that there actually aren't 20 different speech recognition companies in the world using 20 different systems. There are only a handful (many use Google and tweak the outputs). They are mostly doing one of five things: using old and aged tech; using old but well-oiled tech (like Google; this takes a ton of manpower and no other company spends the money to do it); using an open source spinoff (like Kaldi or Mozilla); building their own from scratch (like Deepgram); or reselling someone else's.
If you care about current times, this is a reasonably good finger in the wind in Sept. 2018:
Use Google if you are doing command and control or broadcast audio, do not use Google if you are doing meetings or phone calls or you need a reliable system (it's unreliable at scale). DO use Google in all cases (even phone/meeting) for audio that is in a language other than English (no other company is even close).
Use Google to prototype systems and teach yourself about how to use a speech recognition API and what results to expect as a baseline.
Do not use Google if you need scale and speed and reliability and affordability.
Do not use Google if you need to use your own vocabulary or if your audio has repetitive things being said in it that have accents or jargon (like call centers). In that case, use a company that can do a true custom acoustic model and vocabulary for that (like Deepgram). There are only a few companies that will consider doing this (and Google is not one of them).
Expect that many more things are going to be addressed.
Think of it like: what can a human do?
A human can jump into a conversation and quickly tell you: there are 3 people, speaking about rebuilding a feature in the main code, two people are male, one is female, male1 and female1 are doing most of the talking in the beginning, then it's the two dudes at the end, it sounds like the recording is of a meeting they are having, they never came to a definitive conclusion and next steps, they spent 80 minutes in the meeting. All of that (and I'm sure more) will be done by machine in the future.
Anyone able to find any data on time 'til accuracy? I don't see it (even in the video linked in another comment).
Sure it's nice to "achieve 90% scaling efficiency" in images/sec, but images/sec alone doesn't get you where you want to go. Increased accuracy per unit wallclock time is what you want.
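Time-to-accuracy is easy to compute if runs log checkpoints; a minimal sketch (the log format here is assumed, just (seconds_elapsed, accuracy) pairs):

```python
def time_to_accuracy(log, target):
    """Wallclock seconds until a training run first reaches `target` accuracy.

    `log` is a list of (seconds_elapsed, accuracy) checkpoints in time
    order; returns None if the run never got there.  Comparing this number
    across hardware setups says more than images/sec alone.
    """
    for seconds, acc in log:
        if acc >= target:
            return seconds
    return None
```

Two systems with identical images/sec can have wildly different curves here (e.g. if large-batch scaling hurts convergence), which is exactly why images/sec alone doesn't get you where you want to go.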
The dark matter is a gas, not a solid. It's not orbiting the sun, it's orbiting the galaxy (technically, not even the galaxy; it's orbiting inside its own halo mostly). The orbits of the particles would be highly irregular, going in all directions, many with very eccentric orbits, etc. The sun is moving through the "rest frame" of the galaxy at a ~constant speed, but the earth is orbiting the sun, so it has a fractionally varying relative speed compared to the rest frame (speeding up and down every year relative to the rest frame).
We still come back to this for fun. The original device was an intel edison but recent variants have been based on the raspberry pi zero w.