The video is one of the best I've seen, really makes me excited for the product.
As for the product itself, I think the biggest "feature" is the ability to cut the audio by cutting the transcript, which makes it easier to quickly edit files. Transcription is pretty common; the dubbing also sounds interesting, but it depends on how good the quality is.
I think the use-case for this is not YouTubers who expect high quality, but social-media users who want to generate more average-quality content in a short amount of time.
That is BY FAR the best product video I've ever seen. Wow. "Really makes me excited for the product."... well said! I think that's what marketing is supposed to do.
I don't even have to look at the credits to know who put the video together - https://sandwichvideo.com/
They are f*king awesome and produce the best ads/product videos I've ever seen in my life. Unfortunately extremely expensive (the cost of 2 Ferraris) - unaffordable unless you are a SV startup that has raised $millions :)
Yes! Extremely helpful and time saving. The app has changed since then but a few years ago I had a blast using it.
You can also indicate who is speaking as you generate a transcript. I would then export a closed caption file and use ffmpeg to generate the video based on who was talking before merging it with the audio (think no-budget v-tuber: just a different image on screen depending on whose voice is playing)
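A minimal sketch of the first half of that workflow: parse a speaker-tagged SRT caption file into (start, end, speaker) spans, which you could then feed to ffmpeg as one still image per span. The `[Name]` tag convention and the file contents here are hypothetical, just to illustrate the idea:

```python
import re

# Matches an SRT timestamp line followed by a "[Name]" speaker tag
# (hypothetical convention -- your export may tag speakers differently).
CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n\[(\w+)\]"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def speaker_spans(srt_text):
    """Return a list of (start_sec, end_sec, speaker) tuples."""
    spans = []
    for match in CUE.finditer(srt_text):
        start = to_seconds(*match.group(1, 2, 3, 4))
        end = to_seconds(*match.group(5, 6, 7, 8))
        spans.append((start, end, match.group(9)))
    return spans

srt = """1
00:00:00,000 --> 00:00:02,500
[Alice] Hey, welcome back.

2
00:00:02,500 --> 00:00:05,000
[Bob] Thanks for having me.
"""
print(speaker_spans(srt))  # [(0.0, 2.5, 'Alice'), (2.5, 5.0, 'Bob')]
```

Each span then becomes a still-image segment of the right duration (e.g. `ffmpeg -loop 1 -i alice.png -t 2.5 ...` per span, concatenated and merged with the audio track; `alice.png` is a made-up filename).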
The download page doesn't show anything particularly helpful if you're on an unsupported platform (e.g. Linux). From the source:
window.onload = function detectOS() {
  if (navigator.userAgent.indexOf("Mac") != -1) window.location.replace("/download/mac");
  if (navigator.userAgent.indexOf("Win") != -1) window.location.replace("/download/windows");
  if (screen.width <= 992) { window.location.replace("/download/other-device"); }
  return undefined;
}
It's a very cool product, which I've only used briefly.
However, product aside, their promotional videos are _phenomenal_. Not sure if they are making these in-house or some company is putting them together, but someone is doing a great job.
But--potentially dumb question--where the heck would you encounter these videos in the wild? These sadly aren't e.g. the types of video I see as Youtube ads.
“Typically, $200K is a good starting point, though we’ll also work with a lot more. If you’re a startup with an idea we absolutely have to get behind, we can get creative with equity.”
Depends entirely on who you hire and how much you let them know about your budget. I am confident that I saw an HN post years ago that mentioned Sandwich starts at $250k, but again, client work is almost always very negotiable.
Wow, it’s good but $250k is a shedload of cash, you could hire 3 people in this field for a year for that (certainly in London). They could do all your training material too...
I’m not sure you’ve noticed, but there are more actors than there are jobs. The actual shoot time for something like this is, what, 2-3 days? Let’s budget a crazy $50,000 for those things (in reality, you’d be hard pushed to spend $20k).
The FAQ answer that says $200k should probably be authoritative, but... I'm just guessing Sandwich is actually surprisingly flexible about budgeting.
They do have a series where they shot three videos at three different budget levels ($1k, $10k, $100k) for the same product (https://sandwich.co/clients/wistia/). Take from that what you will...
It looks like I might be able to do this (speech recognition) in less than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech
> Are there any really good open source speech to text programs?
I've looked into the field this year (exploring building a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).
There are no good full-text(!) speech recognition programs in general. This is in contrast to single-sentence speech recognition, which is decent.
Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at minimum under-productized), like sentence boundary detection, punctuation, etc.
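A toy illustration of why sentence boundary detection is harder than it looks: the "obvious" rule (split on a period followed by whitespace) misfires on abbreviations, which transcripts are full of. This is just a sketch of the failure mode, not anyone's production approach:

```python
import re

def naive_split(text):
    # Split after any period that is followed by whitespace -- the naive rule.
    return re.split(r"(?<=\.)\s+", text)

text = "We met Dr. Smith at 5 p.m. yesterday. It went well."
print(naive_split(text))
# The naive rule also splits after "Dr." and "p.m.", producing 4 "sentences"
# where a human reads only 2.
```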
Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).
Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.
Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google speech recognition toolkit, with support mostly for end-to-end models.
RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).
To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps for the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
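For reference, word-error-rate as mentioned above is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```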
The user quotes sound almost too good to be true. When I think of the average YouTuber discussing video editing software, the words "best productization of machine learning I've ever seen" do not come to mind.
I described it in multiple calls at Stripe today as "The most impressive demo I have seen in years", and I think you can reasonably understand that I didn't join HN 10 years ago as part of a stealth mission to a) get hired by Stripe and then b) talk up audio editing software to them so that my shadowy masters got that sweet sweet podcast budget.
I just downloaded it and tested it out. It really is a very cool, polished product.
The video cuts when you remove "uh" and "um" are too jarring for anything professional, but OTOH for some internal work stuff I could see a use for this. And as someone who has used 0 video editing software, I was able to start making changes within like a minute, which I think is quite impressive.
That said, not sure I have enough use case to pay for it, personally.
NB: it's kind of impossible to prove you're not a sock puppet, because of _course_ a sock puppet account would say they're not a sock puppet account.
If I could revise my previous comment: I didn't mean to imply the feedback wasn't from a legitimate user. Rather, the comment was very technical in what it had to say about the software.
Presumably, there are technical people that use video editing software. It's fine. But, that type of highly technical person may not be the target persona of this software.
HN discourages calling things shills, so I wanted to provide a clearer explanation of my best interpretation.
This looks amazing for audio. There's no doubt that this will massively improve podcasts, radio, etc.
I can't imagine watching a video that's been chopped up like that would be a particularly nice experience though. Editing video to remove unwanted sections and not have it look like people's heads are jumping around weirdly is really hard. Cuts are really noticeable. If they've managed to fix that with ML it's going to have a huge impact on the cost of video production.
I watched a YouTuber go through the whole flow of producing a video using Descript and it doesn't do anything in terms of real video editing. It's useful for removing bad takes from a longer video, but then you need to export the chopped-up product and spend a lot of additional time fixing it up and making it flow seamlessly in a tool like After Effects or DaVinci Resolve.
I wonder if deep learning algorithms such as worldsheet [0] would help in simulating multiple angles, so the program can switch from one angle to another on cuts, to make them less jarring ...
In the video you linked it looks very natural. I think that's mainly because the cuts are between sentences. Not sure if it would still look good with mid-sentence stop words cut out.
Our stack isn’t super novel. We’re trying to make sure we don’t write ourselves into a corner with a stack that no one else uses, so we use Typescript/NodeJS + Kubernetes for much of our backend services, and Python for our ML pipelines.
On the client side, we use Typescript and Electron, which allows us to have most engineers work across the entire stack and cross-platform.
We don't use your Project Information for anything other than providing the service we offer — e.g. we don’t sell it; we don’t use it for marketing; we don’t use it for advertising.
This is a strong statement, which is nice. Only covers the projects themselves, though.
Under what's shared:
Google Cloud Speech-to-Text to provide automatic transcription
Google only accesses or uses your data to complete the automatic transcription service. Shortly after completing the service, Google deletes your data from its servers.
As the only HIPAA-compliant automatic transcription service, Google is an extremely privacy-friendly transcription service.
And then there's a bunch of other integrations with Google/AWS/others.
I understand some of the issues were fixed bugs, but I don't know if selling Google integration as a privacy choice is honest, given Google's business model.
There's a difference between Google's consumer-facing products, where all these services are offered for free, vs. enterprise clients using GCP - it would be surprising, to say the least, if they didn't honor these terms with their Cloud Speech-to-Text offering.
I've had to do a lot of videos of my talks this year for virtual conferences. This would be useful to eliminate all of those ums and ahs. My 20-minute presentation would likely get cut down to 15...
I'd be interested to know the impact of cutting out the ums on the pacing/cadence of a talk.
From the pricing page, it looks like even if you pay for the most expensive option, you still only get to remove their watermark for 30 minutes! If you are paying for the basic service (NOT free) you can only remove their watermark for 2 minutes!
Apart from that, it looks like an interesting product.
If you hover over the Audiograms Pro help icon on the right, (midway down in the Pro plan card) it says "Custom colors, background images, and remove the Descript logo. 30 min max size (vs. 2 minutes)"
This is not correct. The security page only shows that they use Google for Speech to Text.
Instead, Text to Speech is done using technology developed by Lyrebird.ai, which Descript acquired and rebranded as Overdub. Note that style transfer learning of voices is a hard problem and Overdub seems to have nailed it. I speculate that the underlying technology of Overdub is based on sv2tts (https://arxiv.org/abs/1806.04558).
Looks great. Would love to be able to use it without signing up for an online account :(
That is annoying in the first place; then, when you try to sign up, the password field has finicky rules like requiring upper+lowercase+numbers+symbols and min length 10.
This is compounded by the fact that it does not integrate with Chrome so I can't easily get an auto-generated strong password stored with all my other ones. Nor does it allow using a Google or Facebook account to signup.
Descript is an amazing product and one of the main inspirations for our startup and product : www.spoke.app
We offer the same editing-by-highlighting-text service, but differ in that we offer video + direct capture of your content (both your microphone and system sound). Our goal isn't so much to produce nice, clean-cut videos as to summarize any video conversation you might have for co-workers, friends, etc.
I thought that was neat too (and the video itself was awesome), however I do wish that they had hidden the gif when the video begins playing. Most modern machines won't mind, but my older machine didn't like having to play a video whilst also rendering an animated gif in the background - it managed, but the CPU shot up considerably.
Yeah, I thought the same; many people nowadays just call decision trees and algorithms "AI", but this is some really handy stuff out of the reach of decision trees and algorithms. It understands what you say and learns your voice so you can correct what you say. Truly useful, although AI is a big word for how they use Deep Learning, of course. But ok, if AI means "uses Deep Learning", I can live with that.
Have you watched the video? I disagree; it seems to sell itself pretty well. The promotional video isn't very long and it's pretty well cut (which is a good sign, for what they're trying to sell.) And it shows plenty of interesting features too, like overdub.
* Content-Based Tools for Editing Audio Stories [2] (the software is released as an open-source project called speecheditor [3])
* Text-Based Editing of Talking-head video [4]
* QuickCut: An Interactive Tool for Editing Narrated Video [5]
[1]: https://graphics.stanford.edu/~maneesh/
[2]: http://vis.berkeley.edu/papers/audiostories/
[3]: https://ucbvislab.github.io/speecheditor/
[4]: https://www.ohadf.com/projects/text-based-editing/
[5]: https://graphics.stanford.edu/projects/quickcut/