Descript – A collaborative audio/video editor that works like a doc (descript.com)
216 points by zaroth on Jan 5, 2021 | 79 comments


Maneesh Agrawala [1] and his group have done lots of similar work with the same basic idea of editing audio or narrated video as if it's text:

* Content-Based Tools for Editing Audio Stories [2] (the software is released as an open-source project called speecheditor [3])

* Text-Based Editing of Talking-head video [4]

* QuickCut: An Interactive Tool for Editing Narrated Video [5]

[1]: https://graphics.stanford.edu/~maneesh/

[2]: http://vis.berkeley.edu/papers/audiostories/

[3]: https://ucbvislab.github.io/speecheditor/

[4]: https://www.ohadf.com/projects/text-based-editing/

[5]: https://graphics.stanford.edu/projects/quickcut/


The video is one of the best I've seen, really makes me excited for the product.

As for the product itself, I think the biggest "feature" is the ability to cut the audio by cutting the transcript, which makes it easier to quickly edit files. Transcribing is pretty common, the dubbing also sounds interesting but depends on how good the quality is.
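
To make the "cut the audio by cutting the transcript" idea concrete, here is a minimal sketch of how I imagine it could work, not how Descript actually does it. It assumes each transcript word carries start/end times in seconds and that words are sorted by time; `keepRanges` and the data shape are hypothetical.

```javascript
// Hypothetical data shape: each transcript word has start/end times in seconds.
// Returns the audio ranges to keep once the words at `deletedIndexes` are cut.
function keepRanges(words, deletedIndexes, totalDuration) {
  const deleted = new Set(deletedIndexes);
  const ranges = [];
  let cursor = 0; // start of the stretch of audio we are currently keeping

  words.forEach((word, i) => {
    if (!deleted.has(i)) return;
    // Keep everything between the previous cut and the start of this word.
    if (word.start > cursor) ranges.push([cursor, word.start]);
    cursor = Math.max(cursor, word.end);
  });

  if (cursor < totalDuration) ranges.push([cursor, totalDuration]);
  return ranges;
}

const words = [
  { text: "so", start: 0.0, end: 0.3 },
  { text: "um", start: 0.4, end: 0.9 },
  { text: "hello", start: 1.0, end: 1.5 },
  { text: "world", start: 1.6, end: 2.1 },
];

// Deleting "um" in the transcript leaves two audio ranges to stitch together.
const kept = keepRanges(words, [1], 2.5);
console.log(kept);
```

The resulting ranges are what you would hand to whatever does the actual audio cutting.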

I think the use-case for this is not YouTubers who expect high-quality, but social-media users who want to generate more average-quality content in a short amount of time.


That is BY FAR the best product video I've ever seen. Wow. "Really makes me excited for the product."... well said! I think that's what marketing is supposed to do.


I don't even have to look at the credits to know who put the video together - https://sandwichvideo.com/

They are f*king awesome and produce the best ads/product videos I've ever seen in my life. Unfortunately extremely expensive (the cost of 2 Ferraris) - unaffordable unless you are a SV startup that has raised $millions :)


Yes! Extremely helpful and time saving. The app has changed since then but a few years ago I had a blast using it.

You can also indicate who is speaking as you generate a transcript. I would then export a closed caption file and use ffmpeg to generate the video based on who was talking before merging it with the audio (think no budget v-tuber, just a different image on screen depending on whose voice is playing).
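
Roughly, the glue step could look like the sketch below. It turns speaker-labelled segments into a list for ffmpeg's concat demuxer. The segment shape, the image file names and `buildConcatList` are all hypothetical, and it assumes the first segment starts at 0 so the images stay in sync with the audio.

```javascript
// Hypothetical input: speaker-labelled segments in seconds, sorted by start time,
// e.g. parsed from the exported caption file. Image file names are made up.
function buildConcatList(segments, imageForSpeaker) {
  const lines = ["ffconcat version 1.0"];
  segments.forEach((seg, i) => {
    const image = imageForSpeaker[seg.speaker];
    if (!image) throw new Error(`no image for speaker "${seg.speaker}"`);
    // Hold each image until the next speaker starts, so silent gaps don't cause drift.
    const next = segments[i + 1];
    const until = next ? next.start : seg.end;
    lines.push(`file '${image}'`, `duration ${(until - seg.start).toFixed(3)}`);
  });
  // Common workaround: list the last file again, since the final duration may otherwise be ignored.
  if (segments.length > 0) {
    const last = segments[segments.length - 1];
    lines.push(`file '${imageForSpeaker[last.speaker]}'`);
  }
  return lines.join("\n") + "\n";
}

const list = buildConcatList(
  [
    { speaker: "alice", start: 0, end: 4 },
    { speaker: "bob", start: 4.5, end: 9.5 },
  ],
  { alice: "alice.png", bob: "bob.png" }
);
console.log(list);
```

Saved as list.txt, that could then be muxed with the audio using something like `ffmpeg -f concat -safe 0 -i list.txt -i audio.wav -pix_fmt yuv420p -shortest out.mp4` (file names are placeholders).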


I think this has huge potential for the education market especially in post-pandemic world where remote learning is more or less accepted.


The download page doesn't show anything particularly helpful if you're on an unsupported platform (eg Linux). From the source:

window.onload = function detectOS(){
   if (navigator.userAgent.indexOf("Mac")!=-1) window.location.replace("/download/mac");
   if (navigator.userAgent.indexOf("Win")!=-1) window.location.replace("/download/windows");
   if (screen.width <= 992) {window.location.replace("/download/other-device");};
   return undefined;
}
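
For what it's worth, the same checks with an explicit fallback might look like the sketch below. `pickDownloadPath` and the `/download/unsupported` path are my own inventions, not anything from the site. Checking the small-screen case first is also my choice, since iPhone/iPad user agents contain "Mac" too.

```javascript
// Same checks as the snippet above, written as a pure function with a fallback.
// "/download/unsupported" is a hypothetical path, not one the site is known to have.
function pickDownloadPath(userAgent, screenWidth) {
  // Small screens first: mobile Safari user agents also contain "Mac".
  if (screenWidth <= 992) return "/download/other-device";
  if (userAgent.includes("Mac")) return "/download/mac";
  if (userAgent.includes("Win")) return "/download/windows";
  return "/download/unsupported"; // e.g. desktop Linux
}

const linuxUA = "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0";
console.log(pickDownloadPath(linuxUA, 1920));
```

In the page itself this would be wired up as `window.location.replace(pickDownloadPath(navigator.userAgent, screen.width))`.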


It's a very cool product, which I've only used briefly.

However, product aside, their promotional videos are _phenomenal_. Not sure if they are making these in-house or some company is putting them together, but someone is doing a great job.


Both this video (https://sandwich.co/work/descript-video/), and their original promotional video (https://sandwich.co/work/its-how-you-make-a-podcast/), were done by Sandwich.


Just watched a few more of their videos and they are indeed great: https://sandwich.co/work/category/featured/

But--potentially dumb question--where the heck would you encounter these videos in the wild? These sadly aren't e.g. the types of video I see as Youtube ads.


That's amazing, so fun to watch and feels 0% forced. I'm impressed.


Could anyone spit ball the cost of such a video?

I'll try: $100k (USD)

Higher or lower?


From their FAQ [0]:

“Typically, $200K is a good starting point, though we’ll also work with a lot more. If you’re a startup with an idea we absolutely have to get behind, we can get creative with equity.”

[0] https://sandwich.co/faq/


Depends entirely on who you hire and how much you let them know about your budget. I am confident that I saw a HN post years ago that mentioned Sandwich starts at $250k but again client work is almost always very negotiable.


Wow, it’s good but $250k is a shedload of cash, you could hire 3 people in this field for a year for that (certainly in London). They could do all your training material too...


I'm sure it blows up when you need actors and studios.


I’m not sure you’ve noticed but there are more actors than there are jobs. The actual shoot time for something like this is what 2-3 days. Let’s budget a crazy $50000 for those things (reality - you’d be hard pushed to spend $20k).


The FAQ answer that says $200k should probably be authoritative, but... I'm just guessing Sandwich is actually surprisingly flexible about budgeting.

They do have a series where they shot three videos at three different budget levels ($1k, $10k, $100k) for the same product (https://sandwich.co/clients/wistia/). Take from that what you will...


This is amazing, I wonder how I can do this offline, using open source tools.

Are there any really good open source speech to text programs? I imagine it's going to involve a pre-trained neural net.

[update] Following a thread https://news.ycombinator.com/item?id=20097542

It looks like I might be able to do this (speech recognition) slower than real time (because I don't have a GPU) using https://github.com/mozilla/DeepSpeech


> Are there any really good open source speech to text programs?

I've looked into the field this year (exploring to build a product in a similar niche to Descript), but everything I've encountered and tested is severely lacking (including Descript).

There are no good text(!) speech recognition programs in general. This is in contrast to sentence speech recognition which is decent.

Once you go beyond a single sentence you encounter a lot more problems which are generally under-researched (or at the minimum under-productivized), like sentence boundary detection, punctuation, etc..


Yes, there are really good open source speech to text tools (automatic speech recognition (ASR) is the common name for that).

Kaldi (https://kaldi-asr.org/) is probably the most well known, and supports hybrid NN-HMM and lattice-free MMI models. Kaldi is used by many people both in research and in production.

Lingvo (https://github.com/tensorflow/lingvo) is the open source version of Google speech recognition toolkit, with support mostly for end-to-end models.

ESPNet (https://github.com/espnet/espnet) is good and well known for end-to-end models as well.

RASR (https://github.com/rwth-i6/rasr) + RETURNN (https://github.com/rwth-i6/returnn) are very good as well, both for end-to-end models and hybrid NN-HMM, but they are for non-commercial applications only (or you need a commercial licence) (disclaimer: I work at the university chair which develops these frameworks).

Wav2Letter (https://github.com/facebookresearch/wav2letter), the tool by Facebook.

These are probably just the most well known. There are many others as well. DeepSpeech is inferior to all the ones above, but maybe simpler.

The data to train these is also important, but you will find quite a few resources online for English, e.g. Tedlium, Librispeech, etc.

You will find lots of resources actually for ASR. Some random links:

https://github.com/gooofy/zamia-speech

https://commonvoice.mozilla.org/en/datasets

https://www.openslr.org/resources.php

To add: If you want to do something like Descript, you are mostly also interested in accurate time-stamps of the recognized text (start and end time of each spoken word). The end-to-end models are usually not so good at this (the goal is mostly to get a good word-error-rate (WER)). The conventional hybrid NN-HMM is maybe actually a better choice for this task.
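
For anyone unfamiliar with WER: it is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference (`wordErrorRate` is just an illustrative name, not from any of the toolkits above):

```javascript
// WER = (substitutions + deletions + insertions) / number of reference words,
// computed here as a word-level Levenshtein distance.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);

  // prev[j] = edit distance between the ref words seen so far and the first j hyp words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution ("fox" -> "box") plus one insertion ("jumps") over 4 reference words.
const wer = wordErrorRate("the quick brown fox", "the quick brown box jumps");
console.log(wer);
```

Note that nothing in this metric looks at timing, which is why a model can score well on WER and still give poor word timestamps.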


The user quotes sound almost too good to be true. When I think of the average YouTuber discussing video editing software, the words "best productization of machine learning I've ever seen" do not come to mind.


Some of the HN comments here are suspicious as well.


I described it in multiple calls at Stripe today as "The most impressive demo I have seen in years", and I think you can reasonably understand that I didn't join HN 10 years ago as part of a stealth mission to a) get hired by Stripe and then b) talk up audio editing software to them so that my shadowy masters got that sweet sweet podcast budget.

Here's some work from yesterday where I trained an ML model to simulate me and then used it for a point edit and then full-text-to-speech. https://twitter.com/patio11/status/1346238460969959424?s=20


I just downloaded it and tested it out. It really is a very cool, polished product.

The video cuts when you remove "uh" and "um" are too jarring for anything professional, but OTOH for some internal work stuff I could see a use for this. And as someone who has used 0 video editing software, I was able to start making changes within like a minute, which I think is quite impressive.
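
Out of curiosity I sketched what the filler removal might boil down to, given word-level timestamps. This is a guess, not Descript's implementation: the word shape and `removeFillers` are hypothetical, and the filler list only covers the two words mentioned above.

```javascript
// Only the two fillers mentioned above; a real list would be longer.
const FILLERS = new Set(["uh", "um"]);

// Hypothetical word shape: { text, start, end } with times in seconds.
function removeFillers(words) {
  const kept = [];
  let removedSeconds = 0;
  for (const w of words) {
    // Ignore case and attached punctuation, so "Um," still counts as a filler.
    const normalized = w.text.toLowerCase().replace(/[^a-z]/g, "");
    if (FILLERS.has(normalized)) {
      removedSeconds += w.end - w.start;
    } else {
      kept.push(w);
    }
  }
  return { kept, removedSeconds };
}

const talk = [
  { text: "So", start: 0.0, end: 0.5 },
  { text: "um,", start: 0.5, end: 1.0 },
  { text: "this", start: 1.0, end: 1.5 },
  { text: "uh", start: 2.0, end: 2.25 },
  { text: "works", start: 2.5, end: 3.0 },
];

const result = removeFillers(talk);
console.log(result.kept.map((w) => w.text).join(" "), result.removedSeconds);
```

Each removed word is a hard cut in the video track, which is where the jarring jumps come from.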

That said, not sure I have enough use case to pay for it, personally.

NB: it's kind of impossible to prove you're not a sock puppet, because of _course_ a sock puppet account would say they're not a sock puppet account.


If I could revise my previous comment, I didn't mean to imply the feedback wasn't from a legitimate user. Rather, the comment was very technical in what it had to say about the software.

Presumably, there are technical people that use video editing software. It's fine. But, that type of highly technical person may not be the target persona of this software.

HN discourages calling things shills, so I wanted to provide a clearer explanation of what is my best interpretation.


I disagree. The product is just that cool, that's why there are so many positive comments.

The website and product look sexy, the promo video isn't cringe and their "overdub" functionality looks like black magic.


This. I've never seen hners describe a product as "sexy" as many times as in the comments here.


> I've never seen hners describe a product as "sexy" as many times as in the comments here.

There are just 2 comments with the word 'sexy'. 4, if including your comment and mine.


I counted as well, pedant that I am, and I still believe that 2 is an unusually high number of times for hners.


This looks amazing for audio. There's no doubt that this will massively improve podcasts, radio, etc.

I can't imagine watching a video that's been chopped up like that would be a particularly nice experience though. Editing video to remove unwanted sections and not have it look like people's heads are jumping around weirdly is really hard. Cuts are really noticeable. If they've managed to fix that with ML it's going to have a huge impact on the cost of video production.


I watched a youtuber go through the whole flow of producing a video using Descript and it doesn't do anything in terms of real video editing. It's useful for removing bad takes out of a longer video, but then you need to export the chopped up product and spend a lot of additional time fixing it up and making it flow seamlessly in a tool like After Effects or DaVinci Resolve.


I wonder if deep learning algorithms such as worldsheet [0] would help in simulating multiple angles, so the program can switch from one angle to another on cuts, to make them less jarring ...

[0] https://worldsheet.github.io/


How is it jarring though? Most modern YouTube videos look just like this


Every modern YouTube video is cut just like this. It looks good. Jumping heads turn out to not look bad, when the audio flows perfectly.

https://youtu.be/2UP6CSZsc5o


In the video you linked it looks very natural. I think that's mainly because the cuts are between sentences. Not sure if it would still look good with mid-sentence stop words cut out.


It looks fine. They all do it. Source: I watch way too much YouTube


Thanks for posting this! It’s great when our users are excited about our product :-)

If anyone is interested, we are hiring engineers and PMs! In particular, we’re looking for senior backend engineers and full stack engineers.

https://www.descript.com/careers

I’m the tech lead for the backend/server team, happy to answer any questions as best as I can


It would be cool to know the tech stack you guys use to handle such heavy loads.


Our stack isn’t super novel. We’re trying to make sure we don’t write ourselves into a corner with a stack that no one else uses, so we use Typescript/NodeJS + Kubernetes for much of our backend services, and Python for our ML pipelines

On the client side, we use Typescript and Electron, which allows us to have most engineers work across the entire stack and cross-platform


https://www.descript.com/security

We don't use your Project Information for anything other than providing the service we offer — e.g. we don’t sell it; we don’t use it for marketing; we don’t use it for advertising.

This is a strong statement, which is nice. Only covers the projects themselves, though.

Under what's shared:

Google Cloud Speech-to-Text to provide automatic transcription

Google only accesses or uses your data to complete the automatic transcription service. Shortly after completing the service, Google deletes your data from its servers.

As the only HIPAA-compliant automatic transcription service, Google is an extremely privacy-friendly transcription service.

3 second search:

https://edition.cnn.com/2019/07/22/tech/google-street-view-p...

https://www.reuters.com/article/us-alphabet-google-privacy-l...

https://www.forbes.com/sites/daveywinder/2019/06/23/google-c...

And then there's a bunch of other integrations with Google/AWS/others.

I understand some of the issues were fixed bugs, but I don't know if selling Google integration as a privacy choice is honest, given Google's business model.


There's a difference between Google's consumer facing products where all these services are availed for free vs. enterprise clients using GCP - it'll be very surprising at the least if they don't honor these terms with their cloud speech to text offerings.


Google cloud speech to text has 2 prices. You pay extra if you won't allow them to keep the data to improve the product. It adds ~50% to the price.


This is Andrew Mason's baby (of Groupon fame). They've been working on it for several years and it looks really great.


I've had to do a lot of videos of my talks this year for virtual conferences. This would be useful to eliminate all of those ums and ahs. My 20-minute presentation would likely get cut down to 15...

I'd be interested to know the impact of cutting out the ums on the pacing/cadence of a talk.


From the pricing page, it looks like even if you pay for the most expensive option, you still only get to remove their watermark for 30 minutes! If you are paying for the basic service (NOT free) you can only remove their watermark for 2 minutes!

Apart from that, it looks like an interesting product.


Where did you read that? I can't see anything about watermarking here https://www.descript.com/pricing


If you hover over the Audiograms Pro help icon on the right, (midway down in the Pro plan card) it says "Custom colors, background images, and remove the Descript logo. 30 min max size (vs. 2 minutes)"


Oh right I see

But it seems like that only applies to the "Audiograms" feature, not the videos you export from the main product (?)

https://help.descript.com/hc/en-us/articles/360042638351-Aud...


Their TTS alone is very convincing. Seems to be better than what Google and Amazon have to offer https://www.descript.com/overdub


They use Google for TTS.

https://www.descript.com/security


This is not correct. The security page only shows that they use Google for Speech to Text.

Instead, Text to Speech is done using technology developed by Lyrebird.ai, which Descript acquired and rebranded as Overdub. Note that style transfer learning of voices is a hard problem and Overdub seems to have nailed it perfectly. I speculate that the underlying technology of Overdub is based on sv2tts (https://arxiv.org/abs/1806.04558).

The closest comparison to Overdub would be https://www.resemble.ai/.


This was hands down one of the best security pages I've ever read. Incredibly clear, concise and to the point


Another really exciting co in this space is Runway (https://runwayml.com/). Primarily focused on ML-enhanced media production workflows.


Looks great. Would love to be able to use it without signing up for an online account :(

That is annoying in the first place, then when you try to sign up the password field has finicky rules like requiring upper+lowercase+numbers+symbols and min length 10.
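
To show why rules like that are annoying, here is a minimal sketch of the checks as I understood them from the form. It is my reconstruction, not their actual validation code; `missingPasswordRules` is a made-up name and "symbol" is assumed to mean any non-alphanumeric character.

```javascript
// Rules as described above: upper + lower + digit + symbol, minimum length 10.
// Assumption: a "symbol" is any character that is not a letter or digit.
function missingPasswordRules(password) {
  const missing = [];
  if (password.length < 10) missing.push("at least 10 characters");
  if (!/[a-z]/.test(password)) missing.push("a lowercase letter");
  if (!/[A-Z]/.test(password)) missing.push("an uppercase letter");
  if (!/[0-9]/.test(password)) missing.push("a digit");
  if (!/[^A-Za-z0-9]/.test(password)) missing.push("a symbol");
  return missing;
}

// A long passphrase still gets rejected under these rules.
const passphraseProblems = missingPasswordRules("correct horse battery staple");
console.log(passphraseProblems);
```

So a 28-character passphrase fails while a 10-character password with one of each character class passes.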

This is compounded by the fact that it does not integrate with Chrome so I can't easily get an auto-generated strong password stored with all my other ones. Nor does it allow using a Google or Facebook account to signup.


s/Descript/self-promotion/g

Descript is an amazing product and one of the main inspirations for our startup and product : www.spoke.app

We offer the same services of editing by high-lighting text, but differ in so far as we offer video + direct capture of your content (both your microphone and system sound). Our goal isn't so much to produce nice, clean-cut videos as to summarize any video conversation you might have for co-workers, friends, etc.


Here's a small demo I made for the new year, comparing speeches from different heads of State : https://cutt.ly/happy_new_year_from_spoke

Western leaders all seem to adopt the same compassionate tone, and wish to shine hope on the new year (with the exception of Trump).

On the other hand, Xi Jinping is just....self-congratulating.


Your demo takes me to:

> Oops Your browser is not supported. Install Firefox or Google Chrome.

Safari on iOS has 6x the market share of Firefox and is quite capable of playing audio and video, it’s not IE6.


:/

On iOS Firefox and Chrome will actually also fail as they're all based on the Safari Web engine.

Sadly enough we can't make Spoke work with Safari yet as we're using video specific interpreted browser instructions.


I really liked the idea of showing a gif with a 'Play with sound' button which pops up the video.

Very innovative!


I thought that was neat too (and the video itself was awesome), however I do wish that they had hidden the gif when the video begins playing. Most modern machines won't mind, but my older machine didn't like having to play a video whilst also rendering an animated gif in the background - it managed, but the CPU shot up considerably.


Probably forced by technology, browsers don't allow autoplay with sound anymore.


"some mind-bendingly useful AI tools"

Does anyone else find this kind of "hypey" copy a put off?


After watching the video the AI tools do look mind-bendingly useful, so I take back what I said before!


Yeah I thought the same, many people nowadays just call decision trees and algorithms "AI", but this is some really handy stuff out of the reach of decision trees and algorithms. It understands what you say, learns your voice so you can correct what you say. Truly useful, although AI is a big word for how they use Deep Learning of course. But ok, AI means "uses Deep Learning" I can live with that.


I would absolutely use this for work and personal projects. But, couldn't one easily manipulate another person's work and falsely present it?


I'm still using "mencoder -ss x -endpos y" for all my video editing because Linux video editors are unstable garbage.


It looks like there’s cool stuff in there but I’m not sure what.

I don’t think this does a great job of selling itself.


Have you watched the video? I disagree, it seems to sell itself pretty good. The promotional video isn't very long and it's pretty well cut (which is a good sign, for what they're trying to sell.) And it shows plenty of interesting features too, like overdub.


Love this just for the Sandwich videos alone!

Now I gave it 15 minutes and it's amazing. Congratulations!


Anyone else having Firefox on Android crash when trying to access this website?


wow I actually watched that promotional video. I usually keep short videos in tabs for months, never watching them and then close the tab.

this one was impressive, I'm also really curious about this product too!


Please please please add more supported languages! Russian!


Hard not to read the name as "DES-crypt," although that's surely a non-issue for the target audience...


They poke fun that the name is tricky to pronounce at the end of the video.


Just think about how easy it is to make deep fakes with this.


Deserves to be on the front page. This is crazy fire.


What’s the general error rate on transcription technology at this point?


This landing page is perfect. The tech is sexy. The AI is impressive. Take my money?

Not affiliated with these guys in any way. Just saw the tweet from patio11 [1]

[1] - https://twitter.com/patio11/status/1346238460969959424?s=21



