Show HN: Aeneas – a Python audio/text aligner (github.com/readbeyond)
188 points by alpe on March 19, 2017 | 35 comments



This is super cool. I'm trying to think of common practical applications for this - would one use this to sync a script with a performance? Could this remove a lot of the work required to manually subtitle movies, TV shows, and YouTube videos?


Thank you.

Indeed, several users of aeneas have adopted it for producing SRT/TTML files, i.e., captions, for videos, both online and offline --- and many of them start with an existing transcript.

However, please note that there are limitations on the amount of "non speech" that aeneas can tolerate: for example, long spurious portions of audio or sung passages might affect the quality of the alignment.

For details on how aeneas works: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITW...


> there are limitations on the amount of "non speech" that aeneas can tolerate

Couldn't you also have, as part of the input, a very simple map where users could define times that should be ignored, to help with that? It might also be possible to look at the spectrum at any given time to identify areas of the file to skip.

And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?


> It might also be possible to look at the spectrum at any given time to identify areas of the file to skip.

I would say yes and no.

Currently you can add a switch that makes aeneas ignore the audio intervals that are detected as "non speech" by the built-in Voice Activity Detector (VAD), which is a very rough energy-based VAD. For sure this is a part that can use some improvement.

However, AFAIK e.g. music/singing separation is a really difficult open problem, with people in academia doing PhDs on it. So, I am not sure how far one can push this line, while staying relatively fast on a regular machine. (Which is one of the goals of aeneas.)
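For readers wondering what the rough energy-based VAD mentioned above boils down to, here is a toy sketch (illustrative only, not aeneas' actual code; the frame length and threshold are arbitrary): frames whose RMS energy stays below a threshold get marked as "non speech".

```python
# Toy energy-based VAD (illustrative sketch, not aeneas' implementation).
import numpy as np

def energy_vad(samples, sample_rate=16000, frame_ms=25, threshold_db=-35.0):
    """Return one boolean per frame: True = speech-like energy, False = 'non speech'."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12          # avoid log(0)
        db = 20.0 * np.log10(rms / np.iinfo(np.int16).max)  # assumes 16-bit PCM samples
        flags.append(db > threshold_db)
    return flags
```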

> And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

Besides converting the input audio file to mono 16 kHz 16 bit WAVE, I do not perform any other operation on the audio data before passing it to the MFCC extractor (which by default runs with "standard" settings, but the user can change them).
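For reference, that conversion step can be reproduced with ffmpeg (which aeneas already requires); the file names below are placeholders:

```python
# Convert any input audio to mono, 16 kHz, 16-bit PCM WAVE with ffmpeg
# (placeholder file names; aeneas performs an equivalent conversion internally).
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "input.mp3",       # any format ffmpeg can read
    "-ac", "1",              # mono
    "-ar", "16000",          # 16 kHz sample rate
    "-acodec", "pcm_s16le",  # 16-bit PCM
    "output.wav",
], check=True)
```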

Unfortunately, I have had no time to perform an exhaustive search of the parameter space, nor to try other pre-processing techniques.

But for sure if you have means to "pre-clean" the audio file before feeding it into aeneas, that is probably going to improve the quality of the output alignment.

(I did play with amplitude normalization and it did not seem to improve the results. The non-speech masking mentioned above seems beneficial if you do word-level alignment.)


Definitely.

Actually, aeneas can be used as a Python library (rather than just a CLI tool), and you can definitely provide an audio file, a list of audio intervals where the spoken text is, and align "piece-wise". See the "aeneas library tutorial" in the docs.
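For anyone curious, basic library usage looks roughly like the README/tutorial example below (reproduced from memory, so check the docs for the exact API; the paths and config values are placeholders). For piece-wise alignment you would run one such task per audio interval.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# one task = one (audio interval, text fragment list) pair to align
config_string = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/path/to/audio.mp3"      # placeholder paths
task.text_file_path_absolute = u"/path/to/text.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

ExecuteTask(task).execute()   # compute the alignment
task.output_sync_map_file()   # write the sync map (JSON here) to disk
```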

At the moment, the CLI tool aligns only a single audio interval (possibly chopping the head or the tail of the audio file) --- which is just a special case of the above.

I remember a user requested this feature in the past. I have not added it yet because:

1. I have not heard much interest in it, and I have not needed it myself;

2. I am not satisfied with the current CLI interface (historical reasons mandated the use of big config strings and strange, long parameter names), and hence I think this kind of new feature should be added once aeneas 2.x is out, with a redesigned CLI.


When I was an undergrad freshman, I took a job with a research group as a data annotator. My job was to go through the Switchboard corpus (recordings of hour-long phone calls that people agreed to have recorded, in exchange for having the long-distance charges paid) and label features such as who was speaking, whether the pitch of the voice was rising or falling, whether the vowels were elongated, vocal fry, and stuff like that.

But the most time-consuming and mind-numbing part of it was just annotating the words in the sound file.

The interface for all of this was a terrible GUI hacked on top of some Solaris sound editor, and it couldn't do things for you like finding the moments words began, or saying "hey, the pitch is obviously falling here" (frequency tracking is something computers can do), or anything.

There's still a lot more voice data to annotate in the world, and maybe having a flexible Python tool like this will make the next undergrad doing the grunt work much more effective at it.


I agree with most of your observations.

However, please note that other tools are better suited than aeneas if one wants to align at phoneme level: gentle, Kaldi, SPPAS, etc.

aeneas' goals are covering as many languages as possible, running fast, and targeting (sub)sentence granularity (e.g., ebook-audiobook sync or closed captions). Phoneme-level annotation really requires more sophisticated techniques, like the HMM/GMM/NN models implemented by the tools mentioned above. Yet, aeneas can be used to quickly bootstrap e.g. a manually-reviewed alignment.


There's nothing new about this; it's how speech recognition training data has been generated for a long time. Whether you can align a script will depend on how accurate it is and how expressive your models are at generating spoken/surface-form alternatives for the ways things like dates are verbalized, which look different in text. If more than one person is speaking at the same time, the results will be terrible.


I would like to note once again that aeneas is not based on automatic speech recognition techniques, but on MFCC + DTW, which is an even older approach, with pros and cons.

Interestingly, there are situations where ASR-based forced aligners seem to be tricked into error, while aeneas handles them more robustly --- for example, if the speaker repeats a word in the spoken audio but the transcript has only one occurrence, or when the speaker mumbles (uhm's, ah's, etc.). On the other hand, it is true that if you want word- or phoneme-level alignment, ASR-based aligners outperform aeneas.

Finally, let me note that three major goals of aeneas are: 1. processing hours of audio relatively fast on a standard PC (the current real-time factor is between 0.008 and 0.020); 2. being easy to install and run (unlike many other open source aligners derived from academic projects, which require a PhD just to get the dependencies right); and 3. working out-of-the-box for many languages, including ones that are not covered by academia or commercial solutions because they are "minor" (say, Icelandic or ancient Greek (!)).

But yes, the core algorithmic approach of aeneas has been around since the 1970s.
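For the curious, the core idea is roughly: synthesize the text with a TTS voice, extract MFCC vectors from both the synthesized and the real audio, build a frame-to-frame cost matrix, and find the cheapest monotonic path through it with DTW. A textbook, unoptimized DTW sketch (not aeneas' actual C implementation) looks like this:

```python
import numpy as np

def dtw_path(cost):
    """cost[i, j] = distance between MFCC frame i of the real audio and
    frame j of the TTS-synthesized text; returns the optimal warping path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match
                acc[i - 1, j],      # skip an audio frame
                acc[i, j - 1],      # skip a synthesized frame
            )
    # backtrack from the bottom-right corner to recover the alignment path
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        moves = [(acc[i - 1, j - 1], i - 1, j - 1),
                 (acc[i - 1, j],     i - 1, j),
                 (acc[i, j - 1],     i,     j - 1)]
        _, i, j = min(moves, key=lambda t: t[0])
    path.append((0, 0))
    return path[::-1]
```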


I recently had a vague plan to start working on something like this, with the idea that I could take audiobook media files and their accompanying ebook representation and use them to automatically re-divide the file by chapter (or something chapter-based). Not sure if this will work well for that (or if my use case is considered "common"), but I'm certainly glad to see it.


I have used aeneas myself to do it, with mixed results. You will probably need to increase the DTW margin. Also note that you will need a lot of RAM --- say 16 GB if you plan to work on a single audio file with duration 10-15 hours, which is typical for an audiobook.

In theory one can perform the DTW out-of-core, saving the accumulated cost matrix and path to disk, but I have not had time to implement this yet (i.e., currently the accumulated, reduced DTW cost matrix must fit into RAM). I verified that it can be done with PyTables, but it will probably come with the next major version of aeneas (v2).

BTW, if your goal is to split, say, chapters of an audiobook, probably there are more efficient ways of doing this. For example, finding the long silence intervals between chapters might be enough. Or, instead of aligning all the text against all the audio, just perform a "partial matching" of the first sentences of each chapter against the audio.
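As a rough illustration of the "long silences" idea, here is a sketch using pydub (an assumption on my part, not an aeneas dependency; the thresholds are guesses you would tune per book):

```python
# Find long, quiet intervals that are likely chapter boundaries.
from pydub import AudioSegment
from pydub.silence import detect_silence

book = AudioSegment.from_file("audiobook.m4b")  # placeholder file name

# intervals (in ms) that stay below -40 dBFS for at least 2 seconds
silences = detect_silence(book, min_silence_len=2000, silence_thresh=-40)

# take the longest silences as candidate chapter breaks, split at their midpoints
longest = sorted(silences, key=lambda se: se[1] - se[0], reverse=True)[:20]
boundaries_ms = sorted((start + end) // 2 for start, end in longest)
print(boundaries_ms)
```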


Yes, thanks for the feedback! I wasn't planning on feeding it the entire audiobook and trying to align the whole thing (though there are other reasons you might want to do something like this) - I figured I'd use some heuristic methods to detect chapter breaks (like long silences), then try, as you say, partial matching to figure out which ones correspond to which chapters (or whether they correspond to chapters at all). Like I said, it was a vague plan, but when I've played around with running things through speech-to-text in the past I haven't had excellent results. I was hoping something like this (where you have the speech and the text and just want to know how they line up) would end up being much more accurate.


You are welcome.

Using a forced aligner usually improves the results a lot compared to using an automatic speech recognition system --- because adapting the language model to your specific text prunes a lot of choices with respect to a generic language model, which is supposed to cover any kind of text in that language.

Anyway, if you feed aeneas an audio file shorter than 2 hours, 4 GB of RAM should suffice, and the default parameters should be good as well. If you just need to find the splits, doing a full alignment is overkill, but I guess you will be happy to "waste" 5 minutes of computation time instead of spending more time implementing your own code.


This is going to be beyond useful for me. I can extract far more labeled audio samples for my Donald Trump text to speech engine [1]. Thanks for sharing this!

[1] http://jungle.horse


Have you looked at applying the techniques used in Google's Wavenet to your corpus? In addition, any interest in releasing your corpus?


This code is also useful: <https://github.com/lowerquality/gentle>


Yes, there are several other open source aligners out there, mostly from academic research or derived from academic projects. On my personal GitHub page I have a repo with an annotated list of forced aligners. (If I add a link to it, the spam detector triggers?! Anyway, google "github forced-alignment-tools" to find it.)

Gentle, which is based on Kaldi, has good performance and a handy setup script.

However, these aligners, which are based on automatic speech recognition techniques, have pre-trained models only for English and maybe a handful of other "popular" languages. Some allow you to train your own language model, but very few users have the actual competence/resources to do that.

aeneas is built using an older approach, which has the advantage of requiring weaker language models, which are already available (in the form of TTS voices): this is the reason why it "supports" so many languages. Of course the disadvantage is that aeneas works decently well at (sub)sentence granularity, but worse than ASR-based aligners at word granularity or with noisier audio files.


Do you know of any existing forced alignment tools that work well with live audio (microphone) input? I would like to create a live stream in which the words of a known text are displayed as they are being spoken into a microphone.


For sure aeneas is not suitable, since it requires all the text and all the audio in advance.

ASR-based tools in theory would allow such an operation mode, but I have not seen aligners that read from the mic buffer directly or have a built-in option/CLI for it.

Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, and then use the (standard) acoustic model for your language and the usual alignment procedure. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.


I looked into gentle a few weeks ago and did notice that it seems to use an online algorithm. It doesn’t have built-in support for live audio input unfortunately, but it may be tweakable as you say (such as reimplementing it to use audio streams that work with either static or real-time input). I guess there’s no other way to find out than just try it myself.


Another possibility is to just run an automatic speech recognition system (e.g. Sphinx or PocketSphinx can read from the mic input), and align its output with the ground truth text.

You need to deal with imperfect matching because the ASR might produce a text slightly different from the ground truth, but if you want to chunk e.g. at sentence granularity (and then move on to the next sentence), you should be able to do it in real time.
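A minimal sketch of that "imperfect matching" step, using only the Python standard library (the word sequences here are toy data):

```python
# Align an imperfect ASR hypothesis against the known text with difflib,
# to locate where each ground-truth region shows up in the ASR output.
import difflib

ground_truth = "the quick brown fox jumps over the lazy dog".split()
asr_output   = "the quick browns fox jump over lazy dog".split()  # toy ASR errors

matcher = difflib.SequenceMatcher(a=ground_truth, b=asr_output, autojunk=False)
for a, b, size in matcher.get_matching_blocks():
    if size:
        # a/b are word offsets into the two sequences; size is the match length
        print(ground_truth[a:a + size], "->", asr_output[b:b + size])
```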


Thanks for creating this. I can imagine a not-so-distant future where thousands of random video-watchers could annotate tiny parts of videos via some free-form box, and aeneas could clean up and formalize this into an official transcription. Seems like a minor feature, until one realizes how much the public just lost due to missing transcriptions: https://www.washingtonpost.com/local/education/why-uc-berkel...


Thank you.

Indeed, while aeneas was created for ebook-audiobook synchronization, several of its current users are producing closed captions --- because, in most cases, they already have a clean transcript (e.g., speakers provide transcripts to the captioner) or they clean up an automated transcript, derived from an automatic speech recognition system.


To elaborate a bit further, since the closed captioning applications are indeed very important: they serve hearing-impaired people, dyslexic readers, and second language learners.

Let's think about how a human operator would create captions for a video.

If the transcript is not available, the human will roughly transcribe the video (speech to text/speech recognition) and, if expert, will also segment it into closed captions at the same time (segmentation). Note that the segmentation usually needs to follow certain constraints, like a maximum number of characters per second (otherwise the CCs are too long/fast to read), and it might also condense the words actually spoken into less verbose text. On top of this, there are special cases, like marking dramatic pauses or laughter, or describing on-stage events. A human being using a CC tool would also get the time alignment basically for free, as they would write the CCs while watching the video, pausing it to write the CC text, and so on.

If the transcript is available, it needs to be segmented into CC (same issues as described above), but once done, a forced aligner like aeneas can be used to get the timing automatically. This is the typical scenario for the aeneas users interested in CC production.
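To make the segmentation constraint concrete, here is a naive greedy segmenter (an illustrative sketch only; real CC segmentation also has to respect reading speed, phrase boundaries, speaker changes, and so on; the 42-character limit is just a common per-line convention, not an aeneas parameter):

```python
# Naive greedy segmentation of a transcript into caption-sized chunks.
MAX_CHARS = 42  # assumed per-caption character limit

def segment(transcript, max_chars=MAX_CHARS):
    captions, current = [], []
    for word in transcript.split():
        if current and len(" ".join(current + [word])) > max_chars:
            captions.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        captions.append(" ".join(current))
    return captions

print(segment("this is a long transcript that needs to be chopped into short caption lines"))
```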

Now, let's think about how machines can produce CCs.

If you use speech recognition --- like the auto CC on YouTube --- you can get the transcript automatically (usually with transcription errors, especially for less well-trained languages), with the timings as well. Segmentation is performed automatically too, in a greedy-like fashion driven by the audio signal, but it is usually way inferior to the one produced by an expert captioner. The advantage is that the entire workflow is automated.

However, if some manual labor can be applied, perhaps the best flow is the following: use an ASR to get a rough transcript (e.g., download the auto-CC from YouTube or run your ASR of choice), manually clean it, segment it into CCs [1], and then use a forced aligner like aeneas to get the timings. This flow is available e.g. in the aeneas Web application at [2], and users say it is faster than writing the CCs from scratch. I would say it strongly depends on whether the ASR phase produces a decent transcript or not.

[1] Actually, I am working on an ML-based NLP library to automate the segmentation (i.e., going from a raw transcript to a sequence of CCs respecting the constraints described above).

[2] https://aeneasweb.org


This is really cool!

Would you like me to make a conda package for this? I can do so for Linux and OS X, so that someone who uses Python for data science can do `conda install aeneas` and it will install this and its dependencies into a conda environment.

I'd do it on Windows too, but I don't know of an easy way to get my hands on a Windows box. If anyone knows of a service that can give me 30 minutes of CLI access to a Windows box, I'd be grateful.


AppVeyor has free Windows CI for open-source projects (which probably would be a good idea to set up for later updates), and they explicitly mention that you can install remote access tools in their build VMs to work on/debug the build: https://www.appveyor.com/docs/how-to/rdp-to-build-worker/


Hi, thank you.

Having it on conda would be great (https://github.com/readbeyond/aeneas/issues/158), so if you feel like doing it, that would be wonderful!

The two points that proved difficult in packaging aeneas (as self-installers and as .deb for Ubuntu/Debian) are:

1. the package must also install/require-as-dep ffmpeg and espeak;

2. the package must trigger the compilation of the Python C extensions, as described in setup.py.

Unfortunately I am a Debian/OS X user (I do not even own a Windows machine right now), but I am told that one can use the Win10 IE VirtualBox images.


"Audio is assumed to be spoken: not suitable for song captioning"

Can anyone recommend alternative approaches for music lyrics alignment?


What's the accuracy level of alignment?


aeneas is not based on ASR (i.e., it does not try to "recognize" words and align them with the input text), but on the "older" MFCC + DTW approach.

Hence, it is difficult to give you a precise answer, e.g. in terms of word-error-rate or similar metrics.

For the task aeneas has been designed for --- aligning an ebook and the corresponding audiobook --- and for similar tasks (e.g., captioning videos of lectures or spoken-only content), it generally produces an alignment that is indistinguishable from a manually-produced one.

If you want to see some examples, read+listen to one of these audio-ebooks, whose alignment was produced by aeneas: https://www.readbeyond.it/ebooks.html

But of course, if you want to align at a finer level (word) or with noisier/non-matching audio, the quality of the alignment can deteriorate.


Thanks for the explanation. Will it work if there are gaps in the transcript? E.g., a clean verbatim transcript where the ah's and uhm's are left out.


Several users of aeneas interested in producing caption files for videos told me that it does. And considering how DTW works, it is plausible.

Unfortunately, I have not had the time to set up a suitable corpus and perform a rigorous evaluation, so I cannot comfortably answer your question with a definitive "yes".

Perhaps the best way to see if aeneas works for your use case is to try it out.

If you do not want to install anything on your machine, you can use the aeneas Web app: https://aeneasweb.org --- basically, you submit an audio file (or a YouTube URL) and a text file, and get an SRT/TTML/etc. file emailed back.


I definitely plan to try it soon.


Reading my name (spelled correctly, kudos for that) on Hacker News feels really weird.


In Italian high schools ("Licei") we take 5 years of Latin (and also ancient Greek if you choose the classical study path)... nice to meet you!



