
I take it you haven't been following the field lately? It is surprising because convnets just work better for speech recognition ([1] is the latest SOTA). I'm guessing gated convolutional LMs are slower than RNN transducers when deployed on mobile. Can someone confirm?

[1] https://arxiv.org/abs/1812.06864



There is also the transformer approach (possibly with local attention to bound latency), which I'm trying in my work-in-progress project: https://github.com/GistNoesis/Wisteria/blob/master/SpeechToT... It's in the same line of thought as the convolutional CTC, though.
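As a rough, hypothetical illustration of the local-attention idea (not taken from that repo): a banded mask caps how far ahead each frame may attend, which is what bounds latency in a streaming setting.

    import numpy as np

    def local_attention_mask(seq_len, left_context, right_context):
        """Boolean mask: position i may attend to positions [i - left_context, i + right_context].

        Capping right_context is what bounds latency: a frame never waits for audio
        more than right_context frames ahead of it.
        """
        idx = np.arange(seq_len)
        offsets = idx[None, :] - idx[:, None]   # offsets[i, j] = j - i
        return (offsets >= -left_context) & (offsets <= right_context)

    # 10 frames, 4 frames of left context, 2 frames of lookahead.
    mask = local_attention_mask(10, left_context=4, right_context=2)
    # In the attention layer, scores outside the mask are set to -inf before the softmax.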

The RNN-T is a nice idea though. If I understand it correctly, it's another approach to the alignment problem. In CTC you generate sequences like TTTTHHHEE CCCAAATT, which means your language model must deal with these repetitions, and you can't train it on text without repetitions. In RNN-T you learn to advance a cursor on either the audio sequence or the text sequence so as to maintain alignment, kind of like when you merge two sorted lists in merge sort. It therefore outputs THE CAT, and you can use a standard language model.
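For concreteness, here is a minimal sketch of the CTC collapsing rule described above (merge consecutive repeats, then drop the blank symbol); the "-" blank token is just a placeholder:

    def ctc_collapse(frame_labels, blank="-"):
        """Collapse per-frame CTC output: merge consecutive repeats, then drop blanks."""
        out = []
        prev = None
        for label in frame_labels:
            if label != prev and label != blank:
                out.append(label)
            prev = label
        return "".join(out)

    print(ctc_collapse("TTTTHHHEE CCCAAATT"))   # -> "THE CAT"
    # A genuine double letter (e.g. the "tt" in "mitt") needs a blank between the two
    # runs of frames so the repeat isn't merged away.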


We explored self-attention + CTC in our ICASSP 2019 paper (https://arxiv.org/abs/1901.10055). Our implementation uses internal infra, so we're a ways from releasing code :(

Hoping the details in the paper suffice and help with the parameter search; happy to respond to questions over e-mail. Would love to see an open-source implementation with local or directed attention built out!


Very interesting! Perhaps I should revise my original comment to “everyone is moving to transformers lately” :)


What is "convolutional CTC"?

Using gated convolutions as the LM is similar to the RNN-T idea [1], but you still have to deal with the softmax, so I'm not sure how well this would work in practice, especially on a mobile processor.

[1] https://arxiv.org/abs/1612.08083
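For reference, the gating in [1] is a gated linear unit: one convolution produces candidate activations and a second, sigmoid-gated convolution decides what passes through. A minimal NumPy sketch of one causal gated-conv layer follows; the shapes and sizes are illustrative, not taken from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gated_conv1d(x, w_a, w_b):
        """One causal gated convolution: h = conv_a(x) * sigmoid(conv_b(x)), elementwise.

        x:         (time, channels_in)
        w_a, w_b:  (kernel_size, channels_in, channels_out)
        Left padding keeps it causal, so position t only sees inputs <= t, as an LM must.
        """
        k, t_len = w_a.shape[0], x.shape[0]
        x_pad = np.pad(x, ((k - 1, 0), (0, 0)))        # causal left padding

        def conv(w):
            return np.stack([np.tensordot(x_pad[t:t + k], w, axes=([0, 1], [0, 1]))
                             for t in range(t_len)])

        return conv(w_a) * sigmoid(conv(w_b))          # the gate

    # Toy shapes: 20 time steps, 8 input channels, 16 output channels, kernel size 3.
    x = np.random.randn(20, 8)
    w_a, w_b = 0.1 * np.random.randn(3, 8, 16), 0.1 * np.random.randn(3, 8, 16)
    h = gated_conv1d(x, w_a, w_b)                      # shape (20, 16)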


I'm guessing Connectionist Temporal Classification: https://www.cs.toronto.edu/~graves/icml_2006.pdf


The paper you cite matches SOTA performance on the Wall Street Journal and LibriSpeech test sets, both of which are clean, read speech and not at all representative of what a phone or assistant recognizer deals with. The convnet described there also does not perform streaming recognition.

The primary reason to be interested in convnets for speech is computational parallelism, not because they have especially strong results for accuracy.


I'm curious why people think that recurrent architectures are somehow more noise-tolerant. Where did this come from?


They're not, as far as I know. But the paper cited just shows relative parity or a slight improvement on relatively toy examples. The claim that convnets are the clear winner for speech in general, or what everyone is doing now, is just not true.

I work in the field. A more accurate summary would be that there are a number of viable architectures that currently get fairly similar accuracy but have different pros/cons with respect to streaming, memory use, parallelism, model size, integration with external language models and context, complexity of the decoder, friendliness to different types of hardware, etc.


OK, so what are the advantages of RNN-based over CNN-based models for speech-to-text, with respect to any one of those factors you mentioned?


Well for example, some comparisons to the CNN paper you pointed to:

- No comparison of the number of model parameters is given. If you're optimizing strictly for model size, RNNs tend to be nice and compact.

- The computational advantage of the CNN at training time is throughput. The advantage of the RNN at decoding time is streaming latency. Running the CNN frame by frame as frames are received removes the ability to process frames in parallel, and if the CNN is larger it will run slower; depending on its receptive fields it may not stream well at all (a rough lookahead estimate is sketched after this list).

- That particular CNN system uses a strictly external LM that is not jointly trained, plus an additional decoding-time hyperparameter to weight the LM, which requires extra tuning.

- It is still autoregressive in the beam search, so the LM will still be run many times sequentially as tokens are added, just like an RNN LM, and is likely to be more expensive. The throughput advantage a conv LM has when scoring whole sentences is totally lost. In fact, there doesn't seem to be anything special about the choice of a conv LM for that paper except that it is fun to make all the parts convolutional.

- CNNs frequently require more total FLOPs but are high-throughput on, e.g., a GPU because they expose so much parallelism. On an embedded CPU this can be a bad tradeoff.
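To put rough numbers on the streaming point in the second bullet above, here's a back-of-the-envelope estimate of how much lookahead a stack of centered (non-causal) conv layers forces you to buffer. The layer configuration and the 10 ms frame shift are illustrative assumptions, not values from the paper.

    def lookahead_ms(layers, frame_shift_ms=10.0):
        """Rough streaming lookahead of a stack of centered conv layers.

        layers: list of (kernel_size, stride, dilation) tuples, input-to-output order.
        A centered kernel of size k needs (k - 1) // 2 future steps; striding makes
        each step of a deeper layer cover more input frames.
        """
        future_frames = 0
        jump = 1                                   # input frames per step at this layer
        for kernel, stride, dilation in layers:
            future_frames += ((kernel - 1) // 2) * dilation * jump
            jump *= stride
        return future_frames * frame_shift_ms

    # Hypothetical 6-layer front end: 300 ms of audio must be buffered before the
    # first output frame can be emitted, on top of any compute time.
    print(lookahead_ms([(5, 2, 1), (5, 1, 1), (3, 1, 2), (3, 1, 2), (3, 1, 4), (3, 1, 4)]))

A causal CNN avoids the buffering, but then each layer only adds left context, and the frame-by-frame compute still can't be batched across time the way it is at training time.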

As a side note, there's no reason that CNN architecture, which in the paper is trained with a close relative of CTC and decoded identically to an RNN CTC AM plus external LM, couldn't be trained as an RNN transducer. Despite the name, neither the AM nor the LM actually has to be an RNN.


Haha, I ran a plain convolutional net as an encoder for ASR in a toy project while learning seq2seq. It worked fine on the small datasets I was working with, like the voice commands set...


Thank you for the detailed answer. This is exactly what I was looking for when starting this thread.




