Building a language translator from scratch with deep learning

jeffreyrogers · on Oct 10, 2018

This is very cool. One thing I wonder about though is whether small companies will be able to compete with large ones like Google in ML in the future. One reason Google's translator is better is because they have way more data. In the past they digitized tons of books so they have an excellent dataset that has been translated by professional, human translators. This data collection is effectively cross-subsidized by Google's primary business: advertising.

Since most competitors to Google offerings aren't going to have a hugely profitable core business with which to fund all the data collection and normalization that goes into building a high quality ML system, the future for poorly capitalized competitors to compete seems bleak to me. This seems to support some of the growing rumblings about enforcing antitrust laws against the large tech companies.

Edit: better, not bigger.

l9k · on Oct 10, 2018

DeepL had a lot of good press when it came out last year. Some saying it was better than Google.

https://www.deepl.com/en/translator

akie · on Oct 11, 2018

Wow, thank you for mentioning that. I cannot believe how good the translations are! My native tongue is Dutch and I threw in some (long!) English, French and German texts and honestly, they read like they were written by a native speaker. Hugely impressive.

thirdsun · on Oct 11, 2018

When someone recommended DeepL to me I almost didn't take it seriously expecting mediocre translations that aren't bad, but hardly usable without heavy editing. However after trying it I'm very impressed with its results and in those cases where you have a better translation in mind the interface offers an easy way to suggest and replace expressions. It's impressive.

saip · on Oct 10, 2018

Access to parallel corpora is a limiting factor in general. A good way to train a language translator is to use an open source dataset (several here http://opus.nlpl.eu/) to train a base model, and then fine-tune it with a smaller dataset specific to your domain.

In this case, the author claims pretty good accuracy, almost on par with Google Brain's!

  On my test set of 3,000 sentences, the translator obtained a BLEU score of 0.39. This score is the benchmark scoring system used in machine translation, and the current best I could find in English to French is around 0.42 (set by some smart folks as Google Brain). So, not bad.

jeffreyrogers · on Oct 10, 2018

Wow, missed that part when I read it. Pretty incredible that using open source data you can outperform the state-of-the-art machine translators of a few years ago.

zawerf · on Oct 10, 2018

For a historical perspective check out stanford's nlp course: https://youtu.be/IxQtK2SjWWM?t=1267

Deep learning only started beating tradition methods in 2016!

virgilp · on Oct 11, 2018

EU helps with this too, accidentally. All official documents are translated in all EU languages, with very high quality translators. And all these documents are public.

batterseapower · on Oct 10, 2018

The value of large corpora for translation may be diminishing.. In particular, Facebook have achieved impressive results using unsupervised ML for translation: https://code.fb.com/ai-research/unsupervised-machine-transla...

The basic idea is to use word vector embeddings to build a source<->target dictionary, then combine this with a language recognition model to iteratively bootstrap a set of source<->target training examples for use with a conventional ML approach.

sooheon · on Oct 11, 2018

So the value of a large corpus remains, it's just this one happens to be generated, as opposed to collected.

kome · on Oct 11, 2018

Deepl is already MUCH better than Google translate: https://www.deepl.com/translator

pixelHD · on Oct 10, 2018

Another perspective in similar veins would be the rise of AutoML. Given its absurdly high computational cost, I'd think only enterprises with massive computational power at their disposal would be able to use it.

pixelHD · on Oct 10, 2018

The transformer paper was quite influential in machine translation space. This resource [0] posted here a while back is a good place to learn and get a better idea how it works.

[0]: http://nlp.seas.harvard.edu/2018/04/03/attention.html

lucidrains · on Oct 10, 2018

one of the best visual tutorials on the transformer I came across http://jalammar.github.io/illustrated-transformer/

pixelHD · on Oct 11, 2018

Wow, that does look really good! Thanks!

lucidrains · on Oct 11, 2018

you're welcome! :)

i_made_a_booboo · on Oct 10, 2018

Machine translation has made some pretty impressive progress over the last decade. Unfortunately no methods will ever cover the very last mile as languages don't have perfect 1 to 1 mappings. Though it is amusing watching the machines try.

jchw · on Oct 11, 2018

If we ever do solve the last mile, it would probably be one of the less interesting consequences, as it would probably imply we've built an algorithm capable of learning and thinking to a similar degree of a human.

To that, though, I'm definitely not holding my breath :)

i_made_a_booboo · on Oct 11, 2018

The last mile isn't solvable. Some languages contain concepts, set phrases, vocabulary and pop-culture references entirely unique to that language. There isn't a translation in every single case. Machines however will always try to come up with one and the results are amusing.

Also people make the assumption that as soon as we make strong AI comparable to a human we will be to translate anything and everything (let's say we are excluding the last mile for arguments sake). That assumption ignores an important fact that sometimes translation is a team effort where certain words, phrases or concepts are debated among multiple translators to reach a consensus. It's not always done by a single intelligence.

Some people might argue that's because people have far more limited capacity to consider all the examples in the corpus whereas a machine can consider all of lightning fast and thus can arrive at the right answer.

A perfect edge case that illustrates why that doesn't matter and where multiple human intelligences will often grapple with how something should be translated would be what name to give to a movie you are translating to an international audience. The same movie often has quite different names depending on which language it gets translated into. There isn't actually a correct answer there is just answers that are deemed 'good enough'.

jchw · on Oct 11, 2018

You know, though, machine translators have long been able to make subjective choices in translations. We deem them correct because a human can verify that the translation carries roughly the same intent, meaning, tone, etc. Not because it matches exactly what a human says.

Secondly, you are conflating concepts in my opinion. Localizing a movie may involve translators translating lines, but it also involves the creative work of localizing the title and other things, as you mentioned. A machine translator by today's definition translates a string of text in one language to a string of text in another. We needn't consider every type of work a human translator might do; it would be quite enough of a difference to close the gap on translating strings straightforwardly.

i_made_a_booboo · on Oct 11, 2018

This presumes you can translate all strings straightforwardly. You can't. There are times where I've been given a string and had to have an in-depth 30 minute discussion to understand enough of the surrounding context to be able to spit out a result. In certain cases no mapping exists.

Also, anyone who is able to verify that a translation conveys a meaning in enough of the same direction as the original utterance by definition doesn't need a translation as they know both the source and target language.

It's everybody else who is not able to verify for whom the accuracy matters for they have no recourse but to trust it.They are frequently led astray.

A couple of examples to illustrate.

掘り炬燵 (horigotatsu) is a noun referring to "low, covered table placed over a hole in the floor of a Japanese-style room"

Now, given this is something that doesn't exist in any Western, English speaking country it simply doesn't have a mapping in English. The best that can be done is to give an explanation of what it is.

Google translate "translates" it as "digging". Welcome to the last mile. In this case Google should just spit out an explanation of what it is. Digging is entirely incorrect and unhelpful.

But it gets worse. Imagine if it's used in a sentence. Here is a good example of a last mile issue in translation. It's impossible for you to translate it directly, so you have to fall back to a best effort attempt and either simplify and lose some information or stop mid-sentence and give an explanation of what the thing actually is.

掘り炬燵に座ってご飯を食べてた。

This sentence is all kinds of problematic from a translation point of view.

Google translates it as: "I sat on a digging stone and ate rice."

That borders on D+/C- in terms of quality for me. But there are a few good reasons as to why.

The original Japanese doesn't give the context of who is performing the action because that's simply not necessary to say in Japanese it's almost always just inferred from context in the moment and that gets lost when you only have a string. Thus it's possible this could be a "he, she, it, we, I, they". If the machine is forced to pick one option then it will pick one option.

Then there is the horigotatsu part which gets "translated" as "digging stone". What the hell is a digging stone? It ought to just say horigotatsu* and have a footnote. Machine translation today doesn't do footnotes. I wish it did.

Again there is a lack of context as to the meaning of ご飯 (gohan) which technically can mean cooked white rice but in this case most likely refers to a "meal". Though which meal is not specified and it could be breakfast, lunch or dinner but I'm going to guess it's dinner.

But what should the translation actually be? Is it even fundamentally "translateable"?

One valid translation would be "we sat in the horigotatsu and had dinner". That still requires an explanation.

Anyway, I hope it's a little clearer what I mean that it's not actually always possible to translate things.

I think we can hit parity with humans one day, but it requires fundamentally rethinking certain things at a UX level. For instance if instead of just an input form Google translate was more like a chatbot that could probe for more context when needed that's more my idea of where things need to ultimately wind up. Perhaps a model like rap genius where annotations contain extra details around possible alternatives and why the current word was chosen.... This is my 2 cents on the issue.

jchw · on Oct 11, 2018

No, I am not presuming every sentence has a straightforward translation, just suggesting that a meaningful measure for the "last mile" of machine translation would be reaching human parity at that specifically.

Being able to provide additional context would be great, but I don't see why it would have to be done in a "human" way to satisfy the constraints.

kome · on Oct 11, 2018

> The last mile isn't solvable.

So true. Often the last mile isn't solvable even for highly trained humans: it's not like it's hard, it just impossible.

Translators have to take alternative ways to convey the meaning. Making choices, and often losing some parts completely.

Machines have no chances.

groestl · on Oct 11, 2018

> The last mile isn't solvable.

This does not really apply to your movie name example, but in a text machines could try to escape to a meta-level, like humans sometimes do: "There is an old gaelic saying, which, roughly translated, goes something like this..."

i_made_a_booboo · on Oct 11, 2018

That is the only end game I would consider to be a full implementation of machine translation on par with human capabilities.

In cases where there is no direct equivalent or even an appropriate localization we usually resort to giving a long form explanation. If the machines can do that, they've won.

I can't see the current techniques getting us that far. It's going to require some new, creative thinking to really push the envelope.

laichzeit0 · on Oct 11, 2018

I suppose we should just give up the whole enterprise and not bother trying at all, or what are you saying? Maybe we can't ever solve the "last mile", so what?

i_made_a_booboo · on Oct 11, 2018

I'm saying the current techniques will only take us so far and they break down in hilarious ways when they reach these edge cases.

My point is to reach true parity with human capabilities machine translation needs to reach the point where it can also recognize "I can't translate that" instead of always trying to spit out an answer and returning hilarious garbage in such situations.

Another behavior I would expect to see is sometimes returning multiple possible results. This is common when a translator doesn't know the exact context but can imagine the range of possible contexts in which an utterance can occur and conveys them all in order to get the message across.

Currently I don't see anyone trying to take this approach. It's all simple one string in gives one string out.

felipemnoa · on Oct 11, 2018

>>To that, though, I'm definitely not holding my breath :)

Nature was able to do this. Sure, it took a couple billion years of evolution to get to this point, but it is doable. I'm betting that the chances of us inventing Strong AI within the next 100 years is almost a near certainty.

nl · on Oct 11, 2018

I don’t really agree with this.

Modern neural translation techniques don’t impose a 1:1 mapping on translations, and this works reasonably well between major European languages (English/German/French/Spainish/Italian) where there are large cross language corpas and large monolingual corpa available. In these languages you’ll often get single words translated to short phrases.

It’s true that this doesn’t solve the Japanese examples you give below, but these seems a question of degree rather than seeming impossible.

i_made_a_booboo · on Oct 11, 2018

It's not about 'imposing' a 1:1 mapping. It's about assuming a mapping exists at all for every case. It doesn't. While it does for the vast, vast majority of cases every language pair will have a non zero percentage of things that have no possible mapping that one could construe as a translation.

The problem most people seem to have understanding this is two-fold:

1. They assume if you just had more data and better algorithms you could get better results.

2. They have never translated things themselves and come across a case where something didn't have a translation.

Remove the machines from the equation entirely. It's not always possible in 100% of cases for people to do it.

Naturally linguistically similar languages have more overlap and hence better success overall but that's really just a nice to have.

No matter how similar English and French are, if you ask someone to translate a meme that started in 4chan or Reddit into French you will quickly encounter a case where attempting to do so just doesn't work. I'm sure there are plenty of better examples than that but I don't know French.

It's an elephant in the room and stunningly few people seem to see it standing there.

それは部屋の象で、驚くほど少数の人々がそこに立っているのを見ているようです。

Lol Google really??

In fact case in point 'elephant in the room' if it were used inside a joke that relied on using the elephant as part of the joke would not be possible to translate. It just wouldn't make sense.

nl · on Oct 15, 2018

I don't speak Japanese, so I'm not really sure what your complaint is.

I think you are complaining about the literal translation of what is supposed to be a metaphor. I see your point, but I'm not sure that is as big a problem as you imply.

https://www.ukessays.com/essays/english-language/newmark-and... is some useful reading here.

a_imho · on Oct 11, 2018

Though it is amusing watching the machines try.

I remember watching a program back in high school subtitling music videos with their lyrics machine translated from English to Hungarian. The absurdity was indeed hilarious for a brief period.

psergeant · on Oct 11, 2018

The grammar correction in Google Translate is a little too good. I was trying to create some broken Russian phrases to send a Russian friend, but I’d put in weird or bad English as an input and get very good Russian as an output!

skookumchuck · on Oct 11, 2018

I find that google translator does very well when the text to be translated has no spelling errors and is grammatically correct. Add any errors, and it falls to pieces, even though a human reader doesn't have any issues with it.