I would want to see some data on tokenization for some real-world examples. "Je voudrais une pizza" actually translates more directly to "I would like a pizza", which is 5 tokens. But I also think there's some danger here that these might be cherry-picked examples. Spanish is a lot more dense than English or French and might tokenize better. (I see "quiero pizza" is 4 tokens, which seems like the right number to me - "quiero" actually contains "I want <present tense>".) You could argue it's 2 or 3 tokens, but 4 seems preferable.
For diacritics in French or Spanish, the diacritics are logically part of the character. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat a pathological example like "Je l'ai aimé" as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as it's recognizing a complexity difference between French and English writing.
But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and like it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
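For anyone who wants to pull that kind of data themselves, here is a minimal sketch using OpenAI's tiktoken library; the cl100k_base encoding and the sample strings are just my assumptions, and exact counts vary by tokenizer version.

    # Rough comparison of token counts across languages.
    # Assumes `pip install tiktoken`; counts depend on the chosen encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by newer OpenAI models

    for text in ["I would like a pizza", "Je voudrais une pizza", "quiero pizza"]:
        tokens = enc.encode(text)
        pieces = [enc.decode([t]) for t in tokens]
        print(f"{text!r}: {len(tokens)} tokens -> {pieces}")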
> Spanish is a lot more dense than English or French and might tokenize better.
I'm no linguist, so I apologize if I'm misinterpreting this statement. My impression has always been that Spanish is less dense than English, only because in almost all cases, the Spanish version of product instructions is wordier. Look at the back of a shampoo bottle[0] and notice that the Spanish version is either longer, or a smaller font, to fit it all.
Instruction manuals are going to be translated, and they're hopefully verbose so as to be explicit.
One area where Spanish is more dense is verb forms, because it retains most of the inflected verbs of Latin, whereas English has lost or merged together a lot of the historical Indo-European inflections. Speaking intuitively, I think it, like most Latin languages, tends to be a bit more verbose with noun phrases.
Another way to measure this is speaking rate. What I remember from linguistics courses is 1) that while different cultures seem to speak at different average speeds, the information content transferred per second of speech seems to be remarkably consistent across languages; and 2) people speak Spanish more quickly than they speak English.
It's probably not a good idea to judge the density of a language by product instructions that are probably a minimally workable translation into the language.
One of the criteria of the most often used definition of open source [0] is that the program is free to use and modify for all purposes. So a noncommercial license would not qualify as open source. This is also a requirement of the free software definition of the FSF, which is also often used to define free/open source.
GNU zealots (and I consider myself one) need to take a deep breath and re-think what is being said.
It *IS* accurate to say that the program and source must be free to use for all purposes the recipient wants including commercial.
This was an involved conversation during the '90s, and yes, "no commercial reuse allowed" licenses are not -in fact- free licenses. I might be wrong but I have the impression they are not allowed on Debian CDs/DVDs for that reason.
> If I use a piece of software that has been obtained under the GNU GPL, am I allowed to modify the original code into a new program, then distribute and sell that new program commercially? (#GPLCommercially)
> You are allowed to sell copies of the modified program commercially, but only under the terms of the GNU GPL. Thus, for instance, you must make the source code available to the users of the program as described in the GPL, and they must be allowed to redistribute and modify it as described in the GPL.
> These requirements are the condition for including the GPL-covered code you received in a program of your own.
Wrong. Open Source does not have to bend the knee to whatever proprietary license you can dream up and shove into your code base.
>Source code: The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Does this apply to Linux? Check.
>Derived works: The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
Does this apply to Linux? Check.
Please visit https://opensource.org/licenses/ to see all of the licenses that are generally agreed upon to be Open Source licenses.
"open source" has a broader English meaning that predates the OSI for at least several decades. OSI does not have a trademark on "open source" because of this.
This is the software licensing world's version of "a hotdog is not a sandwich"
A lot of people confuse, say, Switzerland and Sweden, but this does not make it valid to call either by the other name. Likewise, “Open Source” has a precise definition, and people being confused does not make it less so. Of course, a lot of people are not actually confused, but are engaging disingenuously in order to dilute the term, so that they can use it for their own ends.
English isn't prescriptive. In English, if people use a word or a phrase to mean a thing, it means that thing. OSI has a widely observed technical definition, but it is not universal, and more colloquial uses of the word are recognized by linguists because they factually exist.
It would be one thing if you made that argument about some old term, like “mountain”, or “island”; those have definitions, but the edges are fuzzy and vary, since the terms are old and saturated since prehistorical times. With “Open Source”, it’s different. The wording existed previously, yes, but only as a technical term in intelligence gathering. Applied to software, on the other hand, the term is new, created by the OSI, which gave it a strict definition from day one. People cannot have heard of the term unless it came from OSI. Any claim of deviation from the OSI meaning, then, can be simply discarded as incorrect.
This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan are. The rules are the official rules which come with the box; anything else is house rules or custom rules, and cannot be used in something like an official tournament. When people say that “Settlers of Catan does this thing X”, and the official rules expressly say it does not do X, they are (knowingly or not) being misleading.
> The wording existed previously, yes, but only as a technical term in intelligence gathering. Applied to software, on the other hand, the term is new, created by the OSI, which gave it a strict definition from day one. People cannot have heard of the term unless it came from OSI. Any claim of deviation from the OSI meaning, then, can be simply discarded as incorrect.
All of these claims are untrue. Here is an example of open source being used to describe software in 1996. OSI was founded in 1998.
> This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan
Commercial board games typically use trademark law to prevent others from changing their rules. Popular games which do not have legally protected names often do have multiple sets of rules defined by different people, e.g. poker.
> All of these claims are untrue. Here is an example of open source being used to describe software in 1996.
Interesting. The attendees of the meeting on February 3rd, 1998 certainly all seem to think that they at least independently re-invented the term, so the term can’t have been very common. The meeting was held two weeks after the announcement of the release of the Netscape source code, and the announcement did not use the term.
The definition of “open source” is universally agreed upon to have the OSI-defined meaning, except for some people:
1. Intelligence community people, who have long understood the term “open source” to mean a source of intelligence which is not itself secret.
2. People who, without having ever looked it up, assume it means that the source code is available for reading. These people are simply ignorant, and should be using the term “source available” instead, since it means exactly that.
3. People who want to be able to use the “open source” term for their software to gain goodwill, but don’t want to actually give all of the freedoms it should guarantee. These people are dishonest shills who try to confuse the debate in order to get away with fraudulent labeling.
> People who want to be able to use the “open source” term for their software to gain goodwill, but don’t want to actually give all of the freedoms it should guarantee.
Or is "open source" just a term for "free" as in beer software that doesn't actually give people all the freedoms it should guarantee? Because that's what the FSF thinks.
Different people have different ideas about what freedoms people "should" have. Nobody was being dishonest about software freedoms when the BSD-4-clause or CC0 was written, or when people write licenses with 'no evil' or 'no nuclear proliferation' clauses.
> Or is "open source" just a term for "free" as in beer software that doesn't actually give people all the freedoms it should guarantee? Because that's what the FSF thinks.
No it isn’t. The OSI invented the term “Open Source” as applied to software, and they get to define its meaning as what they intended.
You misread my comment. That page explains why the FSF does in fact believe that open source software does not give people the freedoms it "should" guarantee.
The freedoms that a license "should" convey are not a fact; they are an opinion. And there are more than a few valid and honest opinions out there, even beyond the opinions of FSF/OSI/CC/UCB/USG/Apache/FAANG/whoever
How is that relevant? What does the opinion of the FSF (about what a licence “should” contain) have to do with what you consider to be the proper meaning of the term “Open Source”?
It is a response to your point numbered "3." above. There are honest and good-willed licenses which are not OSI-approved, written by honest and good-willed people who disagree with the OSI.
Yes, and? The FSF may disagree with the OSI on some matters, but the FSF does agree on the definition of the term “Open Source”, which was what we were discussing. Do you have a different definition of “Open Source” (as applied to software), and why should that definition take precedence over the definition from the OSI?
To bring it back to the point: The article claimed that “NLLB (No Language Left Behind) has been open sourced by Facebook”, which is misleading, since “open source” has a strict definition, and the license of NLLB does not satisfy the very first point of the OSI Open Source Definition. Facebook released the source code under an open license; they could even call it a Creative Commons license, which it was. But the article can’t truthfully call it “open source”, since it isn’t.
The OSI's licenses are only for software. If you open source things other than software, you’ll have to use a license that addresses those types of media. Which is what Facebook did. CC licenses are a popular way to “open source” non-software content.
You are again using the verb “open source” as a synonym for “release” or “freely license”. It is the very subject of this debate that I do not think this to be appropriate unless an OSI-compliant license is used; therefore, you cannot now use it as an argument in this same debate.
The OSD applies only to open source software. It is nonsensical when applied to non-software works. You can’t release the source code for a language model, because language models don’t have source code.
I disagree. You don’t get to decide what words mean. Open source means open source, that’s it. If you want it to mean something else you should’ve chosen a phrase that didn’t already have a meaning.
Sometimes old words and terms acquire new meanings. The only meaning “Open Source” had before the OSI was the intelligence “open sources” meaning. Is this the only meaning of “Open Source” you accept? If not, what is your definition, and why should that prevail over the OSI definition?
I accept that different people think it means different things, which makes me want to create a new phrase that doesn't already have a meaning. Open software? Not sure, but communication is hard when you co-opt phrases that have an intuitive meaning and try to supersede that.
Everyone I know ripped the leaked LLaMA models and is using them extensively in "open source" / commercial products - super unwise, but I'm not sure licensing is actually slowing down progress in this field. Even though I'm sure OpenAI is using other methods to make the language stuff work so well, I just wanted to comment on that front.
I wouldn't release a chatbot based on LLaMA 65B, because of the legal issues, I'm not sure others are using the same restraint.
Doesn't really matter. There's lots of positive transfer in individual language learning. Competence in one language bleeds into competence in others.
https://arxiv.org/abs/2108.13349
GPT-3 is fluent in many languages despite English taking up 93% of the corpus by word count. French is next with 1.8%
I posted on another thread that not only does GPT-4 handle Norwegian just fine (0.1% of the training data for GPT-3), but Norway has two official written languages that are mutually intelligible and close enough that some would consider them dialects, and GPT can also handle Nynorsk, the smaller of the two (Bokmål being the other), just fine.
Going one step further, I asked it to "translate" into both "Riksmål", an artificial conservative variant of Bokmål that basically rejects most of the last few decades' worth of language reforms, and the Romerike dialect (a dialect from the eastern part of Norway)... For the latter it gave me a lecture about how it varies internally in the region (which is correct) and presented a "translation" of a test sentence that is recognisably one of the variants from the northern part of the region.
Of course, for these, competence definitely bleeds over. They share an almost identical grammar and most of their orthography, but I'm impressed enough that it can handle Norwegian that well at all, much less that it knows the distinctions between the variants.
Yeah, its language skills are through the roof. There's no reason to talk to it in English. From what I can tell, it does a decent job of translating even out of languages like Southern Sami, with ~300 speakers and an utterly negligible training corpus. It seems it knows enough about grammar from related languages, and can infer enough from context (and maybe even etymology), that it does an OK job.
I tested it by giving it some news articles from NRK Sápmi and comparing the output with the Norwegian translations they provide.
Edit: Seems I may have gotten lucky that time, it's being a lot more, um, creative in its translation now. Or for all I know it could be changes in the model.
Looking at the basic ChatGPT (not GPT-4): while it can do reasonable translations for smaller languages and answer questions in them, the quality of the answers suffers significantly in my experience. If I ask the same factual question in two languages, I often see that the English one gets a correct answer while the smaller language gets a coherent hallucination. For big languages (French, Japanese, Spanish, etc.) that's not an issue, but for the smaller ones it clearly is.
Depends what you're doing. I haven't managed to make it continue after it stopped in the middle of a sentence in Japanese, but giving it the instruction to do so in English does. In some other cases, prompting in English (and asking for an answer in Japanese) can produce better results than giving the same prompt in Japanese.
Generating Japanese is slower than English (it's annoying on GPT-4); that's my reason to prefer English sometimes (especially for tech topics). ChatGPT web users don't pay for each token, but API users pay for each token, so they would make a different decision.
In my experience, while "continue" can work, "続けて" doesn't. At least not when making it rewrite large texts, which is when I hit the limit. With "continue", it continues rewriting. With "続けて", it tends to make up new text, that yes, is the continuation of what it was writing, but with no connection to the original text it was in the middle of rewriting.
This may be backwards. When AI can cheaply, quickly and with nuance intact translate between languages, it becomes easier to use a preferred non-dominant language, which would make English less dominant. There's less incentive to spend so much time learning this oddly irregular foreign tongue if the skill is embedded in your phone.
There's some nuance to this, I think. For one arbitrary example where this might not hold: NovelAI was trained on data from 'danbooru', an imageboard where people repost and tag art. All the tagging on that site is in English and they frequently also translate things like the author's description of the image and any in-image text. So if you were to use that site as a dataset, it would all be English.
Or is it? The original source content was in a mix of languages - english, japanese, chinese, korean, etc. It only then got translated into english and tagged in english. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.
Even if the original images have a mix of languages I think the tagging is all done in english (I may be wrong).
I would argue that the source material includes the tagging, as it is necessary for the AI to get trained, so the content is not really mixed but entirely english.
But anyways, the danbooru tags consist of things like: short hair, blue eyes, portrait. Things that are much easier to translate (or "understand") across several languages than the entire phrases GPT deals with.
Yeah, the danbooru tagging is done in english. However, if the art is sourced from places like Pixiv, those sites do tagging in the site's native language. My point is that the original content was in a mix of languages, but the process of tagging and training normalized it all into english and results in a situation where even the people who authored the original art will now pay more to use the resulting networks if billed per-token unless they learn English. So we're basically taking all this input from various cultures, Englishifying it, and then potentially billing them more if they want to keep using their native tongue. Kind of sad.
Libgen is 57% English (17% Russian, 8% German) [1]. By comparison, 10% of Wikipedia is in English [2] (going by number of files and number of articles respectively, both flawed metrics)
Though I feel that's answering a slightly different question. Data used to train currently popular models is mostly English, and the majority of data in sources popular in the anglosphere is English. Neither of these shows whether the majority of available media is English.
I'm wondering about this in the context of new programming languages. If people are using LLMs to learn a new language, will a new programming language be at a disadvantage until there's a critical mass of code, comparisons to existing languages, Rosetta Stone style examples, etc?
So what I got from this is that GPT was trained on a dataset that is biased toward English content. Is that right?
I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in it. I don't know about natural multilinguals.
Nope, it's not about the dataset. It's just a bad tokenizer. Korean has a couple dozen symbols in its alphabet. Cyrillic languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
> GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
GPT-4 has a much larger than 32k token vocabulary (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max values reported for their tokenizers). What it has is a 32k token context window (that is, the maximum size of prompt + response), not a 32k vocab.
But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might cover most alphabets in less than a thousand tokens, you need a lot more than the alphabet to handle a language.
I confused the LLaMA vocabulary size, which is indeed 32k, with the GPT-4 vocab size. Still, my point stands. You can add those characters there at minuscule cost.
> Korean has a couple dozen symbols in its alphabet.
While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that encode at the syllable level (where each syllable contains one or two consonants and one vowel) and the combinations for syllables are over 10000 (e.g. 11172 code points listed in Unicode, see [1]).
[0] In practice more, both to cater for modern and obsolete forms and to distinguish forms based on their position, i.e. with separate encodings for leading vs trailing consonants, etc.
In a bizarre coincidence I've just been working on code handling Korean cluster breaks and while it's true there's a lot of codepoints, the rules for handling them are mathematically trivial when considered as codepoint values.
(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)
Including those alphabets as letters or single glyphs would still leave it so that ドイツ would still take 3 tokens whereas "Germany" is one token ("germany" is two tokens: [ger][many]).
And tossing ドイツ into the tokenizer shows that it is 3 tokens.
Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"
The glyph-by-glyph approach to tokenizing non-English text is already present in the way you are describing it - and because it is glyph by glyph, it gets expanded out and consumes more tokens.
Korean gets rather interesting because 독일 is not one character but several - multiple sounds are combined into each glyph, and each glyph is one syllable. That word is 'dog-il' according to Google Translate. On the first glyph, ㄷ is 'd', ㅗ is 'o', and ㄱ is a trailing 'g'. On the second glyph, ㅇ is a silent initial, ㅣ is 'i', and ㄹ is a trailing 'l'.
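For the curious, here's a small sketch of that decomposition using the standard Unicode arithmetic for precomposed Hangul syllables (the jamo tables follow the Unicode lead/vowel/tail order; nothing here is specific to any tokenizer):

    # Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo.
    # Each syllable code point encodes (lead consonant, vowel, optional tail).
    LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"              # 19 leading consonants
    VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"         # 21 vowels
    TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 trailing consonants, or none

    def decompose(syllable: str):
        index = ord(syllable) - 0xAC00
        lead, rest = divmod(index, 21 * 28)
        vowel, tail = divmod(rest, 28)
        return LEADS[lead], VOWELS[vowel], TAILS[tail]

    for ch in "독일":
        print(ch, decompose(ch))
    # 독 ('ㄷ', 'ㅗ', 'ㄱ')
    # 일 ('ㅇ', 'ㅣ', 'ㄹ')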
using plain characters would make the sentences longer & cost much more money to use.
That's the idea of byte pair encoding based tokenizers: reduce the average sentence's number of tokens to an optimal (short) size to reduce the computational cost. In this case, most of its training data is in English, so it's going to have shorter sentences (in number of tokens) in English vs other languages.
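As a toy sketch of the mechanism (not OpenAI's actual implementation): training a byte pair encoder just means repeatedly merging the most frequent adjacent pair of symbols, so whatever is common in the training corpus - mostly English - ends up as single tokens.

    # Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
    # Real tokenizers do this over bytes on a huge corpus; this just shows the idea.
    from collections import Counter

    def train_bpe(corpus, num_merges):
        words = [list(w) for w in corpus]      # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merges.append((a, b))
            for w in words:                    # apply the merge everywhere
                i = 0
                while i < len(w) - 1:
                    if w[i] == a and w[i + 1] == b:
                        w[i:i + 2] = [a + b]
                    else:
                        i += 1
        return merges

    print(train_bpe(["lower", "lowest", "low", "low"], num_merges=3))
    # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')] - frequent substrings become single symbols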
There's dataset during training, and dataset for the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.
Remember, OpenAI's tokenizer was created in an era when 125M parameters was considered large for a language model. It's hard to fault them for making something that lasted four or five years.
> Remember, OpenAI’s tokenizer was created in an era when 125M parameters was considered large for a language model.
GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.
But even if they are using the tokenizer from GPT-3, it's not from “an era when 125M parameters was considered large for a language model”.
(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.
I think yes, but more precisely the tokens were chosen to optimize training on a dataset that's biased to English content.
I am curious how the token set affects quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc)
Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?
I don't know what you mean by compiler terms, but basically, worse tokenizer = worse LM performance. This is because a worse tokenizer means more tokens per sentence, so it takes more FLOPs to train on each sentence, on average. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
Data used to train the tokenizer is entirely separate from data training the LLM.
The tokenizer used for GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
GPT-4's tokenizer is already far more efficient though still weighted to English.
Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.
Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?
Both of you are kind of misunderstanding a few things. Data used to train the tokenizer is entirely separate from data training the LLM.
The tokenizer used for GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
GPT-4's tokenizer is already far more efficient though still weighted to English.
> GPT-4's tokenizer is already far more efficient though still weighted to English.
Right. It's a general question. Should they be allowed to take the kinds of optimizations they can with tokenization when it's a function of how much data they can use, even if that means some languages get more optimization than others? Or should users of those languages that could be optimized effectively pay a tax out of some sense of fairness?
"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?
Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.
There's always a should. Society gets a say in what people and corporations can and can't do in (at the very least) the form of laws. There's your should right there.
It's worth noting that this is only for GPT-3. If you're using ChatGPT or GPT-4, both use a different tokenizer that's more robust and uses/generates about 10% fewer tokens. (unclear how well it performs for non-English languages)
10% smaller vocab size, or 10% fewer tokens on average? I assume the latter, but total vocab size is also an interesting metric.
The tokenization speedups in that repo are very impressive. It was the most annoying part about processing 190,000 books. I think it took a few days on a server with 96 cores.
Surprisingly hard to figure out the vocab size from that repo.
It certainly makes training more expensive. One clever trick to get some memory savings is to freeze the vocab embedding layer when fine tuning. It makes a noticeable improvement, both in speed and in mem required.
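Roughly, with a Hugging Face-style model that trick can look like the sketch below (the model name is just a placeholder; whether it covers the output side too depends on whether the LM head is tied to the input embedding):

    # Sketch: freeze the vocab embedding during fine-tuning so no gradients or
    # optimizer state are kept for the (vocab_size x d_model) embedding matrix.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

    for param in model.get_input_embeddings().parameters():
        param.requires_grad = False

    # GPT-2 ties the LM head to the input embedding, so this freezes both;
    # for untied models, freeze model.get_output_embeddings() as well.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")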
Surprised they went the larger vocab route. LLaMA is only 30k. I wonder what the reason is...
A larger vocab takes longer to train but has no (practical) impact at inference time, as an embeddings index is just a key-value store, which is very helpful as GPT starts hitting scaling laws.
Both "je voudrais" and "j'aimerais" translate to "i would like", albeit with some nuances in the connotations.
The later has more of a wishful quality, more open to rejection. In spoken form, they're mostly interchangeables.
Now English is less optimized for token usage and other languages are much more balanced. E.g. Ukrainian takes only twice as many tokens, where before it took 6 times more.
So glad someone took the time to put up some data about it. Since day one, the subpar results for Asian languages have stuck out to me. It's especially true for LLaMA-derived models, where the output is just abysmal. It's my own pet theory that bad tokenization is an important reason why they suck so much in the first place.
It's not just broken grammar, it's a surprising lack of creativity that English doesn't suffer from. ChatGPT in English -> DeepL, then fixing the auto-translation, gives vastly better results than prompting ChatGPT to respond in an Asian language.
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English. E.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts, because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
Of course, most Indo-European languages have declensions or at least conjugation. That includes English, even if it is very simplified there.
CJK languages do not really have that; they don't even have conjugation. At best they have simple suffixes to mark a verb as interrogative.
Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul radicals doesn't really make sense if you're not going to tokenize by decomposed letter in romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
Words aren't an equivalent count between languages either. English uses a lot of helper words, some other languages use multiple suffixes. Chinese characters don't even make it clear where "word" boundaries are -- there are no spaces.
Really, only Thai? Is there a reference for that? A quick search suggests it’s not the case, but I’m no expert.
As a lowly beginner I find the lack of word boundaries in Thai frustrating, but I think it's just that I have not yet learned to think in syllables; I'm still always sounding them out in my head until I have a word I recognize, so there's no flow.
This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.
Right but such dictionaries are already built in to all major operating systems. The double-click-to-select-word interaction works well with Chinese and Japanese in all major operating systems. Without such dictionaries you can't even implement word selection.
It's more like some big languages receive special treatment, while everything else is interpreted as a byte stream. In Finnish, the tokens seem to be arbitrary substrings of average length 3-4, and they rarely correspond to any semantically or grammatically meaningful units.
Setting aside the specific choice of tokenizer for GPT models, I'm curious how much difference in performance is made by the features of the human language used to represent the training data. Like if you kept the exact same training corpus and could wave a magic wand and translate it into any language and could create a custom tokenization for each language, would some be more amenable than others to GPT-style language modeling?
I’m finding it amazing that the model comes localized, supports obscure languages, and is available at all. Compare this to traditional software. Or even to web software. Does Google come localized to all of these languages, for example?
Yes, there is overhead from localization. So what, this overhead was always there for software.
I think it’s because the article itself is a bit wrong: ‘voudrais’ in French is more analogous to ‘I would like’ in English than to ‘want’. Specifically, the ‘v-’ indicates that this means ‘to want’, ‘-oud-’ means that it is in the conditional or future, while ‘-ais’ would indicate first person conditional. This being said, it makes sense that ‘voudrais’ is more tokens than ‘want’, because it encodes more information.
Do other languages have as nice a mapping to tokens?
For example, if you were to go from French, you'd have 33 characters to work with rather than 26 (accents and such). And you'd have chemisier and chemisière being two different genders of the same word that are used in different contexts.
English tends to not have this difference.
Likewise, French has more verb conjugation forms than English does.
If you were to go to Japanese, you'd have the hiragana, katakana and kanji.
While my Anglocentrism may be showing, I'm not sure there is another language that tokenizes as well when it comes to novel character combinations.
Make up a new word. Use it in a sentence. Give a definition for it.
My new word is 'diflubble'. It is the feeling one gets when they are both excited and nervous in anticipation of an upcoming event.
For example, I felt diflubble on the morning of my graduation ceremony.
vs:
Make up a new word in Japanese. Use it in a sentence and give a translation for it. Give a definition for it.
My new Japanese word is "keigarou", which means "being full of energy".
例えば、私は今日、keigarouな気持ちでいます。
Translation: For example, I am feeling keigarou today.
The thing there is that you can't just make up new kanji. And it wouldn't be hiragana either.
It is not that tokenization is optimized for English, but rather the other way around perhaps.
Take "lampara" or "pantalones" in Spanish for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.
Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.
"La mesa" refers to a female table, although tables are not lifeforms and have no sex.
To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.
It's funny that you're calling English "effective" because it has shorter words, even though word length has nothing to do with tokenization effectiveness -- if a long word is frequent enough, it becomes a single token. That's the point of doing tokenization instead of feeding raw bytes into the model.
BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.
"la mesa" isn't a female table, it's just a table. If you want to specify that the table is female (in reality) then you might say "mesa hembra". The fact that "mesa" is _grammatically_ feminine is a red herring. It's a rule of the language that occasionally corresponds to nature, but that's in a very limited minority of cases. You can think of grammatical gender like an optional redundant bit (against, say mishearing) when giving some information, but since there's no other way to talk about a table it doesn't give any more information than "the table" when written down.
Also wrong are "a hour" and "an cats". Sometimes Spanish uses one word ("hablo") where English needs two ("I speak").
Comparative analysis of language isn't taboo [1]. It's just vastly more complicated than you make out, and the specific examples you chose aren't representative enough to support any point.
You're likely getting downvoted for misunderstanding basic socio-linguistic concepts in a way that belies the confidence of your argument: conflating biological and grammatical gender, implying that English was created by a committee of clever language designers, and a focus on letters and words over concepts and comprehension.
You use "an" when the word starts with a vowel _sound_, regardless of spelling, and pronunciation has to be memorized in English. "The class lasts an hour and he's getting an MBA" is the correct usage even though they both start with consonants.
One wonders whether highly agglutinative languages, then, might have even better performance than English in the tokenizer since they can pack much more meaning into a single word.
The linked article shows one such language, Malayalam, costing 15.7 times more. Try again.
If you familiarize yourself with ideographic/ideographic-adjacent languages like Japanese or Chinese you will probably notice that they are way more efficient than English. Yet those languages pay a tokenization tax too (thanks in no small part to the decisions of the largely western Unicode committees to favor western character sets - the UTF8 encoding favors ASCII tremendously)
I usually use the term "tokenization" to refer to breaking a text into "words" (tokens), although in the examples shown in the article, for the Latin script languages it seems to be doing tokenization into something like morphemes. This has nothing to do with the Unicode UTF-8 encoding system; Hindi would have the same number of tokens if you encode it with UTF-8 (where each character is 3 bytes) or ISCII (where each character is 1 byte).
But when it comes to Chinese...something weird is going on.
The behavior on Chinese is what makes me believe it's tokenizing on something like UTF-8 (hopefully normalized). I'm not sure how else you would get that behavior.
Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters from the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
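You can see the raw material that byte-level fallback has to work with just from UTF-8 lengths (a sketch; whether a given string actually falls back to single-byte tokens depends on which merges the tokenizer learned):

    # Characters outside the ASCII range take 2-4 bytes in UTF-8, so byte-level
    # fallback can spend several tokens on a single character.
    for text in ["Germany", "ドイツ", "德国", "Германия"]:
        print(f"{text}: {len(text)} chars, {len(text.encode('utf-8'))} UTF-8 bytes")
    # Germany: 7 chars, 7 bytes; ドイツ: 3 chars, 9 bytes; 德国: 2 chars, 6 bytes; Германия: 8 chars, 16 bytes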
That might not be true. OpenAI do set a limit of the total number of tokens, and since I'm pretty sure they trained the model and the tokenizer on mostly English text, I assume there's a somewhat proportional bias toward English based on the input dataset to those models.
Different languages have different levels of conciseness of course, but I highly doubt that Spanish is anywhere close to 15x less concise than English.
Eh… “la mesa” is “the table”; English wins. Even in context, Spanish conjugation rules allow you to elide pronouns in many cases that would be confusing in English.
The reason Spanish might encode longer is that the tokenization scheme compacts tokens based on popularity in the training data, and most of the training data was English. No more, no less.
wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.
tl;dr: there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently