I would want to see some data on tokenization for some real-world examples. "Je voudrais une pizza" actually translates more directly to "I would like a pizza", which is 5 tokens. But I also think there's some danger here that these might be cherry-picked examples. Spanish is a lot more dense than English or French and might tokenize better. (I see "quiero pizza" is 4 tokens, which seems like the right number to me - "quiero" actually contains "I want <present tense>".) You could argue it's 2 or 3 tokens, but 4 seems preferable.
For diacritics in French or Spanish, the diacritics are logically part of the character. I can't think of an example where it's actually useful to split the letter into a different token, but I could see it happening and not being harmful. I do think it's possible French is just weird and just needs more tokens. When I think about how I process French, I probably do treat a pathological example like "Je l'ai aimé" as 3 tokens when I speak it out loud. But I can also see why you would tokenize it as 6 tokens; I'm not sure that's Anglocentrism so much as it's recognizing a complexity difference between French and English writing.
But all this is in contrast to how non-Roman characters are tokenized at the byte level. That just seems bad, and like it's definitely going to make things worse for non-Roman languages. There's no point in having tokens that split characters.
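For anyone who wants to pull that kind of data themselves, here is a minimal sketch using OpenAI's tiktoken library; the cl100k_base encoding and the sample strings are just my assumptions, and exact counts vary by tokenizer version.

    # Rough comparison of token counts across languages.
    # Assumes `pip install tiktoken`; counts depend on the chosen encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by newer OpenAI models

    for text in ["I would like a pizza", "Je voudrais une pizza", "quiero pizza"]:
        tokens = enc.encode(text)
        pieces = [enc.decode([t]) for t in tokens]
        print(f"{text!r}: {len(tokens)} tokens -> {pieces}")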
> Spanish is a lot more dense than English or French and might tokenize better.
I'm no linguist, so I apologize if I'm misinterpreting this statement. My impression has always been that Spanish is less dense than English, only because in almost all cases, the Spanish version of product instructions is wordier. Look at the back of a shampoo bottle[0] and notice that the Spanish version is either longer, or a smaller font, to fit it all.
Instruction manuals are going to be translated, and they're hopefully verbose so as to be explicit.
One area where Spanish is more dense is verb forms, because it retains most of the inflected verbs of Latin, whereas English has lost or merged together a lot of the historical Indo-European inflections. Speaking intuitively, I think it, like most Latin languages, tends to be a bit more verbose with noun phrases.
Another way to measure this is speaking rate. What I remember from linguistics courses is 1) that while different cultures seem to speak at different average speeds, the information content transferred per second of speech seems to be remarkably consistent across languages; and 2) people speak Spanish more quickly than they speak English.
It's probably not a good idea to judge the density of a language by product instructions that are probably a minimally workable translation into the language.
One of the criteria of the most often used definition of open source [0] is that the program is free to use and modify for all purposes. So a noncommercial license would not qualify as open source. This is also a requirement of the free software definition of the FSF, which is also often used to define free/open source.
GNU zealots (and I consider myself one) need to take a deep breath and re-think what is being said.
It *IS* accurate to say that the program and source must be free to use for all purposes the recipient wants including commercial.
This was an involved conversation during the '90s, and yes, "no commercial reuse allowed" licenses are not -in fact- free licenses. I might be wrong but I have the impression they are not allowed on Debian CDs/DVDs for that reason.
> If I use a piece of software that has been obtained under the GNU GPL, am I allowed to modify the original code into a new program, then distribute and sell that new program commercially? (#GPLCommercially)
> You are allowed to sell copies of the modified program commercially, but only under the terms of the GNU GPL. Thus, for instance, you must make the source code available to the users of the program as described in the GPL, and they must be allowed to redistribute and modify it as described in the GPL.
> These requirements are the condition for including the GPL-covered code you received in a program of your own.
Wrong. Open Source does not have to bend the knee to whatever proprietary license you can dream up and shove into your code base.
>Source code: The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Does this apply to Linux? Check.
>Derived works: The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.
Does this apply to Linux? Check.
Please visit https://opensource.org/licenses/ to see all of the licenses that are generally agreed upon to be Open Source licenses.
"open source" has a broader English meaning that predates the OSI for at least several decades. OSI does not have a trademark on "open source" because of this.
This is the software licensing world's version of "a hotdog is not a sandwich"
A lot of people confuse, say, Switzerland and Sweden, but this does not make it valid to call either by the other name. Likewise, “Open Source” has a precise definition, and people being confused does not make it less so. Of course, a lot of people are not actually confused, but are engaging disingenuously in order to dilute the term, so that they can use it for their own ends.
English isn't prescriptive. In English, if people use a word or a phrase to mean a thing, it means that thing. OSI has a widely observed technical definition, but it is not universal, and more colloquial uses of the word are recognized by linguists because they factually exist.
It would be one thing if you made that argument about some old term, like “mountain”, or “island”; those have definitions, but the edges are fuzzy and vary, since the terms are old and saturated since prehistorical times. With “Open Source”, it’s different. The wording existed previously, yes, but only as a technical term in intelligence gathering. Applied to software, on the other hand, the term is new, created by the OSI, which gave it a strict definition from day one. People cannot have heard of the term unless it came from OSI. Any claim of deviation from the OSI meaning, then, can be simply discarded as incorrect.
This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan are. The rules are the official rules which come with the box; anything else is house rules or custom rules, and cannot be used in something like an official tournament. When people say that “Settlers of Catan does this thing X”, and the official rules expressly say it does not do X, they are (knowingly or not) being misleading.
> The wording existed previously, yes, but only as a technical term in intelligence gathering. Applied to software, on the other hand, the term is new, created by the OSI, which gave it a strict definition from day one. People cannot have heard of the term unless it came from OSI. Any claim of deviation from the OSI meaning, then, can be simply discarded as incorrect.
All of these claims are untrue. Here is an example of open source being used to describe software in 1996. OSI was founded in 1998.
> This debate is beyond silly. It’s like arguing about what the rules of, say, Settlers of Catan
Commercial board games typically use trademark law to prevent others from changing their rules. Popular games which do not have legally protected names often do have multiple sets of rules defined by different people, e.g. poker.
> All of these claims are untrue. Here is an example of open source being used to describe software in 1996.
Interesting. The attendees of the meeting on February 3rd, 1998 certainly all seem to think that they at least independently re-invented the term, so the term can’t have been very common. The meeting was held two weeks after the announcement of the release of the Netscape source code, and the announcement did not use the term.
The definition of “open source” is universally agreed upon to have the OSI-defined meaning, except for some people:
1. Intelligence community people, who have long understood the term “open source” to mean a source of intelligence which is not itself secret.
2. People who, without having ever looked it up, assume it means that the source code is available for reading. These people are simply ignorant, and should be using the term “source available” instead, since it means exactly that.
3. People who want to be able to use the “open source” term for their software to gain goodwill, but don’t want to actually give all of the freedoms it should guarantee. These people are dishonest shills who try to confuse the debate in order to get away with fraudulent labeling.
> People who want to be able to use the “open source” term for their software to gain goodwill, but don’t want to actually give all of the freedoms it should guarantee.
Or is "open source" just a term for "free" as in beer software that doesn't actually give people all the freedoms it should guarantee? Because that's what the FSF thinks.
Different people have different ideas about what freedoms people "should" have. Nobody was being dishonest about software freedoms when the BSD-4-clause or CC0 was written, or when people write licenses with 'no evil' or 'no nuclear proliferation' clauses.
> Or is "open source" just a term for "free" as in beer software that doesn't actually give people all the freedoms it should guarantee? Because that's what the FSF thinks.
No it isn’t. The OSI invented the term “Open Source” as applied to software, and they get to define its meaning as what they intended.
You misread my comment. That page explains why the FSF does in fact believe that open source software does not give people the freedoms it "should" guarantee.
The freedoms that a license "should" convey are not a fact; they are an opinion. And there are more than a few valid and honest opinions out there, even beyond the opinions of FSF/OSI/CC/UCB/USG/Apache/FAANG/whoever
How is that relevant? What does the opinion of the FSF (about what a licence “should” contain) have to do with what you consider to be the proper meaning of the term “Open Source”?
It is a response to your point numbered "3." above. There are honest and good-willed licenses which are not OSI-approved, written by honest and good-willed people who disagree with the OSI.
Yes, and? The FSF may disagree with the OSI on some matters, but the FSF does agree on the definition of the term “Open Source”, which was what we were discussing. Do you have a different definition of “Open Source” (as applied to software), and why should that definition take precedence over the definition from the OSI?
To bring it back to the point: The article claimed that “NLLB (No Language Left Behind) has been open sourced by Facebook”, which is misleading, since “open source” has a strict definition, and the license of NLLB does not satisfy the very first point of the OSI Open Source Definition. Facebook released the source code under an open license; they could even call it a Creative Commons license, which it was. But the article can’t truthfully call it “open source”, since it isn’t.
The OSI's licenses are only for software. If you open source things other than software, you’ll have to use a license that addresses those types of media. Which is what Facebook did. CC licenses are a popular way to “open source” non-software content.
You are again using the verb “open source” as a synonym for “release” or “freely license”. It is the very subject of this debate that I do not think this to be appropriate unless an OSI-compliant license is used; therefore, you cannot now use it as an argument in this same debate.
The OSD applies only to open source software. It is nonsensical when applied to non-software works. You can’t release the source code for a language model, because language models don’t have source code.
I disagree. You don’t get to decide what words mean. Open source means open source, that’s it. If you want it to mean something else you should’ve chosen a phrase that didn’t already have a meaning.
Sometimes old words and terms acquire new meanings. The only meaning “Open Source” had before the OSI was the intelligence “open sources” meaning. Is this the only meaning of “Open Source” you accept? If not, what is your definition, and why should that prevail over the OSI definition?
I accept that different people think it means different things, which makes me want to create a new phrase that doesn't already have a meaning. Open software? Not sure, but communication is hard when you co-opt phrases that have an intuitive meaning and try to supersede that.
Everyone I know ripped the leaked LLaMA models and is using them extensively in "open source" / commercial products - super unwise, but I'm not sure licensing is actually slowing down progress in this field. Even though I'm sure OpenAI is using other methods to make the language stuff work so well, I just wanted to comment on that front.
I wouldn't release a chatbot based on LLaMA 65B, because of the legal issues, I'm not sure others are using the same restraint.
Doesn't really matter. There's lots of positive transfer in individual language learning. Competence in one language bleeds into competence in others.
https://arxiv.org/abs/2108.13349
GPT-3 is fluent in many languages despite English taking up 93% of the corpus by word count. French is next with 1.8%
I posted on another thread that not only does GPT-4 handle Norwegian just fine (0.1% of the training data for GPT-3), but Norway has two official written languages that are mutually intelligible and close enough that some would consider them dialects, and GPT can also handle Nynorsk, the smaller of the two (Bokmål being the other), just fine.
Going one step further, I asked it to "translate" into both "Riksmål", an artificial conservative variant of Bokmål that basically rejects most of the last few decades' worth of language reforms, and the Romerike dialect (a dialect from the eastern part of Norway)... For the latter it gave me a lecture about how it varies internally in the region (which is correct) and presented a "translation" of a test sentence that is recognisably one of the variants from the northern part of the region.
Of course, for these, competence definitely bleeds over. They share an almost identical grammar and most of their orthography, but I'm impressed enough that it can handle Norwegian that well at all, much less that it knows the distinctions between the variants.
Yeah, its language skills are through the roof. There's no reason to talk to it in English. From what I can tell, it does a decent job of translating even out of languages like Southern Sami, with ~300 speakers and an utterly negligible training corpus. It seems it knows enough about grammar from related languages, and can infer enough from context (and maybe even etymology), that it does an OK job.
I tested it by giving it some news articles from NRK Sápmi and comparing the output with the Norwegian translations they provide.
Edit: Seems I may have gotten lucky that time, it's being a lot more, um, creative in its translation now. Or for all I know it could be changes in the model.
Looking at the basic ChatGPT (not GPT-4): while it can do reasonable translations for smaller languages and answer questions in them, the quality of the answers suffers significantly in my experience. If I ask the same factual question in two languages, I often see that the English one gets a correct answer while the smaller language gets a coherent hallucination. For big languages (French, Japanese, Spanish, etc.) that's not an issue, but for the smaller ones it clearly is.
Depends what you're doing. I haven't managed to make it continue after it stopped in the middle of a sentence in Japanese, but giving it the instruction to do so in English does. In some other cases, prompting in English (and asking for an answer in Japanese) can produce better results than giving the same prompt in Japanese.
Generating Japanese is slower than English (it's annoying on GPT-4); that's my reason to prefer English sometimes (especially for tech topics). ChatGPT web users don't pay for each token, but API users pay for each token, so they would make a different decision.
In my experience, while "continue" can work, "続けて" doesn't. At least not when making it rewrite large texts, which is when I hit the limit. With "continue", it continues rewriting. With "続けて", it tends to make up new text, that yes, is the continuation of what it was writing, but with no connection to the original text it was in the middle of rewriting.
This may be backwards. When AI can cheaply, quickly and with nuance intact translate between languages, it becomes easier to use a preferred non-dominant language, which would make English less dominant. There's less incentive to spend so much time learning this oddly irregular foreign tongue if the skill is embedded in your phone.
There's some nuance to this, I think. For one arbitrary example where this might not hold: NovelAI was trained on data from 'danbooru', an imageboard where people repost and tag art. All the tagging on that site is in English and they frequently also translate things like the author's description of the image and any in-image text. So if you were to use that site as a dataset, it would all be English.
Or is it? The original source content was in a mix of languages - english, japanese, chinese, korean, etc. It only then got translated into english and tagged in english. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.
Even if the original images have a mix of languages I think the tagging is all done in english (I may be wrong).
I would argue that the source material includes the tagging, as it is necessary for the AI to get trained, so the content is not really mixed but entirely english.
But anyways, the danbooru tags consist of things like: short hair, blue eyes, portrait. Things that are much easier to translate (or "understand") across several languages than the entire phrases GPT deals with.
Yeah, the danbooru tagging is done in english. However, if the art is sourced from places like Pixiv, those sites do tagging in the site's native language. My point is that the original content was in a mix of languages, but the process of tagging and training normalized it all into english and results in a situation where even the people who authored the original art will now pay more to use the resulting networks if billed per-token unless they learn English. So we're basically taking all this input from various cultures, Englishifying it, and then potentially billing them more if they want to keep using their native tongue. Kind of sad.
Libgen is 57% English (17% Russian, 8% German) [1]. By comparison, 10% of Wikipedia is in English [2] (going by number of files and number of articles respectively, both flawed metrics)
Though I feel that's answering a slightly different question. Data used to train currently popular models is mostly English, and the majority of data in sources popular in the anglosphere is English. Neither of these shows whether the majority of available media is English.
I'm wondering about this in the context of new programming languages. If people are using LLMs to learn a new language, will a new programming language be at a disadvantage until there's a critical mass of code, comparisons to existing languages, Rosetta Stone style examples, etc?
So what I got from this is that GPT was trained on a dataset that is biased toward English content. Is that right?
I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in it. I don't know about natural multilinguals.
Nope, it's not about the dataset. It's just a bad tokenizer. Korean has a couple dozen symbols in its alphabet. Cyrillic languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
> GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
GPT-4 has a much larger than 32k token vocabulary (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max values reported for their tokenizers). What it has is a 32k token context window (that is, the maximum size of prompt + response), not a 32k vocab.
But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might cover most alphabets in less than a thousand tokens, you need a lot more than the alphabet to handle a language.
I confused the LLaMA vocabulary size, which is indeed 32k, with the GPT-4 vocab size. Still, my point stands. You can add those characters there at minuscule cost.
> Korean has a couple dozen symbols in its alphabet.
While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that encode at the syllable level (where each syllable contains one or two consonants and one vowel) and the combinations for syllables are over 10000 (e.g. 11172 code points listed in Unicode, see [1]).
[0] In practice more, both to cater for modern and obsolete forms and to distinguish forms based on their position, i.e. with separate encodings for leading vs trailing consonants, etc.
In a bizarre coincidence I've just been working on code handling Korean cluster breaks and while it's true there's a lot of codepoints, the rules for handling them are mathematically trivial when considered as codepoint values.
(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)
Including those alphabets as letters or single glyphs would still leave it so that ドイツ would still take 3 tokens whereas "Germany" is one token ("germany" is two tokens: [ger][many]).
And tossing ドイツ into the tokenizer shows that it is 3 tokens.
Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"
The glyph-by-glyph approach to tokenizing non-English text is already present in the way you are describing it - and because it is glyph by glyph, it gets expanded out and consumes more tokens.
Korean gets rather interesting because 독일 is not one character but several - multiple sounds are combined into each glyph, and each glyph is one syllable. That word is 'dog-il' according to Google Translate. On the first glyph, ㄷ is 'd', ㅗ is 'o', and ㄱ is a trailing 'g'. On the second glyph, ㅇ is a silent initial, ㅣ is 'i', and ㄹ is a trailing 'l'.
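For the curious, here's a small sketch of that decomposition using the standard Unicode arithmetic for precomposed Hangul syllables (the jamo tables follow the Unicode lead/vowel/tail order; nothing here is specific to any tokenizer):

    # Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo.
    # Each syllable code point encodes (lead consonant, vowel, optional tail).
    LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"              # 19 leading consonants
    VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"         # 21 vowels
    TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 trailing consonants, or none

    def decompose(syllable: str):
        index = ord(syllable) - 0xAC00
        lead, rest = divmod(index, 21 * 28)
        vowel, tail = divmod(rest, 28)
        return LEADS[lead], VOWELS[vowel], TAILS[tail]

    for ch in "독일":
        print(ch, decompose(ch))
    # 독 ('ㄷ', 'ㅗ', 'ㄱ')
    # 일 ('ㅇ', 'ㅣ', 'ㄹ')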
using plain characters would make the sentences longer & cost much more money to use.
That's the idea of byte pair encoding based tokenizers: reduce the average sentence's number of tokens to an optimal (short) size to reduce the computational cost. In this case, most of its training data is in English, so it's going to have shorter sentences (in number of tokens) in English vs other languages.
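As a toy sketch of the mechanism (not OpenAI's actual implementation): training a byte pair encoder just means repeatedly merging the most frequent adjacent pair of symbols, so whatever is common in the training corpus - mostly English - ends up as single tokens.

    # Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
    # Real tokenizers do this over bytes on a huge corpus; this just shows the idea.
    from collections import Counter

    def train_bpe(corpus, num_merges):
        words = [list(w) for w in corpus]      # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merges.append((a, b))
            for w in words:                    # apply the merge everywhere
                i = 0
                while i < len(w) - 1:
                    if w[i] == a and w[i + 1] == b:
                        w[i:i + 2] = [a + b]
                    else:
                        i += 1
        return merges

    print(train_bpe(["lower", "lowest", "low", "low"], num_merges=3))
    # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')] - frequent substrings become single symbols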
There's dataset during training, and dataset for the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.
Remember, OpenAI's tokenizer was created in an era when 125M parameters was considered large for a language model. It's hard to fault them for making something that lasted four or five years.
> Remember, OpenAI’s tokenizer was created in an era when 125M parameters was considered large for a language model.
GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.
But even if they are using the tokenizer from GPT-3, it's not from “an era when 125M parameters was considered large for a language model”.
(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.
I think yes, but more precisely the tokens were chosen to optimize training on a dataset that's biased to English content.
I am curious how the token set affects quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc)
Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?
I don't know what you mean by compiler terms, but basically, worse tokenizer = worse LM performance. This is because a worse tokenizer means more tokens per sentence, so it takes more FLOPs to train on each sentence, on average. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
Data used to train the tokenizer is entirely separate from data training the LLM.
The tokenizer used for GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
GPT-4's tokenizer is already far more efficient though still weighted to English.
Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.
Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?
Both of you are kind of misunderstanding a few things. Data used to train the tokenizer is entirely separate from data training the LLM.
The tokenizer used for GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
GPT-4's tokenizer is already far more efficient though still weighted to English.
> GPT-4's tokenizer is already far more efficient though still weighted to English.
Right. It's a general question. Should they be allowed to take the kinds of optimizations they can with tokenization when it's a function of how much data they can use, even if that means some languages get more optimization than others? Or should users of those languages that could be optimized effectively pay a tax out of some sense of fairness?
"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?
Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.
There's always a should. Society gets a say in what people and corporations can and can't do in (at the very least) the form of laws. There's your should right there.
It's worth noting that this is only for GPT-3. If you're using ChatGPT or GPT-4, both use a different tokenizer that's more robust and uses/generates about 10% fewer tokens. (unclear how well it performs for non-English languages)
10% smaller vocab size, or 10% fewer tokens on average? I assume the latter, but total vocab size is also an interesting metric.
The tokenization speedups in that repo are very impressive. It was the most annoying part about processing 190,000 books. I think it took a few days on a server with 96 cores.
Surprisingly hard to figure out the vocab size from that repo.
It certainly makes training more expensive. One clever trick to get some memory savings is to freeze the vocab embedding layer when fine tuning. It makes a noticeable improvement, both in speed and in mem required.
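Roughly, with a Hugging Face-style model that trick can look like the sketch below (the model name is just a placeholder; whether it covers the output side too depends on whether the LM head is tied to the input embedding):

    # Sketch: freeze the vocab embedding during fine-tuning so no gradients or
    # optimizer state are kept for the (vocab_size x d_model) embedding matrix.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

    for param in model.get_input_embeddings().parameters():
        param.requires_grad = False

    # GPT-2 ties the LM head to the input embedding, so this freezes both;
    # for untied models, freeze model.get_output_embeddings() as well.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")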
Surprised they went the larger vocab route. LLaMA is only 30k. I wonder what the reason is...
A larger vocab takes longer to train but has no (practical) impact at inference time, as an embeddings index is just a key-value store, which is very helpful as GPT starts hitting scaling laws.
Both "je voudrais" and "j'aimerais" translate to "i would like", albeit with some nuances in the connotations.
The later has more of a wishful quality, more open to rejection. In spoken form, they're mostly interchangeables.
Now English is less optimized for token usage and other languages are much more balanced. E.g. Ukrainian takes only twice as many tokens, where before it took 6 times more.
So glad someone took the time to put up some data about it. Since day one, the subpar results for Asian languages have stuck out to me. It's especially true for LLaMA-derived models, where the output is just abysmal. It's my own pet theory that bad tokenization is an important reason why they suck so much in the first place.
It's not just broken grammar, it's a surprising lack of creativity that English doesn't suffer from. ChatGPT in English -> DeepL, then fixing the auto-translation, gives vastly better results than prompting ChatGPT to respond in an Asian language.
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English. E.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts, because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
Of course, most Indo-European languages have declensions or at least conjugation. That includes English, even if it is very simplified there.
CJK languages do not really have that; they don't even have conjugation. At best they have simple suffixes to mark a verb as interrogative.
Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul radicals doesn't really make sense if you're not going to tokenize by decomposed letter in romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
Words aren't an equivalent count between languages either. English uses a lot of helper words, some other languages use multiple suffixes. Chinese characters don't even make it clear where "word" boundaries are -- there are no spaces.
Really, only Thai? Is there a reference for that? A quick search suggests it’s not the case, but I’m no expert.
As a lowly beginner I find the lack of word boundaries in Thai frustrating, but I think it's just that I have not yet learned to think in syllables; I'm still always sounding them out in my head until I have a word I recognize, so there's no flow.
This seems like something the LLMs should be very good at. Google Translate does OK-ish while Apple just throws up its hands in frustration and refuses to translate Thai texts.
Right but such dictionaries are already built in to all major operating systems. The double-click-to-select-word interaction works well with Chinese and Japanese in all major operating systems. Without such dictionaries you can't even implement word selection.
It's more like some big languages receive special treatment, while everything else is interpreted as a byte stream. In Finnish, the tokens seem to be arbitrary substrings of average length 3-4, and they rarely correspond to any semantically or grammatically meaningful units.
Setting aside the specific choice of tokenizer for GPT models, I'm curious how much difference in performance is made by the features of the human language used to represent the training data. Like if you kept the exact same training corpus and could wave a magic wand and translate it into any language and could create a custom tokenization for each language, would some be more amenable than others to GPT-style language modeling?
I’m finding it amazing that the model comes localized, supports obscure languages, and is available at all. Compare this to traditional software. Or even to web software. Does Google come localized to all of these languages, for example?
Yes, there is overhead from localization. So what, this overhead was always there for software.
I think it’s because the article itself is a bit wrong: ‘voudrais’ in French is more analogous to ‘I would like’ in English than to ‘want’. Specifically, the ‘v-’ indicates that this means ‘to want’, ‘-oud-’ means that it is in the conditional or future, while ‘-ais’ would indicate first person conditional. This being said, it makes sense that ‘voudrais’ is more tokens than ‘want’, because it encodes more information.
Do other languages have as nice a mapping to tokens?
For example, if you were to go from French, you'd have 33 characters to work with rather than 26 (accents and such). And you'd have chemisier and chemisière being two different genders of the same word that are used in different contexts.
English tends to not have this difference.
Likewise, French has more verb conjugation forms than English does.
If you were to go to Japanese, you'd have the hiragana, katakana and kanji.
While my Anglocentrism may be showing, I'm not sure there is another language that tokenizes as well when it comes to novel character combinations.
Make up a new word. Use it in a sentence. Give a definition for it.
My new word is 'diflubble'. It is the feeling one gets when they are both excited and nervous in anticipation of an upcoming event.
For example, I felt diflubble on the morning of my graduation ceremony.
vs:
Make up a new word in Japanese. Use it in a sentence and give a translation for it. Give a definition for it.
My new Japanese word is "keigarou", which means "being full of energy".
例えば、私は今日、keigarouな気持ちでいます。
Translation: For example, I am feeling keigarou today.
The thing there is that you can't just make up new kanji. And it wouldn't be hiragana either.
It is not that tokenization is optimized for English, but rather the other way around perhaps.
Take "lampara" or "pantalones" in Spanish for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively. And they have done this with many words.
Translate text into Spanish and you will see text gets longer and there is more meaning encoded into words.
"La mesa" refers to a female table, although tables are not lifeforms and have no sex.
To me some languages impose a communication tax. It is taboo because people conflate language and culture and such.
It's funny that you're calling English "effective" because it has shorter words, even though word length has nothing to do with tokenization effectiveness -- if a long word is frequent enough, it becomes a single token. That's the point of doing tokenization instead of feeding raw bytes into the model.
BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.
"la mesa" isn't a female table, it's just a table. If you want to specify that the table is female (in reality) then you might say "mesa hembra". The fact that "mesa" is _grammatically_ feminine is a red herring. It's a rule of the language that occasionally corresponds to nature, but that's in a very limited minority of cases. You can think of grammatical gender like an optional redundant bit (against, say mishearing) when giving some information, but since there's no other way to talk about a table it doesn't give any more information than "the table" when written down.
Also wrong are "a hour" and "an cats". Sometimes Spanish uses one word ("hablo") where English needs two ("I speak").
Comparative analysis of language isn't taboo [1]. It's just vastly more complicated than you make out, and the specific examples you chose aren't representative enough to support any point.
You're likely getting downvoted for misunderstanding basic socio-linguistic concepts in a way that belies the confidence of your argument: conflating biological and grammatical gender, implying that English was created by a committee of clever language designers, and a focus on letters and words over concepts and comprehension.
You use "an" when the word starts with a vowel _sound_, regardless of spelling, and pronunciation has to be memorized in English. "The class lasts an hour and he's getting an MBA" is the correct usage even though they both start with consonants.
One wonders whether highly agglutinative languages, then, might have even better performance than English in the tokenizer since they can pack much more meaning into a single word.
The linked article shows one such language, Malayalam, costing 15.7 times more. Try again.
If you familiarize yourself with ideographic/ideographic-adjacent languages like Japanese or Chinese you will probably notice that they are way more efficient than English. Yet those languages pay a tokenization tax too (thanks in no small part to the decisions of the largely western Unicode committees to favor western character sets - the UTF8 encoding favors ASCII tremendously)
I usually use the term "tokenization" to refer to breaking a text into "words" (tokens), although in the examples shown in the article, for the Latin script languages it seems to be doing tokenization into something like morphemes. This has nothing to do with the Unicode UTF-8 encoding system; Hindi would have the same number of tokens if you encode it with UTF-8 (where each character is 3 bytes) or ISCII (where each character is 1 byte).
But when it comes to Chinese...something weird is going on.
The behavior on Chinese is what makes me believe it's tokenizing on something like UTF-8 (hopefully normalized). I'm not sure how else you would get that behavior.
Tokens for non-English languages that are groups of characters just suggest that common groups of 2-3 characters from the training set became tokens, which feels unsurprising. The fallback behavior would be 1 UTF-8 byte = 1 token.
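You can see the raw material that byte-level fallback has to work with just from UTF-8 lengths (a sketch; whether a given string actually falls back to single-byte tokens depends on which merges the tokenizer learned):

    # Characters outside the ASCII range take 2-4 bytes in UTF-8, so byte-level
    # fallback can spend several tokens on a single character.
    for text in ["Germany", "ドイツ", "德国", "Германия"]:
        print(f"{text}: {len(text)} chars, {len(text.encode('utf-8'))} UTF-8 bytes")
    # Germany: 7 chars, 7 bytes; ドイツ: 3 chars, 9 bytes; 德国: 2 chars, 6 bytes; Германия: 8 chars, 16 bytes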
That might not be true. OpenAI do set a limit of the total number of tokens, and since I'm pretty sure they trained the model and the tokenizer on mostly English text, I assume there's a somewhat proportional bias toward English based on the input dataset to those models.
Different languages have different levels of conciseness of course, but I highly doubt that Spanish is anywhere close to 15x less concise than English.
Eh… “la mesa” is “the table”; English wins. Even in context, Spanish conjugation rules allow you to elide pronouns in many cases that would be confusing in English.
The reason Spanish might encode longer is that the tokenization scheme compacts tokens based on popularity in the training data, and most of the training data was English. No more, no less.
wrt your Spanish example: grammatical gender adds information redundancy to make it easier to process spoken language (e.g. helps with reference resolution). This redundancy enables Spanish speakers to speak at a relatively fast rate without incurring perception errors. English has fewer words but a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech as a language evolutionary constraint has implications for learnability.
tl;dr: there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently