You are a teenager who needs oral contraceptives because you are sexually active. You don't want your parents to find out. Since you're a teenager, you have a few constraints:
- you have no car, so how do you get to your doctor's appointment without asking your parents for a ride?
- you are a minor, so do you have any guarantee that your doctor won't tell your parents? You can't risk them finding out; they are very conservative
- you may never have made a doctor's appointment for yourself before, and maybe don't have access to insurance information, etc.
Planned Parenthood provides BCPs at a price you can afford with your teenage job (and guarantees privacy), but the closest one is hours away...
Certain brands (e.g. SkinnyPop) advertise their bags as "chemical-free" (SkinnyPop claims theirs are free of PFOA). Can anyone help me understand/verify these kinds of claims?
Does anyone know of a good piece of writing about what has made TSMC so successful, what makes their management so good, etc.? Seems like an operational exemplar I'd like to learn more about.
wrt your Spanish example: grammatical gender adds information redundancy that makes spoken language easier to process (e.g. it helps with reference resolution). This redundancy lets Spanish speakers talk at a relatively fast rate without incurring perception errors. English needs fewer words to say the same thing, but has a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech, as an evolutionary constraint on language, has implications for learnability.
tl;dr there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently.
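To make the trade-off concrete, here's a minimal sketch with made-up numbers (purely illustrative, not measurements) showing how a "dense but slow" language and a "less dense but fast" language can land at roughly the same information rate:

```python
# Illustrative numbers only (assumed for this sketch, not measured data):
# bits_per_syllable = information density, syllables_per_sec = speech rate.
languages = {
    "English": {"bits_per_syllable": 7.0, "syllables_per_sec": 6.2},
    "Spanish": {"bits_per_syllable": 5.0, "syllables_per_sec": 7.8},
}

for name, stats in languages.items():
    info_rate = stats["bits_per_syllable"] * stats["syllables_per_sec"]
    print(f"{name}: ~{info_rate:.0f} bits/second")

# Both end up in the same ballpark: a compact language can afford a slower
# speech rate, a less compact one compensates by speaking faster.
```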
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English, e.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
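As a toy illustration (entirely made-up vocabulary and a greedy longest-match splitter, not any real BPE merge table), here's how a compound like 'Schadenfreude' gets covered by subword pieces that are shared with other words:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation into known subword pieces."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to a single character
            i += 1
    return pieces

# Toy vocabulary: 'freude' is shared across several German compounds.
vocab = {"schaden", "freude", "vor", "lebens"}

for w in ["Schadenfreude", "Vorfreude", "Lebensfreude"]:
    print(w, "->", greedy_tokenize(w, vocab))
# Schadenfreude -> ['schaden', 'freude']
# Vorfreude -> ['vor', 'freude']
# Lebensfreude -> ['lebens', 'freude']
```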
Of course most Indo-European languages have declensions or at least conjugation. That includes English, even if its system is very simple.
CJK languages do not really have that; they don't even have conjugation. At best they have simple suffixes, e.g. to mark a verb as interrogative.
Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul jamo (the individual letter components) doesn't really make sense if you're not going to tokenize by decomposed letter in Romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
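For what it's worth, the analogy is easy to see with plain Unicode decomposition; here's a minimal sketch using Python's standard unicodedata module ('hôtel' and '한국' are just example strings):

```python
import unicodedata

# NFD splits precomposed characters into their parts:
# Hangul syllable blocks into jamo, accented Latin letters into base + combining mark.
for word in ["hôtel", "한국"]:
    decomposed = unicodedata.normalize("NFD", word)
    print(word, "->", [unicodedata.name(ch) for ch in decomposed])
# 'ô' becomes LATIN SMALL LETTER O + COMBINING CIRCUMFLEX ACCENT,
# '한' becomes HANGUL CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN, etc.
```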
I don't know what you mean by compiler terms, but basically: worse tokenizer = worse LM performance. A worse tokenizer means more tokens per sentence, so on average it takes more FLOPs to train on each sentence. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
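A rough way to see the "tokens per sentence" gap yourself (this assumes the tiktoken package is installed; cl100k_base is just one example encoding, and the sentences are arbitrary examples):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "The weather is nice today.",
    "Korean": "오늘은 날씨가 좋습니다.",
}

for lang, text in sentences.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens for {len(text)} characters")

# More tokens for the same content means more FLOPs per sentence,
# so a fixed compute budget covers fewer sentences in that language.
```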
Companies have been hand-wringing about the tech labor shortage for the last 10 years. People went to school and got degrees in a job sector they thought would be pretty safe. Supply/demand.
For GPT4: "Pricing is $0.03 per 1,000 “prompt” tokens (about 750 words) and $0.06 per 1,000 “completion” tokens (again, about 750 words)."
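A quick back-of-the-envelope sketch of what that adds up to at scale (the per-token prices are the ones quoted above; the request and token counts are made-up workload assumptions):

```python
# Prices quoted above: $0.03 / 1K prompt tokens, $0.06 / 1K completion tokens.
PROMPT_PRICE = 0.03 / 1000
COMPLETION_PRICE = 0.06 / 1000

# Hypothetical workload (assumed numbers, purely illustrative):
requests = 10_000
prompt_tokens = 1_500      # per request
completion_tokens = 500    # per request

cost = requests * (prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE)
print(f"${cost:,.2f}")  # $750.00 for this made-up workload
```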
Meanwhile, there are off-the-shelf models that you can train very efficiently, on relevant data, privately, and you can run them on your own infrastructure.
Yes, GPT4 is probably great at all the benchmark tasks, but models have been great at all the open benchmark tasks for a long time. That's why they have to keep making harder tasks.
Depending on what you actually want to do with LMs, GPT4 might lose to a BERTish model in a cost-benefit analysis--especially given that, in my experience, the hard part of ML is still getting data/QA/infrastructure aligned with whatever it is you want to do with the ML. (At least at larger companies; maybe it's different at startups.)
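For context, "run it on your own infrastructure" can be as simple as the sketch below (assumes the Hugging Face transformers package; the distilbert-base-uncased-finetuned-sst-2-english checkpoint is just an example choice, and in practice you'd swap in a model fine-tuned on your own data):

```python
# Minimal local inference with an off-the-shelf encoder model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The onboarding flow was confusing and slow."))
# [{'label': 'NEGATIVE', 'score': ...}]
```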