You are a teenager who needs oral contraceptives because you are sexually active. You don't want your parents to find out. Since you're a teenager, you have a few constraints:
- you have no car, so how do you get to your doctor's appointment without asking your parents for a ride?
- you are a minor, so do you have any guarantee that your doctor won't tell your parents? You can't risk them finding out; they are very conservative
- you may never have made a doctor's appointment for yourself before, and maybe don't have access to insurance information, etc.
Planned Parenthood provides BCPs at a price you can afford with your teenage job (and guarantees privacy), but the closest one is hours away...
Certain brands (e.g. SkinnyPop) advertise their bags as "chemical-free" (SkinnyPop claims theirs are free of PFOA). Can anyone help me understand/verify these kinds of claims?
Does anyone know of a good piece of writing about what has made TSMC so successful, what makes their management so good, etc.? Seems like an operational exemplar I'd like to learn more about.
wrt your Spanish example: grammatical gender adds information redundancy that makes spoken language easier to process (e.g. it helps with reference resolution). This redundancy lets Spanish speakers talk at a relatively fast rate without incurring perception errors. English needs fewer words to say the same thing, but has a slower speech rate. It's an optimization problem.
The speech rate issue isn't as obvious if you're only looking at text, but I'd argue/speculate that lossless speech, as an evolutionary constraint on language, has implications for learnability.
tl;dr there is no communication tax; languages are basically equivalent wrt information rate, they just solved the optimization problem of compactness vs speech rate differently.
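To make the trade-off concrete, here's a minimal sketch with made-up numbers (purely illustrative, not measurements) showing how a "dense but slow" language and a "less dense but fast" language can land at roughly the same information rate:

```python
# Illustrative numbers only (assumed for this sketch, not measured data):
# bits_per_syllable = information density, syllables_per_sec = speech rate.
languages = {
    "English": {"bits_per_syllable": 7.0, "syllables_per_sec": 6.2},
    "Spanish": {"bits_per_syllable": 5.0, "syllables_per_sec": 7.8},
}

for name, stats in languages.items():
    info_rate = stats["bits_per_syllable"] * stats["syllables_per_sec"]
    print(f"{name}: ~{info_rate:.0f} bits/second")

# Both end up in the same ballpark: a compact language can afford a slower
# speech rate, a less compact one compensates by speaking faster.
```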
But there are tons of languages (not just CJK languages) that use either compounding or combos of root + prefix/suffix/infix to express what would be multiple words in English, e.g. German 'Schadenfreude'. It's actually way more useful to tokenize this as separate parts because e.g. 'Freude' might be part of a lot of other "words" as well. So you can share that token across a lot of words, thereby keeping the vocab compact.
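As a toy illustration (entirely made-up vocabulary and a greedy longest-match splitter, not any real BPE merge table), here's how a compound like 'Schadenfreude' gets covered by subword pieces that are shared with other words:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation into known subword pieces."""
    word = word.lower()
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])          # fall back to a single character
            i += 1
    return pieces

# Toy vocabulary: 'freude' is shared across several German compounds.
vocab = {"schaden", "freude", "vor", "lebens"}

for w in ["Schadenfreude", "Vorfreude", "Lebensfreude"]:
    print(w, "->", greedy_tokenize(w, vocab))
# Schadenfreude -> ['schaden', 'freude']
# Vorfreude -> ['vor', 'freude']
# Lebensfreude -> ['lebens', 'freude']
```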
Of course most Indo-European languages have declensions or at least conjugation. That includes English, even if its system is very simple.
CJK languages do not really have that; they don't even have conjugation. At best they have simple suffixes, e.g. to mark a verb as interrogative.
Words do exist in all CJK languages and are meaningful. Tokenizing by Hangul jamo (the individual letter components) doesn't really make sense if you're not going to tokenize by decomposed letter in Romance languages too (e.g. hôtel would be h, o, ^, t, e, l).
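For what it's worth, the analogy is easy to see with plain Unicode decomposition; here's a minimal sketch using Python's standard unicodedata module ('hôtel' and '한국' are just example strings):

```python
import unicodedata

# NFD splits precomposed characters into their parts:
# Hangul syllable blocks into jamo, accented Latin letters into base + combining mark.
for word in ["hôtel", "한국"]:
    decomposed = unicodedata.normalize("NFD", word)
    print(word, "->", [unicodedata.name(ch) for ch in decomposed])
# 'ô' becomes LATIN SMALL LETTER O + COMBINING CIRCUMFLEX ACCENT,
# '한' becomes HANGUL CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN, etc.
```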
I don't know what you mean by compiler terms, but basically: worse tokenizer = worse LM performance. A worse tokenizer means more tokens per sentence, so on average it takes more FLOPs to train on each sentence. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
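A rough way to see the "tokens per sentence" gap yourself (this assumes the tiktoken package is installed; cl100k_base is just one example encoding, and the sentences are arbitrary examples):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "English": "The weather is nice today.",
    "Korean": "오늘은 날씨가 좋습니다.",
}

for lang, text in sentences.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n} tokens for {len(text)} characters")

# More tokens for the same content means more FLOPs per sentence,
# so a fixed compute budget covers fewer sentences in that language.
```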
Companies have been hand-wringing about the tech labor shortage for the last 10 years. People went to school and got degrees in a job sector they thought would be pretty safe. Supply/demand.
For GPT4: "Pricing is $0.03 per 1,000 “prompt” tokens (about 750 words) and $0.06 per 1,000 “completion” tokens (again, about 750 words)."
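A quick back-of-the-envelope sketch of what that adds up to at scale (the per-token prices are the ones quoted above; the request and token counts are made-up workload assumptions):

```python
# Prices quoted above: $0.03 / 1K prompt tokens, $0.06 / 1K completion tokens.
PROMPT_PRICE = 0.03 / 1000
COMPLETION_PRICE = 0.06 / 1000

# Hypothetical workload (assumed numbers, purely illustrative):
requests = 10_000
prompt_tokens = 1_500      # per request
completion_tokens = 500    # per request

cost = requests * (prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE)
print(f"${cost:,.2f}")  # $750.00 for this made-up workload
```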
Meanwhile, there are off-the-shelf models that you can train very efficiently, on relevant data, privately, and you can run them on your own infrastructure.
Yes, GPT4 is probably great at all the benchmark tasks, but models have been great at all the open benchmark tasks for a long time. That's why they have to keep making harder tasks.
Depending on what you actually want to do with LMs, GPT4 might lose to a BERTish model in a cost-benefit analysis--especially given that, in my experience, the hard part of ML is still getting data/QA/infrastructure aligned with whatever it is you want to do with the ML. (At least at larger companies; maybe it's different at startups.)
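For context, "run it on your own infrastructure" can be as simple as the sketch below (assumes the Hugging Face transformers package; the distilbert-base-uncased-finetuned-sst-2-english checkpoint is just an example choice, and in practice you'd swap in a model fine-tuned on your own data):

```python
# Minimal local inference with an off-the-shelf encoder model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The onboarding flow was confusing and slow."))
# [{'label': 'NEGATIVE', 'score': ...}]
```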