Yeah, reminds me of the old OkCupid data-analysis blogs (and not the creepy one by sleep8). The group I'm surprised not to see represented in their analysis is "personal": people I know use ChatGPT as a therapist/life coach/SMS analyzer-and-editor. And of course they crucially, but understandably, left off the denominator. 35% of a million requests is different from 35% of a billion. Also relevant: how many of the conversations had 1 message, indicating "just testing", vs. 10 or 100 messages.
True! Consistency and representativeness matter, in soup samples as in social samples!
Is the soup smooth or lumpy? Striated or uniform? For that matter a soup could (and often does) involve huge soup bones that give it important parts of its flavor, but never show up directly in a spoonful. And you might need something different from a spoon to convincingly rule out some specific rare lumpy ingredient.
The didactic value of sampling the soup pot goes well beyond its basic function: correcting the beginner's misperception that a sample's statistical power is directly related to population size :)
35% of a million students in the USA is very different to 35% of a billion students across the USA, Europe and Africa.
Since there aren't a billion students in the USA, 35% of them is an impossibility.
If you scale your population above some recognized boundary you aren't sampling in the same space any more. After all, the local star density within 1 AU of us tends very strongly toward 1. That's not indicative of the actual star density in the Milky Way.
What do you mean by “statistically”? The end results would be like three orders of magnitude apart. Wouldn’t the desired sample size depend on the size of the population itself?
>Wouldn’t the desired sample size depend on the size of the population itself?
No. The most important thing is the distribution of the sample. You have to make sure it isn't obviously biased in some way (e.g. surveying only students at one university and extrapolating to the entire population of the country). Beyond that, the required sample size levels off quickly.
A sample of 5,000 (assuming the same distribution) won't be any more or less accurate for a population of 10M than it is for 1M.
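You can see this directly in the standard margin-of-error formula. A minimal sketch (the `margin_of_error` helper is my own, assuming a simple random sample and the usual finite population correction for a proportion):

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate 95% margin of error for a sampled proportion.

    n: sample size, N: population size, p: assumed true proportion
    (0.5 is the worst case), z: critical value for 95% confidence.
    """
    se = z * math.sqrt(p * (1 - p) / n)
    # Finite population correction: only matters when n is a
    # sizable fraction of N, which it isn't here.
    fpc = math.sqrt((N - n) / (N - 1))
    return se * fpc

# The same 5,000-person sample against two very different populations:
print(margin_of_error(5000, 1_000_000))   # ~0.0138, i.e. ±1.4 points
print(margin_of_error(5000, 10_000_000))  # ~0.0139, essentially identical
```

The population size only enters through the correction factor, which is already ~0.998 at N = 1M, so growing N tenfold changes the answer in the fourth decimal place.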
Of course, if you just ask everyone (or almost everyone) then you no longer need to worry about the distribution, but yeah.