Computer scientists invent an efficient new way to count (quantamagazine.org)
993 points by jasondavies 8 months ago | 287 comments



I was involved with implementing the DNF volume counting version of this with the authors. You can see my blog post about it here:

https://www.msoos.org/2023/09/pepin-our-probabilistic-approx...

And the code here: https://github.com/meelgroup/pepin

Often, 30% of the time is spent in the I/O of reading the file; that's how incredibly fast this algorithm is. Crazy stuff.

BTW, Knuth contributed to the algo; Knuth's notes: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

He actually took time off (a whole month) from TAOCP to do this. Also, he is exactly as crazy good as you'd imagine. Just mind-blowing.


That’s really interesting and thanks for sharing.

I am very curious about the extraordinarily gifted. What made you think Knuth is crazy good? Was there a particular moment? Was it how fast he grokked ideas? Was it his ability to ELI5?


What made me realize it is that I saw some snippets of emails he wrote to a colleague. It was... insane. You could see his mind race. He recognized patterns in minutes that would take me days, if not weeks, to recognize. Also, he actually writes and runs code, overnight if need be. It was a bit of a shock to me. He's not in an ivory tower. He's very much hands-on, and when he's behind the wheel, you're in for a ride.


> He recognized patterns in minutes that would take me days, if not weeks, to recognize... he actually writes and runs code, overnight if need be

70-80 years of actually being hands-on and I bet you'd be pretty quick too. The dude is definitely naturally "gifted", but it seems pretty obvious that being hands-on has a lot to do with it.


Disagree, there are thousands of highly experienced and hard-working computer scientists. If we grant that very few of them are the equivalent of Knuth, there must be something else at play.


Most computer scientists I know at that age don't touch a computer any more and hang out with their grandkids. That's not a value judgement - Knuth is impressive, but as human beings most people choose their humanity over their careers in some way. Beyond being simply smart and productive, Knuth is also likely obsessive about his work, and his life is warped around it. As long as that works for everyone, that's great. But most people don't live that life.


> Most computer scientists I know at that age don’t touch a computer any more and hang with grand kids.

That doesn't sound right to me at all. Modern academia is highly competitive, and career academics typically have long working hours.

If they aren't doing programming, that's likely because it isn't relevant to their job. A theoretical computer scientist is closer to a mathematician than to a typical programmer.

> Knuth is impressive but as a human being most people choose their humanity over their careers in some way

We're talking about scientists, not most people.

Getting a PhD is no cakewalk, and there are far more PhDs than faculty positions. You can try being a workaholic, but if your competitors are doing the same, that won't make you stand out.

> obsessive about his work and his life is warped around it

Again this describes every modern scientist. Deep knowledge of one's field, and deep commitment to it, are just table stakes.


This doesn't describe post-tenure academia. You seem to be describing the life of a young tenure-track academic.

Additionally, while Knuth is clearly an outlier by any measure, he's also an outlier in his celebrity. There are a lot of Knuths out there who aren't well known outside their specialty, or who are in industry. He played a seminal role in a field everyone studies in computer science and published a uniquely interesting and continuously revised set of fundamental books in the field. However, in my time in academia there were people in, say, transactional memory for speculative out-of-order compute whose work powers every machine in use today, and they still contribute similarly powerful work. They're obsessive and very driven by the problem space. But for every one of those in academia there are a hundred tenured professors who paper-mill their undergrads (generously).

You mention long hours but I said obsessive. That’s orders of magnitude more than working hard. It’s so distorted as to be pathological if they weren’t paid and rewarded for it. Yes many academics are pathologically obsessive. But unless they are bringing in funding or repute to fill a deficit in the department, there’s no work for them in current academic settings.

Finally, Knuth isn't a common occurrence because -he doesn't bring in money-. Modern academia is oriented towards grant milking. The example of the txn memory guy is interesting because he brings in lots of research funding from Intel and ARM and Nvidia because his work is very commercial. Knuth - not so much, I imagine. He brings repute, but you can only find so much repute with modern academic funding models before they're a net negative on the department. Knuth is a fossil of a different era in academics (not used as a pejorative).


> That doesn't sound right to me at all. Modern academia is highly competitive, and career academics typically have long working hours. > If they aren't doing programming, that's likely because it isn't relevant to their job. A theoretical computer scientist is closer to a mathematician than to a typical programmer.

This kind of stuff is not useful to be posting where impressionable people (young students) can read it. The truth is that the majority of academics are managers and delegate hands-on work to postdocs and PhD students. I finished a PhD just last month, and in 4 years I never saw anyone on my committee so much as look at code, let alone write it (and I was not a theory student). Almost everyone in my cohort would echo this observation.


Experience and age have diminishing returns.

Biden's been hands-on in his domain for over 50 years, yet "quick" is definitely not the word that comes to most people's mind when they think of him nowadays.


Actually quick is definitely something that comes to mind. Quick in politics is of course relative, but the speed with which he has enacted major changes (for example marijuana legalization) is pretty quick in the realm of politics when congress is of the other party.


He’s barely able to read a teleprompter, not too confident that Biden himself enacted those changes.


We've had nearly 4 years with no scandals and emerged from the pandemic with the best economic recovery of any country, and despite having no margin to spare in Congress, master legislator Joe Biden has secured massive climate change, infrastructure, and gun control bills, not to mention he's ended our two decade war in Afghanistan and overseen the fastest wage growth of the two lowest income quintiles seen in modern history.

And every time people actually watch him speak (not just a selected clip), there's weeks of coverage about how alive JB seems, not recognizing that all evidence points to that being typical.


> We've had nearly 4 years with no scandals

This isn't the flex you think it is. When the media is lapping out of your hand like a 6-week-old puppy instead of doing their fourth estate job, of course there are no scandals.

> and gun control bills,

You mean stripping Americans of their constitutional rights.

> not to mention he's ended our two decade war in Afghanistan.

Which was an unmitigated disaster.

> and overseen the fastest wage growth of the two lowest income quintiles seen in modern history.

Hello inflation.


>> and gun control bills

> You mean stripping Americans of their constitutional rights.

Not that I disagree with you, but when posters like modriano engage in political/partisan commentary on HN, I find it more productive to merely downvote and flag their comments rather than replying and getting engaged in a war.

A dead post makes quite the impression, as this sort of political commentary just generally defeats the quality of discourse on HN (which, you must admit, is much better than many other platforms, and I'd like to try to preserve it as long as possible).


> Biden's been hands-on in his domain for over 50 years, yet "quick" is definitely not the word that comes to most people's mind when they think of him nowadays.

Please don't post flamebaity political tangents on HN.

> Eschew flamebait. Avoid generic tangents.

https://news.ycombinator.com/newsguidelines.html


It’s a counterpoint anyone can identify with. One can interpret it uncharitably as flame bait if one wants to, but it need not be. It could have been Reagan in his second term, but some may not know who he was. Or Lee Smolin.


This is objectively false. It is not a counterpoint, because it's not an argument. It's an extremely subjective claim that is highly contentious (like jedberg's sibling comment[1]), definitely not something that "anyone can identify with" (as the vast majority of people do not know Joe Biden and instead view him through one of a small number of extremely skewed lenses) and clearly in the realm of "off-topic flamebait" that is not appropriate on HN.

[1] https://news.ycombinator.com/item?id=40394477


I'd avoid the Reagan or other comparisons as well, even where there is medical evidence of decline, as you see with Reagan. In this specific case of Biden there's not even that, so it's purely a political opinion, and it's definitely bait for flames even if not intended as such.


> Also, he actually writes and runs code, overnight if need be...He's not in an ivory tower. He's very much hands on, and when he's behind the wheel, you're in for a ride.

That's such an admirable thing. Something to aspire to, given his age. I wonder if one can say the same thing of all the 'thought leaders' of the industry who go around pontificating about code hygiene and tidiness.


I feel extremely jealous of you.


You are envious of him.

Jealousy is when you possess something you don't want taken away by someone else.


Well, that use of jealousy is only really common in certain romantic situations. Like if some super good looking dude hits on your girl and she responds in an ambiguously-flirty way, you might definitely be said to be jealous, even though she didn't run off with him.

In most other domains, though, like this one, jealousy and envy are synonyms. https://www.merriam-webster.com/dictionary/jealousy#did-you-...


I read that link as supporting the distinction, not refuting it: "It is difficult to make the case, based on the evidence of usage that we have, [that they are] exact synonyms [or] totally different words."

An envious person would be happier having something that someone else also has. A jealous person feels threatened that someone else has or wants something. This distinction applies in romantic and non-romantic contexts. Both emotions can arise, for example, when someone observes someone else wanting something neither of them has, or when the thing wanted is inherently exclusive.

I don't lose a lot of sleep worrying about others' use or misuse of these words. But I do think it's essential for people to understand which of the two emotions they're having, because the solution really depends on that. Would I be happy if I destroyed that other kid's toy? (That's jealousy.) Would I be happy if my dad showed up and gave me a similar toy, so that the other kid and I could play together? (That's envy.)


This looks dumb... very dumb. Am I missing something? This is not counting, it's just sampling, AND if you want to actually count all the distinct words, the memory used doesn't change compared to just counting.


Maybe you'd know, but why would one choose to not sort favoring larger counts and drop the bottom half when full? It may be obvious to others, but I'd be curious.


The guarantees would not hold, I'm pretty sure ;) Maybe one of the authors could chip in, but my hunch is that with that you could actually introduce arbitrarily large errors. The beauty of this algorithm really is its simplicity. Of course, simple is.. not always easy. This absolute masterpiece by Knuth should demonstrate this quite well:

https://www.sciencedirect.com/science/article/pii/0022000078...

It's an absolutely trivial algorithm. Its average-case analysis is ridiculously hard. Hence why I think this whole Ordo obsession needs to be refined -- worst-case complexity often has little to do with real-world behavior.


Worst case complexity matters when the input data can be manipulated by someone malicious, who can then intentionally engineer the degenerate worst case to happen - as we have seen historically in e.g. denial of service attacks exploiting common hash table implementations with bad worst case complexity.


No, you're throwing away a random 50/50 selection. You would have to flood the algorithm with uniques or commons to force the algorithm into a known probability state.


You want every distinct item to have the same chance at the end. So when items repeat you need to reduce (not increase) the odds of keeping any given occurrence.


does that mean you could also split the set in half multiple times then run it on each half of a half (etc) and combine it with its other half?

that would seem simpler to me.

edit: oh but then you would need to keep the results which defeats the purpose


You would need to assume a uniform distribution of items, which I don’t think this does


Let's prove it by contradiction: let's say you pick the larger ones and drop the smaller ones every single round. You have lost the probabilistic guarantee of 1/2^k that the authors show, because the most frequent words will be the most frequent in subsequent rounds as well. This is the intuition; the math might be more illuminating.


What are the main applications of this?


So now we have you to blame for a delay on the release of his next book. :)


Not me, the authors -- I'm a fly on the wall compared to them ;) This is some serious work, I just did a fast implementation. Re implementation -- it turns out that there are parts of some standard libraries that this problem pushes to their limits, and that we had to work around during implementation. So there were still some cool challenges involved. I was also pretty happy about the late binding/lazy evaluation thing I came up with. Of course Knuth just did it (check his notes), without even thinking about it :D What is an achievement for me is a lazy Monday coffee for him, but oh well!


I agree with zero_k on everything he said about Knuth and strongly disagree with his own (extremely modest) characterization of himself.


I am pretty confident that this was not the only problem/algorithm that "distracted" Knuth over the years. You can see that whenever he encounters interesting issues he has no problem pausing work on TAOCP to pursue other goals.


This algorithm seems to resemble HyperLogLog (and all its variants), which is also cited in the research paper. It uses the same insight -- that there is estimation value in tracking whether we've hit a "run" of heads or tails -- but flips the idea on its head (heh), leading to the simpler algorithm described, which discards memorized values on the basis of runs of heads/tails.

This also works especially well (that is, efficiently) in the streaming case, allowing you to keep something resembling a "counter" for the distinct elements, albeit with an error rate.

The benefit of HyperLogLog is that it behaves similarly to a hash set in some respects -- you can add items, count distinct them, and, importantly, merge two HLLs together (union), all the while keeping memory fixed to mere kilobytes even for billion-item sets. In distributed data stores, this is the trick behind Elasticsearch/OpenSearch cardinality agg, as well as behind Redis/Redict with its PFADD/PFMERGE/PFCOUNT.

I am not exactly sure how this CVM algorithm compares to HLL, but they got Knuth to review it, and they claim an undergrad can implement it easily, so it must be pretty good!


It’s also possible to use HLL to estimate the cardinality of joins since it’s possible to estimate both the union and the intersection of two HLLs.

http://oertl.github.io/hyperloglog-sketch-estimation-paper/
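
The inclusion-exclusion step is tiny in code. A sketch, where `hll_count` and `hll_union` are hypothetical stand-ins for whatever HLL implementation you use (e.g. the machinery behind PFCOUNT/PFMERGE):

    def estimate_intersection(hll_a, hll_b, hll_count, hll_union):
        card_a = hll_count(hll_a)                          # estimate of |A|
        card_b = hll_count(hll_b)                          # estimate of |B|
        card_union = hll_count(hll_union(hll_a, hll_b))    # estimate of |A u B|
        # |A n B| = |A| + |B| - |A u B|; the errors of the three estimates
        # compound, which is why this gets unreliable when the two sets have
        # very different sizes.
        return card_a + card_b - card_union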


It's a really interesting open problem to get the cost of these down so that they can be used to heuristically select the variable order for worst case optimal joins during evaluation.

It's somewhere on the back of my todo list, and I have the hunch that it would enable instance optimal join algorithms.

I've dubbed these the Atreides Family of Joins:

  - Jessica's Join: The cost of each variable is based on the smallest number of rows that might be proposed for that variable by each joined relation.
  - Paul's Join: The cost of each variable is based on the smallest number of distinct values that will actually be proposed for that variable from each joined relation.
  - Leto's Join: The cost of each variable is based on the actual size of the intersection.
In a sense each of the variants can look further into the future.

I'm using the first and the second in a triplestore I built in Rust [1] and it's a lot faster than Oxigraph. But I suspect that the constant factors would make the third infeasible (yet).

1: https://github.com/triblespace/tribles-rust/blob/master/src/...


Having read something vaguely related recently [0] I believe "Lookahead Information Passing" is the common term for this general idea. That paper discusses the use of bloom filters (not HLL) in the context of typical binary join trees.

> Letos join

God-Emperor Join has a nice ring to it.

[0] "Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis" - https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf


Thanks for the interesting paper!

  We now formally define our _God-Emperor Join_ henceforth denoted join_ge...
Nice work with TXDB btw, it's funny how much impact Clojure, Datomic and Datascript had outside their own ecosystem!

Let me return the favour with an interesting paper [1] that should be especially relevant to the columnar data layout of TXDB. I'm currently building a succinct on-disk format with it [2], but you might be able to simply add some auxiliary structures to your arrow columns instead.

1: https://aidanhogan.com/docs/ring-graph-wco.pdf

2: https://github.com/triblespace/tribles-rust/blob/archive/src...


> Nice work with TXDB btw

It's X.T. (as in 'Cross-Time' / https://xtdb.com), but thank you! :)

> 1: https://aidanhogan.com/docs/ring-graph-wco.pdf

Oh nice, I recall skimming this team's precursor paper "Worst-Case Optimal Graph Joins in Almost No Space" (2021) - seems like they've done a lot more work since though, so definitely looking forward to reading it:

> The conference version presented the ring in terms of the Burrows–Wheeler transform. We present a new formulation of the ring in terms of stable sorting on column databases, which we hope will be more accessible to a broader audience not familiar with text indexing


My apologies! It's even more embarrassing given the fact that I looked up the website, knowing that I _always_ swap them after having written too many `tx-listen` in my career.

They expanded their work to wider relations and made the whole framework a lot more penetrable. I think they over-complicate things a bit with their forward-extension, so I'm keeping every column twice (still much better than all permutations), which in turn allows for ad-hoc cardinality estimation.

Also, the 1-based indexing with inclusive ranges is really not doing them any favours. Most formulas become much more streamlined and simpler with 0-based indexing and exclusive ranges. (see `base_range` and `restrict_range` in [0])

0: https://github.com/triblespace/tribles-rust/blob/e3ad6f21cdc...


Iirc intersection requires the HLLs to have similar cardinality, otherwise the result is way off.


You could merge these data structures as well. If the two instances to be merged are not at the same "round", take the one that's at an earlier round and advance it (by discarding half the entries at random) by the difference in rounds. Then just insert the values from one list to the other, ignoring duplicates; if the result is too large, discard half at random and increment the round number.

I implemented exactly this algorithm at my previous employer, except that alongside each value, we stored an estimate of the number of times that value appeared. This allowed us to generate an approximate list of the most common values and estimated count for each value.
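
Roughly, in code, treating each instance as a (round, set) pair (a sketch of the merge as described; see the reply below for why this simple union tends to overestimate):

    import random

    def merge(round_a, set_a, round_b, set_b, thresh):
        # Advance the earlier-round instance to the later round by discarding
        # half of its entries at random, once per round of difference.
        if round_a < round_b:
            round_a, set_a, round_b, set_b = round_b, set_b, round_a, set_a
        while round_b < round_a:
            set_b = {x for x in set_b if random.random() < 0.5}
            round_b += 1
        merged = set_a | set_b                  # duplicates collapse in the union
        if len(merged) >= thresh:               # too large: halve and bump the round
            merged = {x for x in merged if random.random() < 0.5}
            round_a += 1
        return round_a, merged                  # estimate: len(merged) * 2**round_a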


Merging like that doesn't work -- it will tend to overestimate the number of distinct elements.

This is fairly easy to see, if you consider a stream with some N distinct elements, with the same elements in both the first and second halves of the stream. Then, supposing that p is 0.5, the first instance will result in a set with about N/2 of the elements, and the second instance will also. But they won't be the same set; on average their overlap will be about N/4. So when you combine them, you will have about 3N/4 elements in the resulting set, but with p still 0.5, so you will estimate 3N/2 instead of N for the final answer.

I have a thought about how to fix this, but the error bounds end up very large, so I don't know that it's viable.


Was there, reviewed the PR, can confirm. Hi Steve!

Since then we've also tuned it up in a couple ways, in particular adding "skip" logic similar to fast reservoir sampling to trade some accuracy for the ability to not even look at the next N {M,G,T}B if you've already seen many many many matches. For non-selective searches over PB of data it's a good tradeoff, despite introducing some search-order bias.


Just curious, dusting off my distant school memories :)

How do the HLL and CVM that I hear about relate to reservoir sampling which I remember learning?

I once had a job at a hospital (back when 'whiz kids' were being hired by pretty much every business) where I used reservoir sampling to make small subsets of records that were stored on DAT tapes.


I guess there is a connection in the sense that with reservoir sampling, each sample observed has an equal chance of remaining when you're done. However, if you have duplicates in your samples, traditional algorithms for reservoir sampling do not do anything special with duplicates. So you can end up with duplicates in your output with some probability.
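
For reference, the classic version (Algorithm R) looks like this; note that nothing in it deduplicates:

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k items from a stream of unknown length.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Item i survives with probability k / (i + 1); if it does,
                # it evicts a uniformly chosen current occupant.
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir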

I guess maybe it's more interesting to look at the other way. How is the set of samples you're left with at the end of CVM related to the set of samples you get with reservoir sampling?


Was wondering the same, thanks for an answer.


Knuth's presentation of this [1] seems very very similar to the heap-version (top-k on a uniform deviate) of reservoir sampling as mentioned in [2]. The difference is in how duplicates are handled. I wouldn't be surprised if this algorithm was in fact already in use somewhere!

[1] https://cs.stanford.edu/~knuth/papers/cvm-note.pdf [2] https://florian.github.io/reservoir-sampling/

Edit: Another commenter [3] brought up the BJKST algorithm which seems to be similar procedure except using a suitably "uniform" hash function (pairwise independence) as the deviate instead of a random number.

[3] https://news.ycombinator.com/item?id=40389178


I found the paper took about as long to read as the blog post and is more informative:

https://arxiv.org/pdf/2301.10191

It is about estimating the cardinality of a set of elements derived from a stream. The algorithm is so simple, you can code it and play with it whilst you read the paper.
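
Here is a minimal sketch of what I mean (close to Algorithm 1 in the paper, though it leaves out the paper's failure case where a halving round frees no space):

    import random

    def estimate_distinct(stream, thresh):
        # Estimate the number of distinct items using at most `thresh` memory.
        p = 1.0
        kept = set()
        for x in stream:
            kept.discard(x)              # forget any earlier occurrence of x
            if random.random() < p:      # keep this occurrence with probability p
                kept.add(x)
            if len(kept) == thresh:      # memory full: halve the sample
                kept = {y for y in kept if random.random() < 0.5}
                p /= 2
        return len(kept) / p             # each kept item stands for ~1/p distinct items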

The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.


If you refer to the subtitle of the paper - An Algorithm for the (Text) Book - I think that is actually a reference to something Paul Erdős allegedly said about how some proofs are so elegant in their simplicity and beauty that they are "from The Book", as if representing some divine Platonic ideal.

Given that Knuth himself reviewed it, he might have remarked that this was one of those algorithms! Perhaps the authors decided to include it in the title as a not-so-humble brag (which would be well-earned if that's the case!)

edit: originally this comment said Knuth was the one who said this about some algorithms being from The Book, but that was my faulty memory.


I liked this part. They got Knuth to review it, and found mistakes. That's kind of cool, in its own way.

    We are deeply grateful to Donald E. Knuth for his thorough review, 
    which not only enhanced the quality of this paper (including fixing
    several errors) but has also inspired us for higher standards.


From the abstract: "... All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory ...."


This is really twisting it; I don't find pairwise independence to be more advanced than applying a Chernoff bound. In this problem it'd mostly be the difference between using a Chebyshev-type bound or a Chernoff bound.

Pairwise independence is there to give the algorithm stronger guarantees and let it work with a simple hash function like ax+b; otherwise most existing algorithms can probably be simplified in the same way. The most similar algorithm is BJKST, which is almost identical except for specifying the sampling mechanism to require less randomness.

To someone who worked on this type of thing before, it's like seeing people familiar with LLMs say "oh yet another X-billion parameter model on github doing more or less the same".


The Chernoff bound needed in this work can be derived from the Binomial distribution (with Stirling's approximation).

I have worked on pairwise independent hash functions for a decade and every time I introduce such a function, it feels like magic. The notion of pairwise independence is easy to explain but the notion of pairwise independent hash functions isn't.

The other strength of our work is that it can work for general settings of sets for which pairwise independent hash functions are not known. Please see: https://dl.acm.org/doi/10.1145/3452021.3458333


The point is that the subtitle is pretty clearly intended to have a dual meaning; it wouldn't be phrased like that otherwise.


I thought The Book was an Erdos thing. I wonder who used it first.


"During a lecture in 1985, Erdős said, `You don't have to believe in God, but you should believe in The Book.`"

https://en.wikipedia.org/wiki/Proofs_from_THE_BOOK


I think you're right, I must have confused the two. I'll edit my comment to reduce the spread of misinformation.


The blog post was more than half padding. Good that the algorithm is so simple it's hard to write a full length blog post about it!


And yet the blog post still got it wrong:

> Now you move forward with what the team calls Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

It's not just removals you test with N coin flips in Round N, it's whether to insert the new item at all.


I originally used Project Gutenberg to get Hamlet and coded the Quanta method in Python, and it did not work. I then moved to Algorithm 1 in the paper and got Copilot to (mis)convert it to Python, and then spent time getting Copilot to admit its mistakes. The resultant code seemed to work, but I found the Quanta-suggested data of the words of Hamlet uninspiring, as the calculated theta (max set size before halving) was often anywhere from ~50% of the total number of words in Hamlet to more than the number of words in Hamlet. I've yet to investigate theta in more depth...


Yeah, I noticed the same thing. Quanta's version of the algorithm is not only confusing, it's also wrong.

I think the pseudocode in the paper is very hard to beat.


I agree the paper is better than the blog post, although one criticism I have of the CVM paper is that it has a termination/algorithm-exit condition, instead of what Knuth's CVM notes (referenced elsewhere in this thread) do, which is just a loop to ensure the halving step actually frees space in the reservoir. It seems more work to explain the https://en.wikipedia.org/wiki/Up_tack than just do the loop. [1]

[1] https://news.ycombinator.com/item?id=40388878


You are indeed right; the while loop has the added advantage of making the estimator unbiased -- i.e., not only strong (epsilon, delta) guarantees but also an expectation equal to the correct value.

It wasn't easy to see that the loop would have added benefit -- that's where Knuth's ingenuity comes in.


On that note, I'm also unfamiliar with this \ operator notation which is used without explanation.

    X ← X \ {ai}


That is conventional set subtraction notation. "Assign to X the same set minus all elements of the set {a_i}".

One example source, but it is pretty common in general: http://www.mathwords.com/s/set_subtraction.htm


Set difference.

Set X becomes X without element ai. This is the case whether or not ai was in the set X before the step was taken.
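
In Python terms, for illustration:

    X = {"to", "be", "or", "not"}
    X = X - {"be"}        # X \ {a_i}: now {"to", "or", "not"}
    X = X - {"question"}  # subtracting an element that isn't there is a no-op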


I've known that symbol for decades, never knew its name - up-tack it is. Ta!


It's been a while, and maybe my brain has smoothened since my time in CS, but man this looks more confusing than it needs to be.

First, the contradiction thing. It's just.. An error/panic, why? Anyway, fine.

Then, there's the whole premise of 1..m: I'm still not sure if the size needs to be known upfront or not. Looking at it a bit more, it seems like no, you don't. You pick a threshold, and then depending on the size of the stream the probability changes. But the algorithm is described as if it had a single output, which is not the case(?).

And btw, I didn't know about the Chernoff bounds and delta/epsilon are not explained at all in the paper, which added to the confusion a lot.

Anyway, here's my take in Golang: https://github.com/betamos/distinct

I factored out the threshold parts into a helper instead, which makes a lot more sense than accidentally allocating too much memory.

Perhaps there should be a method for estimating the confidence/error rate. Nobody knows the size of a stream upfront, so it would make more sense to update these values as you go. Brain is not strong enough to implement it, but feel free to send a PR.


[I am one of the authors].

We have a follow-up work (admittedly, more technical) that can remove reliance on m completely: https://www.cs.toronto.edu/~meel/Papers/pods22.pdf

But yes, our theorems can be reworked to estimate the confidence/error rate; that's what Knuth did: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf


Didn't realize you were here, so let me be clear that I did overall find the paper so approachable that I could implement it with only a couple of outside pointers (also a little clever impl optimization around storing p, if you're curious). The above should be read more as "even this well-written, simplified paper is not necessarily trivial to understand for practitioners". So it's more of a general point about academic obscurity.

> But yes, our theorems can be reworked to estimate the confidence/error rate

I think that’s useful for practical implications. Also, for practical use, how does one decide the tradeoff between delta and epsilon? Perhaps it’s covered elsewhere, but I have a hard time intuiting their relationship.


I fully agree with you and this is indeed one of my criticisms of modern academic writing -- we tend to write papers that are just very hard for anyone to read.

So delta refers to the confidence, i.e., how often are you willing to be wrong, and epsilon is tolerance with respect to the actual count.

We have found that, in general, setting delta=0.1 and epsilon=0.8 works fine for most practical applications.


> First, the contradiction thing. It's just.. An error/panic, why? Anyway, fine.

I suspect due to the details of the proof. This condition looks likely to me for very small set cardinality, making the algorithm inappropriate for all-weather use. See page 3 of the paper where Algorithm 2 is introduced. They show that in the failure condition, the likelihood of the algorithm returning a value outside of the epsilon bounds is higher.

> Then, there's the whole premise of 1..m: I'm still not sure if the size needs to be known upfront or not.

m sizes the threshold, if it is too small, the error bounds guaranteed by the algorithm will be larger than expected, and vice versa if m is too large.

> And btw, I didn't know about the Chernoff bounds and delta/epsilon are not explained at all in the paper, which added to the confusion a lot.

Papers typically do not spend words on basics and those are standard concepts.

> Perhaps there should be a method for estimating the confidence/error rate.

You can't resize the stream easily because p implicitly depends on m via the thresh cardinality condition and if you were to change m then your p updates would not correspond. As a result you may not be able to rely on the error bounds. Note though that stream doesn't mean infinite or very large: take it to mean one item at a time. The point of this algorithm is to bound space complexity to something small. Have a look at thresh and note that log2(1e50) is just 166: if you did have a very large stream of indeterminate length you could also just pick a very large m.


> "The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks."

If you're saying it's just for "undergraduates and textbooks", as opposed to just being simple enough for them to use but not limited to them, would you mind explaining what makes it useful for undergrads but not for professionals?


My interpretation from the paper is that this algorithm is simpler than other options but also worse. So in a professional context you'd use one of those instead


From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."


That still only speaks to it being simple enough for students, not whether it's too simple for any other use vs. useful enough that students who learn it will spend the rest of their lives using it.

For example word processor software is commonly described as simple enough for children to use at school, that doesn't mean that word processor software is of no use to adults.


Simplicity in an algorithm has limited direct impact on its usefulness in industry where libraries are prevalent.

Consider mapping structures with Θ(k) expected lookup times. The simplest implementation is a hash table with chaining. Open-addressing is a bit more complicated, but also more common. Tries, which have O(k) worst-case lookup times are often covered in Undergraduate courses and are definitely easier to analyze and implement than forms of open-addressed hash-tables that have O(k) guarantees (e.g. Cuckoo hashing).


The reason it's too simple for most real-world use is that HyperLogLog is the "good" version of this technique (but it is harder to prove that it works).


>The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.

Doesn't seem like it. Seems like an algorithm (similar to other approximate cardinality estimation algorithms) with huge applicability.


From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."


That just says that this is also a simpler and more accessible algorithm, suitable even for undergraduate textbooks.

Not that this is just useful for textbooks, a mere academic toy example, which would be something else entirely.

This is both accessible AND efficient.



Given the topic of the paper[0], the footnote is especially charming:

> the authors decided to forgo the old convention of alphabetical ordering of authors in favor of a randomized ordering, denoted by r⃝. The publicly verifiable record of the randomization is available at https://www.aeaweb.org/journals/policies/random-author-order...

[0]: https://arxiv.org/pdf/2301.10191

edit: formatting


Is it me or is the description of the algo wrong?

    > Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word;
If I follow this description of "check if exists in list -> delete":

    if hash_set.contains(word) {
        if !keep_a_word(round) {
            hash_set.remove(word);
            continue;
        }
    } else {
        hash_set.insert(word.to_string());
    }
The algorithm runs for ~20 iterations:

    Total word count: 31955 | limit: 1000
    End Round: 20, word count: 737
    Unique word count: 7233
    Estimated unique word count: 772800512
But if I save the word first and then delete the same word:

    hash_set.insert(word.to_string());

    if !keep_a_word(round) {
        hash_set.remove(word);
        continue;
    }
It gets the correct answer:

    Total word count: 31955 | 1000
    End Round: 3, word count: 905
    Unique word count: 7233
    Estimated unique word count: 7240


I got the same problem.

When implementing the exact method as described in quanta magazine (without looking at the arxiv paper), I always had estimates like 461746372167462146216468796214962164.

Then after reading the arxiv paper, I got the correct estimate, with this code (very close to mudiadamz's comment solution):

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = set()
    for k in L:
        if k in mem:
            mem.remove(k)
        if np.random.rand() < p:
            mem.add(k)
        if len(mem) == thresh:
            mem = {m for m in mem if np.random.rand() < 0.5}
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Or equivalently:

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = []
    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Now I found the problem with the Quanta Magazine formulation. By reading:

> Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

we want to write:

    for k in L:
        if k not in mem:
            mem += [k]
        else:
            if np.random.rand() > p:
                mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
whereas it should be (correct):

    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:    # without the else
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
Just this little "else" made it wrong!


Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them]

Your fix is indeed correct; we may want to have a while loop instead of "if len(mem) == thresh", as there is a very small (but non-zero) probability that the length of mem is still thresh after executing: mem = [m for m in mem if np.random.rand() < 0.5]

["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]


Quanta:

    Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

To:

    Round 1. Keep going through Hamlet, but now flipping a coin for each word. If it's tails, delete the word if it exists; heads, and add the word if it's not already on the list.

Old edit:

    Round 1. Keep going through Hamlet, adding words but now flipping a coin immediately after adding it. If it’s tails, delete the word; heads, and the word stays on the list.


> adding words but now flipping a coin immediately after adding it

Edit: I thought your formulation was correct but not really:

We flip the coin after adding, but we also flip the coin even if we didn't add the word (because it was already there). This is subtle!

wrong:

    if k not in mem:
        mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
wrong:

    if k not in mem:
        mem += [k]
    else:
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if np.random.rand() > p:
        mem.remove(k)


The following is also not correct.

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)
Your final solution is indeed correct, and I think more elegant than what we had in our paper [I am one of the authors].


Ah, I'm using a set instead of a list, so I just always add and then do the coin-toss removal.


Was just now solving it and came to see if others had the same issue. Yep, you are right.

    function generateRandomNumbers(c, n) {
      let randomNumbers = new Array(c);
      for (let i = 0; i < randomNumbers.length; i++) {
        let randomNumber = Math.floor(Math.random() * (n + 1));
        randomNumbers[i] = randomNumber;
      }
      return randomNumbers;
    }
    function run(w, wS, m, r) {
        function round(r) {
            while(wS.size < m) {
                const next = w.next()
                if (next.done) return true;
                wS.add(next.value)
                prune(next.value, r)
            }
            return false
        }
        function prune(v,r) {
            for (let i = 0; i < r; i++) {
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(v)
                }
            }
        }
        function purge(wS) {
            const copy = new Set(wS)
            copy.forEach(ith=>{
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(ith)
                }
            })
        }
        const done = round(r);
        if (!done) {
            purge(wS)
            return run(w, wS, r+1,m)
        }
        console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
    }
    const memory = 1000
    const words = generateRandomNumbers(3000000,15000)
    const w = words[Symbol.iterator]() // create an iterator
    const wS = new Set();
    run(w,wS, memory,0);


Noticed an error:

    return run(w, wS, r+1,m)
Should be changed to:

    return run(w, wS, m, r+1)


Estimating the amount of unique elements in a set and counting the amount of unique elements in a set are very different things. Cool method, bad headline.


They’re not very different things; the terms are used interchangeably in most contexts because in the real world all counting methods have some nonzero error rate.

We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

I kind of think the mythology of the ‘countless stones’ (https://en.wikipedia.org/wiki/Countless_stones) is a sort of folk-reminder that you can never be too certain that you counted something right. Even something as big and solid and static as a standing stone.

The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.


> the terms are used interchangeably in most contexts

Counting and estimating are not used interchangeably in most contexts.

> because in the real world all counting methods have some nonzero error rate.

The possibility that the counting process may be defective does not make it an estimation.

> We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

We talk about counting votes in elections because votes are counted. The fact that the process isn't perfect is a defect; this does not make it estimation.

> That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

No. Exit polling is estimation. Vote counting is counting. Vote recounting is also counting, and does not necessarily impose a tighter error bound, nor necessarily derive a different number.

> The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

So like, computers? Regardless, this is wrong. Estimating something and counting it are not the same thing. Estimation has uncertainty, counting may have error.

This is like saying addition estimates a sum because you might get it wrong. It's just not true.


So, IEEE floating point doesn’t support ‘addition’ then.


IEEE 754 defines an exact binary result for the addition of any two floats.

That this bit-identical result is not the same operation as addition of real numbers is irrelevant, because floats aren't reals.

f1 + f2 is not an estimation. Even treating it as an approximation will get you into trouble. It's not that either, it's a floating-point result, and algorithms making heavy use of floating point had better understand precisely what f1 + f2 is going to give you if they want to obtain maximum precision and accuracy.


Cool, so next time I have numbers that aren't reals to perform math on, I can use floats.


Or if you have numbers that aren't integers to perform math on, you can use integers.

It's not a new problem, and it isn't specific to floats. Computers do discrete math. Always have, always will.


Come on. There is a fundamental difference between trying to get an exact answer and not trying to get an exactly correct answer.


It’s not a fundamental difference, it’s a fundamental constraint.

There are circumstances - and in real life those circumstances are very common - where you must accept that getting an exactly correct answer is not realistic. Yet nonetheless you want to ‘count’ things anyway.

We still call procedures for counting things under those circumstances ‘counting’.

The constraints on this problem (insufficient memory to remember all the unique items you encounter) are one such situation where even computerized counting isn’t going to be exact.


I agree with you, but we are talking theory here. The algorithm doesn't count, it estimates.

You can make an algorithm that counts, you can make an algorithm that estimates, this is the second.


Estimation is counting with error bars.

Frankly, most of what you consider counting in your comment needs error bars - ask anyone who operated an all-cash cash-register how frequently end-of-day reconciliation didn't match the actual cash in the drawer (to the nearest dollar.)

The following is a list, from my personal experience, of presumably precisely countable things that turned out not to be: the number of computers owned by a fairly large regional business, the number of (virtual) servers operated by a moderately sized team, the number of batteries sold in a financial year by a battery company.


Counting is a subset of estimation, not a synonym.

If I estimated the number of quarters in a stack by weighing them, that would be different from estimating the number of quarters in a stack by counting them. Both methods of estimation have error bars.

The list you provide is of categories that don't have clear definitions. If you have a sufficiently clear definition for a category given your population, it has a precise count (though your counting methodologies will still be estimates.) If your definition is too fuzzy, then you don't actually have a countable set.


It's close enough to counting for the purposes of a magazine article like uuids are close enough to being unique for the purposes of programming.


The algorithm's accuracy scales with the ratio of memory to set size, so you don't actually know if it is "close enough" without an estimate of the set size.

I think the headline is clickbaity, and the article makes no effort to justify its misuse of the word 'counting'. The subheadline is far more accurate and doesn't use that many more words.


I think I get your point completely, yet I'm not getting through.

Would you agree that 1+1=2? Or that pi is 3.14159...? These are mathematical truths, but they quickly crumble in the real world. One apple plus one apple doesn't just equate to double the apple; no two apples are ever the same to begin with, and there are no real perfect circles out there either. There is still value to those mathematical truths in that they make it evident that they are perfectly precise, and that it is real-world interaction which may bring error to the table.


Counting and estimation are different by definition. One is a full enumeration, the other an extrapolation from sampled data. In both cases 'accuracy' is a factor. Even if we are counting the number of stars, it is still a different technique from estimating the number of stars.

I could try to count fibers in a muscle or grains of sand on the beach; chances are accuracy would be low. One can be smart about technique for more accurate counts, e.g. get 10M sand counters and give them each 1kg of sand, whose grains they then count with tweezers and a microscope. That is counting. At the same time, we could find an average count of grains in 1kg of sand from a random 100 of our counters, and then estimate what an expected total would be. The estimate can be used to confirm the accuracy of the counts.


They are not really as far apart as you think. At small numbers, yes, they are distinct. At large enough numbers, they are in all practicality the same thing. E.g., what's the population of the US?


And by that definition this is a counting algorithm.


True - for (relatively) small numbers. For large (huge) numbers estimation is usually considered to be equivalent to counting, and the result is sometimes represented using the "scientific" notation (i.e. "floating-point") rather than as an integer. For example, the mole is an integer whose value is only known approximately (and no one cares about the exact value anyway).


As of May 2019, the mole has an exact value, and Carbon-12's molar mass is the empirically-determined value.


This doesn't justify estimation to be equivalent to counting even if some mathematicians consider them to be the same. Floating points are for estimation. Integers are for counting. The two are not the same, not even for large numbers.


"Equivalent" and "the same" are sometimes equivalent. (Or the same.)

It depends on what the meaning of the word 'is' is.

https://libquotes.com/bill-clinton/quote/lby0h7o


It's an approximation, not an estimation.


Actually, my understanding is that it is an estimation because in the given context we don't know or cannot compute the true answer due to some kind of constraint (here memory or the size of |X|). An approximation is when we use a simplified or rounded version of an exact number that we actually know.


Wikipedia is on your side:

"In mathematics, approximation describes the process of finding estimates in the form of upper or lower bounds for a quantity that cannot readily be evaluated precisely"

This process doesn't use upper and lower bounds.

However, it still seems more like approximation than estimation to me because of this:

“Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.

It seems that in estimation the answer should be unknowable without additional information, whereas in this case it's just a matter of resolution or granularity because of the memory size.

Anyhoo ...

EDIT: also the paper says "estimate" and the article says both "approximate" and "estimate" at different times so it seems everyone except me thinks it's either an estimation or that estimation and approximation are interchangeable.


Still very different things, no?


It's the same thing at different degrees of accuracy. The goal is the same.


Still, counting things and counting unique things are two different procedures.


For someone who's pretty well-versed in English, but not a math-oriented computer scientist, this seems like a distinction without a difference. Please remedy my ignorance.


My GP was wrong, but the words are different.

Estimation is a procedure that generates an estimate, which is a kind of approximation, while an approximation is a result value. They are different "types", as a computer scientist would say. An approximation is any value that is justifiably considered to be nearly exact. ("prox" means "near". See also "proximate" and "proxy".)

Estimation is one way to generate an approximation. An estimate is a subtype of an approximation. There are non-estimation ways to generate an approximation. For example, take an exact value and round it to the nearest multiple of 100. That generates an approximation, but does not use estimation.


I’m not sure the linguistic differences here are as cut and dried as you would like them to be. Estimate and approximate are both verbs, so you can derive nouns from them both for the process of doing the thing, and for the thing that results from such a process.

Estimation is the process of estimating. It produces an estimate.

Approximation is the process of approximating. It produces an approximation.

You can also derive adjectives from the verbs as well.

An estimate is an estimated value.

An approximation is an approximate value.

But you're right that the 'approximate' terms make claims about the result - that it is in some way near to the correct value - while the 'estimate'-derived terms all make a claim about the process that produced the result (i.e. that it was based on data that is known to be incomplete, uncertain, or approximate).


The authors of the article disagree with you.


The authors of the paper disagree with me, the authors of the article don't (they use both approximate and estimate, but the paper does say estimate).


I don't know a word or phrase for this, but I really enjoy any examples of "thinking outside the box" like this because it's something I struggle with in my professional career. Learning not only the right ways to solve problems, but figuring out the questions to ask that make solving the problems you have easier or even in some cases possible. In this case, it's hey, we don't need exact numbers if we can define a probabilistic range given defined parameters. Other problems are gonna have other questions. I guess my hope is that if I see enough examples I'll be able to eventually internalize the thought process and apply it correctly.


To be fair, this was a university research team. Literally, a team of folks who can, all day everyday, iterate over a single topic using the Scientific Method.

If you were paid by a big company to sit at a whiteboard all day with a team of equally intelligent engineers, I'm sure you'd come up with SOMETHING that would look like an "outside the box" solution to the rest of the world.

However, most of us are paid to work the JIRA factory line instead, which limits the amount of time we can spend experimenting on just one single problem.


I think it's generally thought of as "lateral thinking", Edward de Bono has written a few books about it you might find interesting.


And some more commonplace words like "creativity" (as in "creative solution") etc. would apply.


any particular one you'd recommend?


I think the classic is "Lateral Thinking: A Textbook of Creativity"


>What if you’re Facebook, and you want to count the number of distinct users who log in each day, even if some of them log in from multiple devices and at multiple times?

Seems like a bad example of when this algorithm could be useful in practice.

If you already know you will want this info when designing the login process, it's simple: keep track of the date of last login for each account, and increment the unique user counter when the stored value is different from the current one.

And even if not, it should still be possible to "replay" a stream of login events from the database later to do this analysis. Unless maybe you already had years worth of data?
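
For illustration, a minimal sketch of the exact bookkeeping I have in mind (all names hypothetical):

    from datetime import date

    last_login = {}          # account_id -> date of most recent login
    daily_unique_logins = 0  # reset this at the start of each day

    def record_login(account_id, today=None):
        # Count the account only the first time it logs in on a given day.
        global daily_unique_logins
        today = today or date.today()
        if last_login.get(account_id) != today:
            daily_unique_logins += 1
        last_login[account_id] = today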


No, what you propose needs to "keep track of the date of last login for each account", so you need memory of the size of your user population. The idea is to perform it with much smaller, fixed amount of memory.


On the topic of counting things, I would like to mention this efficient and easily-implemented algorithm for finding the top-k items in a stream, which I think is perhaps not as well known as it should be:

A Simple Algorithm for Finding Frequent Elements in Streams and Bags

Karp, Shenker & Papadimitriou

https://www.cs.umd.edu/~samir/498/karp.pdf


> the top-k items in a stream

Hmm, this is phrased in a way that sounds different (to my ears) than the abstract, which says:

> it is often desirable to identify from a very long sequence of symbols (or tuples, or packets) coming from a large alphabet those symbols whose frequency is above a given threshold

Your description suggests finding a fixed number of k items, with the guarantee that they will be the top ones. The abstract sounds like it determines an a priori unknown number of items that meet the criterion of having a particular value greater than a given threshold.

So "find the 100 oldest users" vs "find all users older than 30".

Am I misunderstanding you or the abstract? (English is not my first language)


It's been a while since I used this in anger, but my recollection is that it maintains a fixed-size set of items. The same algorithm is in section 3.2 of https://erikdemaine.org/papers/NetworkStats_TR2002/paper.pdf and might be clearer there.
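
From memory, the fixed-size-set version looks roughly like this (a sketch, not checked against either paper, so treat the details with suspicion):

    def frequent_elements(stream, k):
        # Keep at most k-1 counters. Any element occurring more than
        # len(stream)/k times is guaranteed to survive in `counts`; false
        # positives are possible, so a second pass can confirm the counts.
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1
            elif len(counts) < k - 1:
                counts[x] = 1
            else:
                # No free counter: decrement everything and drop zeros.
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        return counts

    print(frequent_elements("abacabadabacaba", k=3))  # 'a' occurs more than 15/3 times, so it survives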


Thank you, that is much easier to follow indeed! Very elegant, I agree :)


Computer scientists invent a memory-efficient way to estimate the size of a subset


It seems fast too, as you can use fewer rounds of flips and get an estimate, meaning you may not need to go through the entire “book” to get an estimate of distinct words


The subset is very important, it is the subset of unique elements.


It's not a "subset," it's "the set of equivalence classes."


I think it's by definition that the unique elements of a set form a subset. Nonetheless, the clarification is a good point.

To take a stab at a yet more correct statement, maybe something like the following captures it: in the algorithm noted, we are extrapolating the number of unique elements by looking at additional subsets, which are themselves equivalence classes.


This is implied by the word "set" in "subset". Otherwise it would've been a multiset.


We are very grateful for the interest, and I thought I would link to some relevant resources.

Paper: https://arxiv.org/pdf/2301.10191

Knuth's note: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

Talk Slides: https://www.cs.toronto.edu/~meel/Slides/meel-distinct.pdf

Talk Video: https://www.youtube.com/watch?v=K_ugk7OW0bI

The talk also discusses the general settings where our algorithm resolved the open problem of estimation of the union of high dimensional rectangles.


When do we stop calling this counting and start calling it estimation?


Seems this is one of those things like UUIDs where we rely on it being very unlikely to be wrong, because statistics.

> the accuracy of this technique scales with the size of the memory.

I wonder if that's proportional to the number of distinct items to count, though.

> if the [memory] is so big that it fits all the words, then we can get 100% accuracy

Yes, but then the algorithm isn't being used any more, that's just normal counting.

They counted the distinct words in Hamlet with a memory size of 100 words, about 2.5% of the number to find, and got a result that was off by 2. If you do the same with the whole of Shakespeare, again using 2.5% of the memory needed to hold all the distinct words, is the accuracy better?

Anyway, this is limited to counting, and doesn't help list what the words are, though quickly counting them first is perhaps a way to speed up the task of actually finding them?


When it is actually impossible to count something, and when the error between an estimate and an exact answer is not significant, the pedantic distinction is not helpful.

The same thing happens with measurement. No measurement is ever exact. If I said I was measuring a count, someone would probably correct me to say that I am counting.

Common speech is like that.


As soon as people start reading past the headline.


tbh, the title (and introduction) did a lot to dissuade me from finishing the (really good) article. It was actually informative; why dress it up as SEO blogspam?


Presumably so it is optimised for search engines and people find it.

Publishers generally have good data about where their audience comes from. They wouldn't do this if it wasn't the best way they know of to maximise readership.


I've been following Quanta for some time, I'm sure they don't care about SEO and number of visitors, they care about the quality of texts. They write for the general audience, and even though they try to preserve the scientific accuracy, sometimes their explanations may seem oversimplified and even a bit confusing when you come from the same field. It's not clickbait, it's their popular science style.


So they can get paid for their work? Are you giving them anything?


It's Quanta. Their mission is to make laypeople like math (not understand math), so they drown the math in sugar.


Yes, even the subtitle states "a simple algorithm for estimating large numbers of distinct objects".


This doesn't excuse the title though.


Python implementation:

  def streaming_algorithm(A, epsilon, delta):
      # Initialize parameters
      p = 1
      X = set()
      thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))

      # Process the stream
      for ai in A:
          if ai in X:
              X.remove(ai)
          if random.random() < p:
              X.add(ai)
          if len(X) == thresh:
              X = {x for x in X if random.random() >= 0.5}
              p /= 2
              if len(X) == thresh:
                  return '⊥'

      return len(X) / p


  # Example usage
  A = [1, 2, 3, 1, 2, 3]
  epsilon = 0.1
  delta = 0.01

  output = streaming_algorithm(A, epsilon, delta)
  print(output)


I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.


The names are literally taken from the paper.


Well, the paper also contains the code, so I doubt anyone who looked at the paper cares about this paste; for folks who did not read the paper, this is not very readable.


> Name stuff well

OP is following the same variable names as the article. I prefer that over changing the variable names and then having to figure out which variable name in the code maps to which in the article.


Speaking of, one of my favorite discoveries with Unicode is that there is a ton of code points acceptable for symbol identifiers in various languages that I just can't wait to abuse.

>>> ᚨ=3

>>> ᛒ=6

>>> ᚨ+ᛒ

9


You're even using a and b, very good.


Ah yes, programming like the vikings intended.


That's not streaming if you're already aware of the length of the iterable.


In Python, you can simply substitute `A` with an iterable or generator object, which can be of unknown length.


But for this algorithm, you need to know the total length ("m") to set the threshold for the register purges.

Does it still work if you update m as you go?


Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878

Best case error is of course zero, but if you look at my output then you will see, as I did, that the worst case is a very conservative bound (i.e. 15X bigger than what might "tend to happen"). That matters a lot for "space usage" since the error =~ 1/sqrt(space), implying you need a lot more space for lower errors. 15^2 = 225X more space. Space optimization is usually well attended for this kind of problem. And, hey, maybe you know something about the input data distribution?

So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or even better, "measuring as you go" enough distributional metadata to get a tighter error bound. The latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign Your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.


You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.
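
Concretely, that just means inverting the threshold formula from the Python snippet above and solving for epsilon with delta held fixed (my algebra, so double-check it):

    import math

    def epsilon_after_the_fact(thresh, m, delta):
        # Invert thresh = ceil((12 / eps**2) * log(8 * m / delta)) for eps,
        # ignoring the ceiling; m is the actual length of the stream.
        return math.sqrt(12 * math.log(8 * m / delta) / thresh)

With the toy values from the example above (m=6, delta=0.01), a thresh of about 10172 gets you back to epsilon ≈ 0.1, modulo the ceiling.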


You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing (or, I guess, at whatever checkpoints you want if it's continuous).

In this example, they have the length of the list and choose the threshold to give them a desired margin of error.




  return '⊥'
what's this?


An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.

I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:

    from random import random

    def estimate_uniques(iterable, window_size=100):
        p = 1
        seen = set()

        for i in iterable:
            if i not in seen:
                seen.add(i)
            if random() > p:
                seen.remove(i)
            if len(seen) >= window_size:
                seen = {s for s in seen if random() < 0.5}
                p /= 2
        return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
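
For example, feeding it a generator of unknown length (made-up numbers, and with a window of only 100 the estimate is rough):

    from random import randrange

    # 100,000 draws from 5,000 possible values, supplied as a generator so the
    # length is never known up front; nearly all 5,000 values should appear.
    stream = (randrange(5000) for _ in range(100_000))
    print(estimate_uniques(stream, window_size=100))  # roughly in the vicinity of 5000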


In some symbolic logic classes, that character, "bottom", represents "false", and the flipped version, "top", means true.

Don't know what they're getting at in the code, though.


>In some symbolic logic classes, that character "bottom" represents "false"

That's unfortunate, because in the study of computer programming languages, it means "undefined" (raise an error).


Not always. It is also the uninhabited bottom type.


My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters, so it would be regrettable if logic teachers actually did use ⊥ to mean false because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.

I have no idea what your point is.


Once again proving the need for comments in code. Especially for comments that are more useful than "initialize parameters"


An easy way to identify who copies code without understanding it.


You can just replace it with something like: print ('Invalid thresh or something')


This however looks scary so an innocent copy/paste programmer wouldn't touch it.


New leetcode hard question just dropped.


Is this ChatGPT? Also, I feel like this would be more useful if it included import statements.


  import math
  import random


HyperLogLog uses additions; it keeps sums. Thus, you can subtract one HLL's sums from another's. This is useful if the stream supports deletion. Streams with deletions can be found in log-structured merge trees, for one example, so one can estimate the count of distinct elements in all of the LSM tree hierarchy.

The algorithm in the paper does not allow for deletions.

Also, if one counts statistics of the stream of large elements (say, SHA-512 hashes, 64 bytes per hash), this algorithm requires some storage for elements from this stream, so memory requirement is O(table size * element size).


Huh, that's a clever twist on reservoir sampling. Neat.


From the paper [0]:

> We state the following well-known concentration bound, Chernoff bound, for completeness.

Which variant of the Chernoff bound is this? This is almost the (looser variant of the) multiplicative form, but it's not quite right (per the use of 1+delta instead of a single parameter beta). In particular, that bound is only guaranteed to hold for delta >= 0 (not beta = 1 + delta > 0 as asserted in the paper)

[0] https://arxiv.org/pdf/2301.10191

[1] https://en.wikipedia.org/wiki/Chernoff_bound#Multiplicative_...

edit: to be clear: I'm not sure at all whether this breaks the proof of correctness, although I'm having a bit of difficulty following the actual details (I think I'd need to work through the intermediate steps on paper).


The algorithm is simple, which is why it is also suitable for textbooks. However, it is far from representing the state of the art. In case you are interested in the latest development, I have recently published a new distinct counting algorithm where the estimate does not depend on the processing order, which can be merged and requires only 3.5kB of memory to guarantee a standard error of 1.13% for counts up to exa-scale: https://arxiv.org/abs/2402.13726


I jumped on the code generation bandwagon pretty soon after ChatGPT was released. I like to think of it as similar to a catalyst in a chemical reaction: it lowers the activation barrier to writing code, especially in contexts that are new to you. It makes it easier to just get started. However, it struggles with solving actual problems or even putting together the right context. If you know how to break the problem down into simpler steps, it can help with those. It won't write a new kernel driver for you.


Wrong thread?


IDK if my explanation is correct, but I do believe it is. It goes as follows.

Imagine that you have a container of potentially limitless capacity. The container starts with a small capacity, equal to the real limited capacity that the real algorithm uses. As you add elements, when the container is full, its capacity is doubled, but all elements are then placed in a random position.

When you're done, you're told the occupancy of the subset of the large container corresponding to the initial size, and how many times the container size was doubled. Multiplying that occupancy by two to the power of the number of doublings gives you an approximation of the real size.

The small catch is that in the actual algorithm, due to the discarding, the final number of elements, the occupancy, is somewhat erroneous.

EDIT

Another way to say this: you've got a container of limited capacity S. When full, you "virtually" double its size and then randomly move elements over the full "imaginary" size of the virtual container. So after the first filling, you end up with about 1/2 of the elements. After the second filling, 1/4, etc. Also, since your "virtual" container is now larger, when you add a new element, there is only a 1/2^n chance it will be placed inside your limited-capacity view of the entire virtual container.

At the end, the approximate real count is the number of elements you've got multiplied by 2 to the power of the number of size doublings.

Again, it is as if you have a small window into a limitless container.


The algorithm uses less memory, but more CPU time because of rather frequent deletions, so it's a tradeoff, not just a generally better algorithm, as the article may suggest.


the list is small so the cost of deletions should be small.


"Computer scientists invent an efficient new way to approximate counting large unique entries" - fixed the title


Ah, so Thanos was just conducting a census.


I see what you did there.


Does anyone else notice that quanta consistently has good illustrations and diagrams for their articles? Good visualizations and graphics can do extraordinary things for the understandability of a topic -- people like 3Blue1Brown understand this, and I'm glad quanta puts effort into it.


This is actually a really sad, redundant example of the regression of education in the United States.

This "sure to be lauded" technique is pretty simple and has been exploited by simpleton gamblers for a very long time.

It's basically simple math disguised as "mentalism".

But since you, "very new to " "the real world" people figure out your folly. I promise I wont laugh. But I am laughing at how incredibly hollowed out the education system of the USA is when a simple compund method for an old trick, substitutes itself for math.

The "math" involved is purely low IQ points on the inventors side.


After skimming Knuth's paper — does the algorithm work if values are hashed, that is, the "uniform deviate" is selected deterministically for each unique value of the stream?


Not sure which Knuth paper you're referring to but skimming through the article my understanding is this algorithm works /only/ if the values are hashable. IOW how else does one define unique/distinct values ?



https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

note how "u"s are selected every time value is not in a list. I don't read it as being a hash.


I think the analysis relies on independent random "u"s, even for the same key.


In the algorithm depicted in the paper, if no elements manage to be eliminated from the set, why not just retry rather than return ⊥?


Retrying would cause errors in the math. Personally, I think a better modification would be to go into the next round right away.


The counting/estimation technique is rather like a floating point number: an integer exponent k and a mantissa of a population.


Whipped up a quick PHP version for fun:

https://github.com/jbroadway/distinctelements


New interview question: Tell me how you would estimate the number of unique words in Hamlet using only 100 elements.


I wish this was a joke...


> The trick, he said, is to rely on randomization.

> When the space is full, press pause and flip a coin for each word. Heads, and the word stays on the list; tails, and you delete it.

I wasn't expecting it to go that far: randomization. How can you verify whether the answer is good? Only by approximation, maybe...


The proof is pretty straightforward and is included in the paper https://arxiv.org/abs/2301.10191


Yes, the result is an estimation.


Thanks. Just wanted to be sure I didn't misunderstand.


Did you read the (entire) article?


While the algorithm is interesting and I commend the authors for the algorithm, it should be noted that this is a guesstimate only. So, it's not really an efficient way to count distinct values but an efficient way to get an approximate count of distinct values.


Feels like reservoir sampling to me


I took a crack at implementing this in Go. For anyone curious I settled for algorithm 2 as I can just use a map as the base set structure.

https://github.com/tristanisham/f0


Does finding the number of unique elements in a set actually require comparison of each element with everything else? Can't you use a hashtable? For every element, add it to the table (ignore if already exists), and finally, take a count of keys.


> Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.


Imagine 1PB of data where you expect 30% of it to be unique. That needs 300TB of RAM to store the unique elements. Keep in mind the values in a hash table are the elements themselves, so a hashtable of perhaps 300TB. Doing that without that much RAM, even swapping to disk, can be tough.


Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.


Using a hashtable is effective because you only compare elements within their hash buckets, not the entire set. However, they can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.


This algorithm still generates a lot of random numbers. I would guess that this is much less overhead than hashing, but it still seems like it could be significant.


That is fine when you have say 1 million values and only 1000 are unique. But when you have 1 million values and about 900 thousand are unique you are putting more or less the whole data set into memory.


Imagine a million elements. How big must your hashtable be? The article explains it very well, did you miss it? It's a way to save memory.

But to be honest, I implemented it, ran it on Hamlet, and it's very wrong; it's barely useful, but maybe fine if you just need a vague idea...


How big was your thresh? I found it pretty accurate: https://news.ycombinator.com/item?id=40388878


This actually comes at a good time. I'm currently refactoring a system that counts visits using inserts into a table, with a tuple of date and IP. I was planning to replace it with a HLL approach, but this is really interesting.


It is quite amazing that after we had so many talented researchers working for decades on this problem, the authors were able to come up with such an elegant and simple solution. Another proof that research is very necessary and useful.


I found myself wondering if this approach could be applied to CFD (computational fluid dynamics) methods to reduce the total volume of data points while still getting approximately close to the correct final answer?


Can some native speaker tell me: is "count" the new "guess"?


In the post-GPT world? Yes. Wait, maybe actually no? Depends on context. https://www.reddit.com/r/ChatGPT/comments/16jvl4x/wait_actua...


It's estimation, not counting.



Am I correct in thinking the accuracy of this method has more to do with the distribution in the sample than the algo itself? Meaning, given a 100% unique list of items, how far off would this estimate be?


It doesn't appear to be heavily dependent on the distribution. It mostly depends on how large your buffer is compared to the number of unique items in the list. If the buffer is the same size or larger, it would be exact. If it's half the size needed to hold all the unique items, you drop O(1 bit) of precision; at one quarter, O(2 bits), etc.


I don't think so. I think the point of the removal in

  X <- X \ {a_i}
  With probability p, X <- X U {a_i}
is that afterwards a_i is in X with probability p regardless of how many times it occurs in the stream.
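
A quick simulation (not a proof, and it holds p fixed to isolate just this update step) makes that easy to check: a value repeated many times should end up in X just as often as a value seen once.

    import random

    def membership_rate(stream, p, trials=10_000):
        # Run only the discard-then-readd update (no halving) many times and
        # report how often each distinct value is in X at the end.
        hits = {v: 0 for v in set(stream)}
        for _ in range(trials):
            X = set()
            for a in stream:
                X.discard(a)
                if random.random() < p:
                    X.add(a)
            for v in X:
                hits[v] += 1
        return {v: hits[v] / trials for v in hits}

    # 7 appears 100 times, 3 appears once; both rates come out near 0.5.
    print(membership_rate([7] * 100 + [3], p=0.5))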


I wonder how the error changes if you start a new round just before finishing the stream, compared to if the last word in the stream just fills the buffer and ends a round.


very interesting solution. A perfect example of how even some very complicated problems can sometimes have simple solutions. Will definitely try to write an implementation of this.

The only minor downside to this is that it's obviously not deterministic since it depends on chance. But for many applications where the dataset is so big it doesn't fit in memory, that's probably just a tiny rounding error anyway.


With other count-distinct algorithms, you can do unions and intersections on the "sets" (theta sketches, even bloom filters).

Can you do this with CVM?


Unfortunately not, and that's an interesting open problem, as other count-distinct algorithms don't work for "structured sets", while this one does.

https://dl.acm.org/doi/10.1145/3452021.3458333


I wish we had a theory connecting randomness with time/space complexity.

Intuitively I think the key to many computational limitations is making use of randomness.


Your intuition is correct. See https://en.wikipedia.org/wiki/BPP_(complexity)


Which books are in the background? I can't read their titles because the picture is blurred too much. Help please! :)


Can LLM’s invent a new simple algorithm like this which it has never seen before? I tried to induce ChatGPT to invent this algorithm, giving it hints along the way, but it came up with something else that I’m unsure is correct. Then again, most humans wouldn’t be able to do that either. But how intelligent is AI if it can’t invent anything truly novel and significant?


parallelism is an important consideration on modern systems. While the memory clearing step parallelizes, the inner loop seems hard to parallelize?

also, how do we combine two sketches? (another form of parallelism)


Do they assume any particular distribution of the items?

Otherwise Nassim Taleb would like a word with them.


No, and Taleb is not relevant for this.


It is if the distribution is extreme.


If Hamlet were 100000000 times the word Hamlet and 1 time the word pancake, it would still give a good estimate, even if pancake gets 0.


"In a recent paper" - really? The paper first appeared in ESA 2022. The revised version (with some errors fixed) is from May 2023. To be fair, compared to the rate of progress in this area between Flajolet & Martin (1985) and the previous SOTA (Blasiok, 2018), a couple of years of sitting on this news is not a lot.


Sounds like this qualifies as "recent" to me.


This Nim program might clarify some things for someone.

    import std/[random, math, sets, times]; type F = float
    
    template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
      ## Return (unbiased estim. cardinality of `it`, 1-alpha CI ebound*(1 +- e)).
      var x1 = initHashSet[type(xfm)](k); var x  = x1.addr
      var x2 = initHashSet[type(xfm)](k); var y  = x2.addr
      var p  = 1.0          # p always 2^-i means could ditch FP
      var n  = 0            #..RNG & arith for bit shifts/masks.
      let t0 = epochTime()
      for e in it:
        inc n   #; let e = xfm # to compute randState.next here
        x[].excl e          # 'd be nice to reduce to 1 hash by
        if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
          x[].incl e        #..add new keys.  p -> 0 anyway.
        while x[].len == k: # KnuthLoop not CVMIf guards against
          for e in x[]:     #..unlikely case of freeing nothing.
            if rand(1.0) < 0.5: y[].incl e
          p *= 0.5
          swap x, y; y[].clear # ↓ e.bound conservative by ~15X
      (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
       (epochTime() - t0)*1e9/n.F)
    
    when isMainModule:  # Takes nItem/trial dupMask space nTrial
      when defined danger: randomize()
      import std/[strutils, os, strformat]
      let n   =  if paramCount()>0: parseInt(paramStr(1)) else: 100000
      let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
      let k   =  if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
      let m   =  if paramCount()>3: parseInt(paramStr(4)) else: 35
      var ex  = initHashSet[uint64]()
      var xs  = newSeq[uint64](n)
      for j in 1..m:
        xs.setLen 0; ex.clear
        for i in 1..n: xs.add randState.next and msk
        for e in xs: ex.incl e
        let (apx, err, t) = uniqCEcvm(xs, uint64, k)
        let ap = apx.F; let an = ex.len.F
        echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
First few lines of laptop output:

    51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
    51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
    51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
    51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
Ditching the floating point RNG & arithmetic might be able to lower that time per element cost to more like 5 ns or maybe 2GiB/s for 8B ints. The toggle back & forth between old & new HashSet is just to re-use the same L1 CPU cache memory.

Note also that the error bound in the CVM paper seems very conservative. My guess is that it could maybe be tightened by 15+X (at the scale of those example numbers), but that might also require some kind of special function evaluation. Anyone have any ideas on that?

Also note that since this is just estimating unique identities, you can always just count unique 64-bit hashes of keys (or maybe even 32-bit hashes depending on your accuracy needs and cardinalities) if you care about the key space/key compares are slow.


Frankly, who can read this!? I am not sure what's worse: the multi-line comments spanning multiple lines of code, having multiple instructions on a single line, or the apparent disconnect from the pseudo-code in the article.


I would blame the majority of your criticism on the fact that HN is not the best place to read code. Also, syntax highlighting & basic familiarity with Nim helps.

His code is doing a few more things than necessary. The actual algorithm is inside the `uniqCEcvm` template. The `it` it receives is anything you can iterate over (a collection or an iterator). Multiple things in one line really only appear where they directly relate to the part at the beginning of the line.

The `when isMainModule` is Nim's way of Python's `if __name__ == "__main__"`. The entire part below that is really just a mini CL interface to bench different (random) examples. Final thing to note maybe, the last expression of a block (e.g. of the template here) will be returned.

And well, the style of comments is just personal preference of course. Whether you prefer to stay strictly below 80 cols or not, shrug.

I grant you that the usage of 2 sets + pointer access to them + swapping makes it harder to follow than necessary. But I assume the point of it was not on "how to write the simplest looking implementation of the algorithm as it appears in the paper". But rather to showcase a full implementation of a reasonably optimized version.

Here's a version (only the algorithm) following the paper directly:

    proc estimate[T](A: seq[T], ε, δ: float): float =
      let m = A.len
      var p = 1.0
      var thr = ceil(12.0 / ε^2 * log2(8*m.float / δ))
      var χ = initHashSet[T](thr.round.int)
      for i in 0 ..< m:
        χ.excl A[i]
        if rand(1.0) < p:
          χ.incl A[i]
        if χ.card.float >= thr:
          for el in toSeq(χ): # clean out ~half probabilistically
            if rand(1.0) < 0.5:
              χ.excl el
          p /= 2.0
          if χ.card.float >= thr:
            return -1.0
      result = χ.card.float / p


Thank you very much for having taken the time.. Your comments and function are both very helpful!


Could you pick 1000 randomly and repeat the process endlessly?


The title is misleading. This is about probabilistic counting, therefore about estimation. It is not clear if the efficiency benefit extends to exact counting. Counting and estimation are not the same concepts.



Obviously it wouldn’t


Which books are in background?


There is a big practicality problem I see with this algorithm. The thresh defined in the paper relies on the length of the stream. It seems to me that in a scenario where you have a big enough data set to want a solution that doesn't just store every unique value, you don't know the length. I did not make it through all the proofs, but I believe they use the fact that the defined threshold has the length in it to prove error bounds. If I were to use this in a scenario where I need to know error bounds, I would probably ballpark the length of my stream to estimate error bounds and then use the algorithm with a ballpark threshold depending on my system's memory.

Another practical thing is the "exception" if nothing is removed on line 6 in the original algorithm. This also seems needed for the proof, but you would not want it in production, though the chance of hitting it should be vanishingly small, so maybe worth the gamble?

Here is my faithful interpretation of the algorithm. And then a re-interpretation with some "practical" improvements that almost certainly make the provability of the correctness impossible.

    func CountUnique(scanner *bufio.Scanner, epsilon float64, delta float64, m int) int {

    X := make(map[string]bool)
    p := 1.0
    thresh := int(math.Ceil((12 / (epsilon * epsilon)) * math.Log(8*float64(m)/delta)))

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) == thresh {
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
            if len(X) == thresh {
                panic("Error")
            }
        }
    }

    return int(float64(len(X)) / p)
}

  func CountUnique2(scanner *bufio.Scanner, thresh int) int {

     //threshold passed in, based on system memory / estimates
    X := make(map[string]bool)
    p := 1.0

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) >= thresh {  // >= instead of == and remove the panic below
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
        }
    }

    return int(float64(len(X)) / p)
}

I tested it with Shakespeare's work. The actual unique word count is 71,595. With the second algorithm it is interesting to play with the threshold. Here are some examples.

threshold 1000 Mean Absolute Error: 2150.44 Root Mean Squared Error: 2758.33 Standard Deviation: 2732.61

threshold 2000 Mean Absolute Error: 1723.72 Root Mean Squared Error: 2212.74 Standard Deviation: 2199.39

threshold 10000 Mean Absolute Error: 442.76 Root Mean Squared Error: 556.74 Standard Deviation: 555.53

threshold 50000 Mean Absolute Error: 217.28 Root Mean Squared Error: 267.39 Standard Deviation: 262.84


If you don't know the length of the stream in advance, you can just calculate the margin of error when you're done, no?


You just solve for epsilon. That's what I did: https://news.ycombinator.com/user?id=cb321


At first I found this paragraph confusing:

> Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

Why would you delete a word that's already on the list by flipping coins? Doesn't this reduce the accuracy by counting fewer words than expected? And will the word be added to the list later?

After thinking about it for a while and reading the paper, I've finally developed a good mental model for how this algorithm works as below, which should convince even a high schooler why the algorithm works:

1. You are given a streaming set of elements from [n] (a finite set of n distinct elements). Now let's give each element a random uniform real number in [0,1]. This helps us choose the elements we want: if we choose all elements with the number below 0.5, we get about half of the elements.

2. Assume for a moment we have unbounded memory. Now we maintain a table of already seen elements, and for each element we keep the last real number attached with that element: so if an element A appears three times with the number 0.3, 0.7, 0.6, we keep A with the number 0.6.

3. Our memory is bounded! So we keep only the elements below a threshold 2^-r, that is, 1, 0.5, 0.25, etc. So at first we will keep all elements, but when we don't have enough memory, we will need to filter the existing elements according to the threshold. It's easy to see that when we decrease the threshold, only elements already in memory will meet the criterion and continue to exist in memory. No other elements in the stream will be below the threshold. Also note that whether an element is in memory depends only on its last occurrence. It can exist in memory for a while, get dropped because a later occurrence does not meet the threshold, and get back in.

4. We don't actually have a real number attached to every element! But we can pretend as if there is one. For each new element X from the stream, we replace the number in memory with its attached number, and we only care if its attached number is below 2^-r or not. If it is, it should be in our memory, and if it's not, it should be out. Once the number is in memory, it's a random number in [0,2^-r] and we care no further.

When increasing r, we only care about whether the number is in [0,2^{-r-1}]. This has exactly 1/2 probability. So each number has 1/2 probability of getting out, and 1/2 probability of being kept in memory.

5. Now it's easy to see that whether an element is in the memory depends solely on the real number attached to its last occurence. That is, each distinct element has exactly 2^-r probability of being in memory. 2^r multiplied by the number of elements in memory gives a good estimate of number of distinct elements.

They criticized previous approaches as relying on hash functions. A simplified version of the approach is as follows:

1. Make a hash function that maps distinct elements to different uniform real numbers. So all As compute to 0.3, all Bs compute to 0.7, etc.

2. Use that hash function to transform all elements. The same element maps to the same real number, and different elements map to different real numbers.

3. Keep a set of N smallest distinct real numbers. This works by inserting numbers from the stream to the set. It's easy to see that adding the same element multiple times has exactly no effect; it's either too big to be in the set or the same as an existing number in the set.

4. This gives the Nth smallest real number in the set as K. The number of distinct elements is approximated as N/K.

This algorithm is remarkable, but not in that it's efficient or new. Both algorithms using hash functions and algorithms without hash functions have been around and are optimal. I assume this algorithm is the first optimal algorithm without hash functions that is very easy to understand, even by undergraduates.
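
For concreteness, a tiny sketch of that hash-based (k-minimum-values) scheme, with hashlib standing in for a proper uniform hash (my own simplification, not the exact construction from any particular paper):

    import hashlib

    def kmv_estimate(stream, n=256):
        # Hash every element to a pseudo-uniform number in [0, 1), keep the n
        # smallest distinct values; if the n-th smallest is K, estimate n / K.
        def to_unit(x):
            digest = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
            return int.from_bytes(digest, "big") / 2**64

        smallest = set()
        for x in stream:
            h = to_unit(x)
            if h in smallest or (len(smallest) >= n and h >= max(smallest)):
                continue
            smallest.add(h)
            if len(smallest) > n:
                smallest.remove(max(smallest))
        if len(smallest) < n:
            return len(smallest)   # fewer than n distinct values seen: exact
        return int(n / max(smallest))

    # e.g. kmv_estimate((i % 50_000 for i in range(1_000_000))) comes out roughly near 50,000.

The repeated max() scans are just to keep the sketch short; a heap or sorted container is the obvious optimization.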


I had the same problem with the same paragraph and still don’t quite get it.

Unfortunately I struggle to follow the detailed explanation you gave… since you seem to understand it… can you confirm that they really do mean to throw away the word in the list they just found?

Eg

ACT I SCENE Elsinore A platform before the castle FRANCISCO at his post. Enter to him

If the buffer max is 16, am I supposed to randomly halve it first?

Act Scene Elisnore Platform Before The His Post

Now what? The next words are “Bernado Bernado who’s there”


Yes, you're supposed to throw it away.

The key insight is that words should appear in your scratch space with equal probability, no matter how often they appear in the source text. If you have a scratch space of size one, then the sequence of "apple" x 16 + "banana" x 1 should have equal chances of the scratch space containing [apple] or [banana] at the end of the sequence, at least averaged over all 17 permutations.

One way to achieve this result is to make the scratch space memory-free. Rather than think of it as "remove the word with probability x", think of it as "always remove the word, then re-add it with probability (1-x)".


Aaaah, remove and then re-add (maybe) makes a bit more sense...

So if it is not in the list you just add it, right? Actually is that right? Won’t the list fill up to the max again at some point like this?

So if so, I add Bernardo. Now the very next word is Bernardo so I remove the last Bernardo and maybe re-add it based on a 50% chance.


The way I'd describe it is:

- You have a buffer. You initially try to fit every unique word on there.

- If the buffer gets full, you know that you can't fit all the unique words in there. So you decide to keep only a fraction, _p1_, of them. Run through the buffer, keep each value with probability _p1_.

- Keep adding new words, again only with probability _p1_.

- If the buffer gets full again, _p1_ was too large, so you choose a lower fraction, _p2_, retain existing elements only with probability _p2_/_p1_, and continue adding new words with probability _p2_.

- Every time the buffer gets full, you lower the fraction of words you try to retain.

The choice of using _pn_ = (1/2)^n is just easy for a computer, it only needs entire random bits at each step.

What I _don't_ get is how this is correct for counting unique words. If I have a text of 16384 unique words, then I can accept that each will be in the final list with the same probability. But if I take that list and repeat the last word 30000 times, then it becomes overwhelmingly plausible that that word is in the final list, even though I haven't changed the number of unique words. Maybe there is some statistical property that evens things out, but I couldn't see it from the article.


And I got it. When the algorithm sees a word that is already in the list, it discards the word in the list first. Then it adds the word again with the same probability as any other word. This ensures that only the last occurrence of each word can occur in the final list, so the final occurrence of each word is in the final list with the same probability, and prior occurrences are always removed, if not earlier then when the next occurrence is seen.

If the input is known to be large, there is no reason to start by adding every element. It can treat the first round like any other, and start out with a _p0_ that is smaller than 1.


Now we can finally count how many unique UUIDs there really are! My memory was never enough for this task before.


nice read!


Knuth talks about this paper here - "The CVM Algorithm for Estimating Distinct Elements in Streams" - https://cs.stanford.edu/~knuth/papers/cvm-note.pdf


I have questions for Quanta Magazine -- this is probably a shot in the dark, but if anyone can ask them a few questions, here would be three of mine:

1. Have independent researchers evaluated the effectiveness of Quanta Magazine at their mission? [1]

2. In the case of this article, did the editors consider including visualization? If not, why not?

3. Does Quanta have interactive designers and/or data scientists on staff? How many? Why not more?

## Background & Biases

I'm trying to overcome my disappointment in Quanta by being curious about their mission, organization, and constraints. I am glad they exist, but sometimes your "closest allies" can be one's harshest critics.

I'm greatly troubled by the level of scientific, mathematical, and rational understanding among the denizens of the world. But for some reason, the state of science writing bothers me more. It would seem that I hold out hope that science writers could do better. This may be unfair, I admit.

Anyhow, rather than just bash Quanta for being, say, not as good as the best math blogs or YouTube channels (such as 3Blue1Brown), I really want to figure out (1) if I'm missing something; or (2) if they are actively trying to improve; and (3) what we can all learn from their experience.

[1] From https://www.quantamagazine.org/about/ : "Quanta Magazine is an editorially independent online publication launched by the Simons Foundation in 2012 to enhance public understanding of science. Why Quanta? Albert Einstein called photons “quanta of light.” Our goal is to “illuminate science.”"


This is not a complaint. For people that downvoted, may I ask why? To be clear, I'm not asking people to _explain_ the downvotes; I am only asking the people that actually downvoted.

Here are some guesses:

- Are my questions too harsh or unfair? Off-topic?

- Do you not like that I openly admit my priors?

- You hold it against me that I don't think science journalism is very good?

- Do you think Quanta is of high quality?

- ... unique? Not worth challenging?

- ... doing all it can?

- As a point of comparison, have you looked at how many IX people and data scientists other news organizations have?

- Am I expecting too much?

I'm genuinely curious. Persuade me that I'm missing something. I'm open to it.

For what it is worth, I've attended many events associated with the Simons Institute; I've never found an organization like it. But I don't hold any organization up as above criticism. I have a feeling that Simons could be doing much more with Quanta.

To me, the downvotes without comments (above) are likely informative, but not for the reasons you might think. They suggest people want to express disapproval, but think a response is not warranted or would take too much time. But why?

My hunch is that I've struck a nerve. When people are uncomfortable, they often downvote. Sacred cows. Confirmation bias. Both are strong, even when you know that these things are cognitive errors. All in all, the downvotes may well suggest that people have huge irrational responses and/or blind spots around this.


> Imagine that you’re sent to a pristine rainforest to carry out a wildlife census.

alt-f4


CS guys always wanting to throw away a good system and start from scratch.


Which good system are you referring to?


Which good system are we talking about ?


Tell me you didn't read the article without telling me you didn't read the article.

...but actually, I think you didn't even read the headline...?


Yeah, if only we have stuck with those stone wheels from 20,000 B.C.


Well, they are easier to work on.



