Computer scientists invent an efficient new way to count (quantamagazine.org)
993 points by jasondavies 8 months ago | 287 comments



I was involved with implementing the DNF volume counting version of this with the authors. You can see my blog post about it here:

https://www.msoos.org/2023/09/pepin-our-probabilistic-approx...

And the code here: https://github.com/meelgroup/pepin

Often, 30% of the time is spent in the I/O of reading the file; that's how incredibly fast this algorithm is. Crazy stuff.

BTW, Knuth contributed to the algo; Knuth's notes: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

He actually took time off (a whole month) from TAOCP to do this. Also, he is exactly as crazy good as you'd imagine. Just mind-blowing.


That’s really interesting and thanks for sharing.

I am very curious about the extraordinarily gifted. What made you think Knuth is crazy good? Was there a particular moment? Was it how fast he grokked ideas? Was it his ability to ELI5?


What made me realize it is that I saw some snippets of emails he wrote to a colleague. It was... insane. You could see his mind race. He recognized patterns in minutes that would take me days, if not weeks, to recognize. Also, he actually writes and runs code, overnight if need be. It was a bit of a shock to me. He's not in an ivory tower. He's very much hands-on, and when he's behind the wheel, you're in for a ride.


> He recognized patterns in minutes that would take me days, if not weeks, to recognize... he actually writes and runs code, overnight if need be

70-80 years of actually being hands-on and I bet you'd be pretty quick too. The dude is definitely naturally "gifted", but it seems pretty obvious that being hands-on has a lot to do with it.


Disagree, there are thousands of highly experienced and hard-working computer scientists. If we grant that very few of them are the equivalent of Knuth, there must be something else at play.


Most computer scientists I know at that age don't touch a computer any more and hang out with their grandkids. That's not a value judgement - Knuth is impressive, but as human beings most people choose their humanity over their careers in some way. Beyond being simply smart and productive, Knuth is also likely obsessive about his work, and his life is warped around it. As long as that works for everyone, that's great. But most people don't live that life.


> Most computer scientists I know at that age don’t touch a computer any more and hang with grand kids.

That doesn't sound right to me at all. Modern academia is highly competitive, and career academics typically have long working hours.

If they aren't doing programming, that's likely because it isn't relevant to their job. A theoretical computer scientist is closer to a mathematician than to a typical programmer.

> Knuth is impressive but as a human being most people choose their humanity over their careers in some way

We're talking about scientists, not most people.

Getting a PhD is no cakewalk, and there are far more PhDs than faculty positions. You can try being a workaholic, but if your competitors are doing the same, that won't make you stand out.

> obsessive about his work and his life is warped around it

Again this describes every modern scientist. Deep knowledge of one's field, and deep commitment to it, are just table stakes.


This doesn't describe post-tenure academia. You seem to be describing the life of a young tenure-track academic.

Additionally, while Knuth is clearly an outlier by any measure, he's also an outlier in his celebrity. There are a lot of Knuths out there who aren't well known outside their specialty, or who are in industry. He played a seminal role in a field everyone studies in computer science and published a uniquely interesting and continuously revised set of fundamental books in the field. However, in my time in academia there were people in, say, transactional memory for speculative out-of-order compute whose work powers every machine in use today, and they still contribute similarly powerful work. They're obsessive and very driven by the problem space. But for every one of those in academia there are a hundred tenured professors who paper-mill their undergrads (generously).

You mention long hours but I said obsessive. That’s orders of magnitude more than working hard. It’s so distorted as to be pathological if they weren’t paid and rewarded for it. Yes many academics are pathologically obsessive. But unless they are bringing in funding or repute to fill a deficit in the department, there’s no work for them in current academic settings.

Finally, Knuth isn't a common occurrence because -he doesn't bring in money-. Modern academia is oriented towards grant milking. The example of the txn memory guy is interesting because he brings in lots of research funding from Intel and ARM and Nvidia because his work is very commercial. Knuth - not so much, I imagine. He brings repute, but you can only find so much repute with modern academic funding models before they're a net negative on the department. Knuth is a fossil of a different era in academics (not used as a pejorative).


> That doesn't sound right to me at all. Modern academia is highly competitive, and career academics typically have long working hours. > If they aren't doing programming, that's likely because it isn't relevant to their job. A theoretical computer scientist is closer to a mathematician than to a typical programmer.

This kind of stuff is not useful to be posting where impressionable people (young students) can read it. The truth is that the majority of academics are managers and delegate hands-on work to postdocs and PhD students. I finished a PhD just last month, and in 4 years I never saw anyone on my committee so much as look at code, let alone write it (and I was not a theory student). Almost everyone in my cohort would echo this observation.


Experience and age have diminishing returns.

Biden's been hands-on in his domain for over 50 years, yet "quick" is definitely not the word that comes to most people's mind when they think of him nowadays.


Actually quick is definitely something that comes to mind. Quick in politics is of course relative, but the speed with which he has enacted major changes (for example marijuana legalization) is pretty quick in the realm of politics when congress is of the other party.


He’s barely able to read a teleprompter, not too confident that Biden himself enacted those changes.


We've had nearly 4 years with no scandals and emerged from the pandemic with the best economic recovery of any country, and despite having no margin to spare in Congress, master legislator Joe Biden has secured massive climate change, infrastructure, and gun control bills, not to mention he's ended our two decade war in Afghanistan and overseen the fastest wage growth of the two lowest income quintiles seen in modern history.

And every time people actually watch him speak (not just a selected clip), there's weeks of coverage about how alive JB seems, not recognizing that all evidence points to that being typical.


> We've had nearly 4 years with no scandals

This isn't the flex you think it is. When the media is lapping out of your hand like a 6-week-old puppy instead of doing their fourth estate job, of course there are no scandals.

> and gun control bills,

You mean stripping Americans of their constitutional rights.

> not to mention he's ended our two decade war in Afghanistan.

Which was an unmitigated disaster.

> and overseen the fastest wage growth of the two lowest income quintiles seen in modern history.

Hello inflation.


>> and gun control bills

> You mean stripping Americans of their constitutional rights.

Not that I disagree with you, but when posters like modriano engage in political/partisan commentary on HN, I find it more productive to merely downvote and flag their comments rather than replying and getting engaged in a war.

A dead post makes quite the impression, as this sort of political commentary just generally defeats the quality of discourse on HN (which, you must admit, is much better than many other platforms, and I'd like to try to preserve it as long as possible).


> Biden's been hands-on in his domain for over 50 years, yet "quick" is definitely not the word that comes to most people's mind when they think of him nowadays.

Please don't post flamebaity political tangents on HN.

> Eschew flamebait. Avoid generic tangents.

https://news.ycombinator.com/newsguidelines.html


It’s a counterpoint anyone can identify with. One can interpret it uncharitably as flame bait if one wants to, but it need not be. It could have been Reagan in his second term, but some may not know who he was. Or Lee Smolin.


This is objectively false. It is not a counterpoint, because it's not an argument. It's an extremely subjective claim that is highly contentious (like jedberg's sibling comment[1]), definitely not something that "anyone can identify with" (as the vast majority of people do not know Joe Biden and instead view him through one of a small number of extremely skewed lenses) and clearly in the realm of "off-topic flamebait" that is not appropriate on HN.

[1] https://news.ycombinator.com/item?id=40394477


I'd avoid the Reagan or other comparisons as well, even where there is medical evidence of decline, as you see with Reagan. In this specific case of Biden there's not even that, so it's purely a political opinion, and it's definitely bait for flames even if not intended as such.


> Also, he actually writes and runs code, overnight if need be...He's not in an ivory tower. He's very much hands on, and when he's behind the wheel, you're in for a ride.

That's such an admirable thing. Something to aspire to, given his age. I wonder if one can say the same thing of all the 'thought leaders' of the industry who go around pontificating about code hygiene and tidiness.


I feel extremely jealous of you.


You are envious of him.

Jealousy is when you possess something you don't want taken away by someone else.


Well, that use of jealousy is only really common in certain romantic situations. Like if some super good looking dude hits on your girl and she responds in an ambiguously-flirty way, you might definitely be said to be jealous, even though she didn't run off with him.

In most other domains, though, like this one, jealousy and envy are synonyms. https://www.merriam-webster.com/dictionary/jealousy#did-you-...


I read that link as supporting the distinction, not refuting it: "It is difficult to make the case, based on the evidence of usage that we have, [that they are] exact synonyms [or] totally different words."

An envious person would be happier having something that someone else also has. A jealous person feels threatened that someone else has or wants something. This distinction applies in romantic and non-romantic contexts. Both emotions can arise, for example, when someone observes someone else wanting something neither of them has, or when the thing wanted is inherently exclusive.

I don't lose a lot of sleep worrying about others' use or misuse of these words. But I do think it's essential for people to understand which of the two emotions they're having, because the solution really depends on that. Would I be happy if I destroyed that other kid's toy? (That's jealousy.) Would I be happy if my dad showed up and gave me a similar toy, so that the other kid and I could play together? (That's envy.)


This looks dumb... very dumb. Am I missing something? This is not counting, it's just sampling, AND if you want to actually count all the distinct words, the memory used doesn't change compared to just counting.


Maybe you'd know, but why would one choose to not sort favoring larger counts and drop the bottom half when full? It may be obvious to others, but I'd be curious.


The guarantees would not hold, I'm pretty sure ;) Maybe one of the authors could chip in, but my hunch is that with that you could actually introduce arbitrarily large errors. The beauty of this algorithm really is its simplicity. Of course, simple is.. not always easy. This absolute masterpiece by Knuth should demonstrate this quite well:

https://www.sciencedirect.com/science/article/pii/0022000078...

It's an absolutely trivial algorithm. Its average-case analysis is ridiculously hard. Hence why I think this whole Ordo obsession needs to be refined -- worst-case complexity often has little to do with real-world behavior.


Worst case complexity matters when the input data can be manipulated by someone malicious, who can then intentionally engineer the degenerate worst case to happen - as we have seen historically in e.g. denial of service attacks exploiting common hash table implementations with bad worst case complexity.


No, you're throwing away a random 50/50 selection. You would have to flood the algorithm with uniques or commons to force the algorithm into a known probability state.


You want every distinct item to have the same chance at the end. So when items repeat you need to reduce (not increase) the odds of keeping any given occurrence.


does that mean you could also split the set in half multiple times then run it on each half of a half (etc) and combine it with its other half?

that would seem simpler to me.

edit: oh but then you would need to keep the results which defeats the purpose


You would need to assume a uniform distribution of items, which I don’t think this does


Let's prove it by contradiction: let's say you pick the larger ones and drop the smaller ones every single round. You have lost the probabilistic guarantee of 1/2^k that the authors show, because the most frequent words will be the most frequent in subsequent rounds as well. This is the intuition; the math might be more illuminating.


What are the main applications of this?


So now we have you to blame for a delay on the release of his next book. :)


Not me, the authors -- I'm a fly on the wall compared to them ;) This is some serious work, I just did a fast implementation. Re implementation -- it turns out that there are parts of some standard libraries that this problem pushes to their limits, and that we had to work around during implementation. So there were still some cool challenges involved. I was also pretty happy about the late binding/lazy evaluation thing I came up with. Of course Knuth just did it (check his notes), without even thinking about it :D What is an achievement for me is a lazy Monday coffee for him, but oh well!


I agree with zero_k on everything he said about Knuth and strongly disagree with his own (extremely modest) characterization of himself.


I am pretty confident that this was not the only problem/algorithm that "distracted" Knuth over the years. You can see that whenever he encounters interesting issues he has no problem pausing work on TAOCP to pursue other goals.


This algorithm seems to resemble HyperLogLog (and all its variants), which is also cited in the research paper. It uses the same insight -- that there is estimation value in tracking whether we've hit a "run" of heads or tails -- but flips the idea on its head (heh), leading to the simpler algorithm described, which discards memorized values on the basis of runs of heads/tails.

This also works especially well (that is, efficiently) in the streaming case, allowing you to keep something resembling a "counter" for the distinct elements, albeit with an error rate.

The benefit of HyperLogLog is that it behaves similarly to a hash set in some respects -- you can add items, count distinct them, and, importantly, merge two HLLs together (union), all the while keeping memory fixed to mere kilobytes even for billion-item sets. In distributed data stores, this is the trick behind Elasticsearch/OpenSearch cardinality agg, as well as behind Redis/Redict with its PFADD/PFMERGE/PFCOUNT.

I am not exactly sure how this CVM algorithm compares to HLL, but they got Knuth to review it, and they claim an undergrad can implement it easily, so it must be pretty good!


It’s also possible to use HLL to estimate the cardinality of joins since it’s possible to estimate both the union and the intersection of two HLLs.

http://oertl.github.io/hyperloglog-sketch-estimation-paper/
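
The inclusion-exclusion step is tiny in code. A sketch, where `hll_count` and `hll_union` are hypothetical stand-ins for whatever HLL implementation you use (e.g. the machinery behind PFCOUNT/PFMERGE):

    def estimate_intersection(hll_a, hll_b, hll_count, hll_union):
        card_a = hll_count(hll_a)                          # estimate of |A|
        card_b = hll_count(hll_b)                          # estimate of |B|
        card_union = hll_count(hll_union(hll_a, hll_b))    # estimate of |A u B|
        # |A n B| = |A| + |B| - |A u B|; the errors of the three estimates
        # compound, which is why this gets unreliable when the two sets have
        # very different sizes.
        return card_a + card_b - card_union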


It's a really interesting open problem to get the cost of these down so that they can be used to heuristically select the variable order for worst case optimal joins during evaluation.

It's somewhere on the back of my todo list, and I have the hunch that it would enable instance optimal join algorithms.

I've dubbed these the Atreides Family of Joins:

  - Jessica's Join: The cost of each variable is based on the smallest number of rows that might be proposed for that variable by each joined relation.
  - Paul's Join: The cost of each variable is based on the smallest number of distinct values that will actually be proposed for that variable from each joined relation.
  - Leto's Join: The cost of each variable is based on the actual size of the intersection.
In a sense each of the variants can look further into the future.

I'm using the first and the second in a triplestore I built in Rust [1] and it's a lot faster than Oxigraph. But I suspect that the constant factors would make the third infeasible (yet).

1: https://github.com/triblespace/tribles-rust/blob/master/src/...


Having read something vaguely related recently [0] I believe "Lookahead Information Passing" is the common term for this general idea. That paper discusses the use of bloom filters (not HLL) in the context of typical binary join trees.

> Letos join

God-Emperor Join has a nice ring to it.

[0] "Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis" - https://www.vldb.org/pvldb/vol16/p2962-zhang.pdf


Thanks for the interesting paper!

  We now formally define our _God-Emperor Join_ henceforth denoted join_ge...
Nice work with TXDB btw, it's funny how much impact Clojure, Datomic and Datascript had outside their own ecosystem!

Let me return the favour with an interesting paper [1] that should be especially relevant to the columnar data layout of TXDB. I'm currently building a succinct on-disk format with it [2], but you might be able to simply add some auxiliary structures to your arrow columns instead.

1: https://aidanhogan.com/docs/ring-graph-wco.pdf

2: https://github.com/triblespace/tribles-rust/blob/archive/src...


> Nice work with TXDB btw

It's X.T. (as in 'Cross-Time' / https://xtdb.com), but thank you! :)

> 1: https://aidanhogan.com/docs/ring-graph-wco.pdf

Oh nice, I recall skimming this team's precursor paper "Worst-Case Optimal Graph Joins in Almost No Space" (2021) - seems like they've done a lot more work since though, so definitely looking forward to reading it:

> The conference version presented the ring in terms of the Burrows–Wheeler transform. We present a new formulation of the ring in terms of stable sorting on column databases, which we hope will be more accessible to a broader audience not familiar with text indexing


My apologies! It's even more embarrassing given the fact that I looked up the website, knowing that I _always_ swap them after having written too many `tx-listen` in my career.

They expanded their work to wider relations and made the whole framework a lot more penetrable. I think they over-complicate things a bit with their forward-extension, so I'm keeping every column twice (still much better than all permutations), which in turn allows for ad-hoc cardinality estimation.

Also, the 1-based indexing with inclusive ranges is really not doing them any favours. Most formulas become much more streamlined and simpler with 0-based indexing and exclusive ranges. (see `base_range` and `restrict_range` in [0])

0: https://github.com/triblespace/tribles-rust/blob/e3ad6f21cdc...


Iirc intersection requires the HLLs to have similar cardinality, otherwise the result is way off.


You could merge these data structures as well. If the two instances to be merged are not at the same "round", take the one that's at an earlier round and advance it (by discarding half the entries at random) by the difference in rounds. Then just insert the values from one list to the other, ignoring duplicates; if the result is too large, discard half at random and increment the round number.

I implemented exactly this algorithm at my previous employer, except that alongside each value, we stored an estimate of the number of times that value appeared. This allowed us to generate an approximate list of the most common values and estimated count for each value.
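
Roughly, in code, treating each instance as a (round, set) pair (a sketch of the merge as described; see the reply below for why this simple union tends to overestimate):

    import random

    def merge(round_a, set_a, round_b, set_b, thresh):
        # Advance the earlier-round instance to the later round by discarding
        # half of its entries at random, once per round of difference.
        if round_a < round_b:
            round_a, set_a, round_b, set_b = round_b, set_b, round_a, set_a
        while round_b < round_a:
            set_b = {x for x in set_b if random.random() < 0.5}
            round_b += 1
        merged = set_a | set_b                  # duplicates collapse in the union
        if len(merged) >= thresh:               # too large: halve and bump the round
            merged = {x for x in merged if random.random() < 0.5}
            round_a += 1
        return round_a, merged                  # estimate: len(merged) * 2**round_a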


Merging like that doesn't work -- it will tend to overestimate the number of distinct elements.

This is fairly easy to see, if you consider a stream with some N distinct elements, with the same elements in both the first and second halves of the stream. Then, supposing that p is 0.5, the first instance will result in a set with about N/2 of the elements, and the second instance will also. But they won't be the same set; on average their overlap will be about N/4. So when you combine them, you will have about 3N/4 elements in the resulting set, but with p still 0.5, so you will estimate 3N/2 instead of N for the final answer.

I have a thought about how to fix this, but the error bounds end up very large, so I don't know that it's viable.


Was there, reviewed the PR, can confirm. Hi Steve!

Since then we've also tuned it up in a couple ways, in particular adding "skip" logic similar to fast reservoir sampling to trade some accuracy for the ability to not even look at the next N {M,G,T}B if you've already seen many many many matches. For non-selective searches over PB of data it's a good tradeoff, despite introducing some search-order bias.


Just curious, dusting off my distant school memories :)

How do the HLL and CVM that I hear about relate to reservoir sampling which I remember learning?

I once had a job at a hospital (back when 'whiz kids' were being hired by pretty much every business) where I used reservoir sampling to make small subsets of records that were stored on DAT tapes.


I guess there is a connection in the sense that with reservoir sampling, each sample observed has an equal chance of remaining when you're done. However, if you have duplicates in your samples, traditional algorithms for reservoir sampling do not do anything special with duplicates. So you can end up with duplicates in your output with some probability.
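
For reference, the classic version (Algorithm R) looks like this; note that nothing in it deduplicates:

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k items from a stream of unknown length.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Item i survives with probability k / (i + 1); if it does,
                # it evicts a uniformly chosen current occupant.
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir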

I guess maybe it's more interesting to look at the other way. How is the set of samples you're left with at the end of CVM related to the set of samples you get with reservoir sampling?


Was wondering the same, thanks for an answer.


Knuth's presentation of this [1] seems very very similar to the heap-version (top-k on a uniform deviate) of reservoir sampling as mentioned in [2]. The difference is in how duplicates are handled. I wouldn't be surprised if this algorithm was in fact already in use somewhere!

[1] https://cs.stanford.edu/~knuth/papers/cvm-note.pdf [2] https://florian.github.io/reservoir-sampling/

Edit: Another commenter [3] brought up the BJKST algorithm which seems to be similar procedure except using a suitably "uniform" hash function (pairwise independence) as the deviate instead of a random number.

[3] https://news.ycombinator.com/item?id=40389178


I found the paper took about as long to read as the blog post and is more informative:

https://arxiv.org/pdf/2301.10191

It is about estimating the cardinality of a set of elements derived from a stream. The algorithm is so simple, you can code it and play with it whilst you read the paper.
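
Here is a minimal sketch of what I mean (close to Algorithm 1 in the paper, though it leaves out the paper's failure case where a halving round frees no space):

    import random

    def estimate_distinct(stream, thresh):
        # Estimate the number of distinct items using at most `thresh` memory.
        p = 1.0
        kept = set()
        for x in stream:
            kept.discard(x)              # forget any earlier occurrence of x
            if random.random() < p:      # keep this occurrence with probability p
                kept.add(x)
            if len(kept) == thresh:      # memory full: halve the sample
                kept = {y for y in kept if random.random() < 0.5}
                p /= 2
        return len(kept) / p             # each kept item stands for ~1/p distinct items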

The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.


If you refer to the subtitle of the paper - An Algorithm for the (Text) Book - I think that is actually a reference to something Paul Erdős allegedly said about how some proofs are so elegant in their simplicity and beauty that they are "from The Book", as if representing some divine Platonic ideal.

Given that Knuth himself reviewed it, he might have remarked that this was one of those algorithms! Perhaps the authors decided to include it in the title as a not-so-humble brag (which would be well-earned if that's the case!)

edit: originally this comment said Knuth was the one who said this about some algorithms being from The Book, but that was my faulty memory.


I liked this part. They got Knuth to review it, and found mistakes. That's kind of cool, in its own way.

    We are deeply grateful to Donald E. Knuth for his thorough review, 
    which not only enhanced the quality of this paper (including fixing
    several errors) but has also inspired us for higher standards.


From the abstract: "... All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory ...."


This is really twisting it; I don't find pairwise independence to be more advanced than applying a Chernoff bound. In this problem it'd mostly be the difference between using a Chebyshev-type bound or a Chernoff bound.

Pairwise independence is there to give the algorithm stronger guarantees and let it work with a simple hash function like ax+b; otherwise most existing algorithms can probably be simplified in the same way. The most similar algorithm is BJKST, which is almost identical except for specifying the sampling mechanism to require less randomness.

To someone who worked on this type of thing before, it's like seeing people familiar with LLMs say "oh yet another X-billion parameter model on github doing more or less the same".


The Chernoff bound needed in this work can be derived from the Binomial distribution (with Stirling's approximation).

I have worked on pairwise independent hash functions for a decade and every time I introduce such a function, it feels like magic. The notion of pairwise independence is easy to explain but the notion of pairwise independent hash functions isn't.

The other strength of our work is that it can work for general settings of sets for which pairwise independent hash functions are not known. Please see: https://dl.acm.org/doi/10.1145/3452021.3458333


The point is that the subtitle is pretty clearly intended to have a dual meaning; it wouldn't be phrased like that otherwise.


I thought The Book was an Erdos thing. I wonder who used it first.


"During a lecture in 1985, Erdős said, `You don't have to believe in God, but you should believe in The Book.`"

https://en.wikipedia.org/wiki/Proofs_from_THE_BOOK


I think you're right, I must have confused the two. I'll edit my comment to reduce the spread of misinformation.


The blog post was more than half padding. Good that the algorithm is so simple it's hard to write a full length blog post about it!


And yet the blog post still got it wrong:

> Now you move forward with what the team calls Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

It's not just removals you test with N coin flips in Round N, it's whether to insert the new item at all.


I originally used Project Gutenberg to get Hamlet and coded the Quanta method in Python, and it did not work. I then moved to Algorithm 1 in the paper and got Copilot to (mis)convert it to Python, and then spent time getting Copilot to admit its mistakes. The resultant code seemed to work, but I found the Quanta-suggested data of the words of Hamlet uninspiring, as the calculated theta (max set size before halving) was often anywhere from ~50% of the total number of words in Hamlet to more than the number of words in Hamlet. I've yet to investigate theta in more depth...


Yeah, I noticed the same thing. Quanta's version of the algorithm is not only confusing, it's also wrong.

I think the pseudocode in the paper is very hard to beat.


I agree the paper is better than the blog post, although one criticism I have of the CVM paper is that it has a termination/algorithm-exit condition, instead of what Knuth's CVM notes (referenced elsewhere in this thread) do, which is just a loop to ensure the halving step actually frees space in the reservoir. It seems more work to explain the https://en.wikipedia.org/wiki/Up_tack than just do the loop. [1]

[1] https://news.ycombinator.com/item?id=40388878


You are indeed right; the while loop has the added advantage of making the estimator unbiased -- i.e., not only strong (epsilon, delta) guarantees but also an expectation equal to the correct value.

It wasn't easy to see that the loop would have added benefit -- that's where Knuth's ingenuity comes in.


On that note, I'm also unfamiliar with this \ operator notation which is used without explanation.

    X ← X \ {ai}


That is conventional set subtraction notation. "Assign to X the same set minus all elements of the set {a_i}".

One example source, but it is pretty common in general: http://www.mathwords.com/s/set_subtraction.htm


Set difference.

Set X becomes X without element ai. This is the case whether or not ai was in the set X before the step was taken.
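
In Python terms, for illustration:

    X = {"to", "be", "or", "not"}
    X = X - {"be"}        # X \ {a_i}: now {"to", "or", "not"}
    X = X - {"question"}  # subtracting an element that isn't there is a no-op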


I've known that symbol for decades, never knew its name - up-tack it is. Ta!


It's been a while, and maybe my brain has smoothened since my time in CS, but man this looks more confusing than it needs to be.

First, the contradiction thing. It's just.. An error/panic, why? Anyway, fine.

Then, there's the whole premise of 1..m: I'm still not sure if the size needs to be known upfront or not. Looking at it a bit more, it seems like no, you don't. You pick a threshold, and then depending on the size of the stream the probability changes. But the algorithm is described as if it had a single output, which is not the case(?).

And btw, I didn't know about the Chernoff bounds and delta/epsilon are not explained at all in the paper, which added to the confusion a lot.

Anyway, here's my take in Golang: https://github.com/betamos/distinct

I factored out the threshold parts into a helper instead, which makes a lot more sense than accidentally allocating too much memory.

Perhaps there should be a method for estimating the confidence/error rate. Nobody knows the size of a stream upfront, so it would make more sense to update these values as you go. Brain is not strong enough to implement it, but feel free to send a PR.


[I am one of the authors].

We have a follow-up work (admittedly, more technical) that can remove reliance on m completely: https://www.cs.toronto.edu/~meel/Papers/pods22.pdf

But yes, our theorems can be reworked to estimate the confidence/error rate; that's what Knuth did: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf


Didn't realize you were here, so let me be clear that I did overall find the paper so approachable that I could implement it with only a couple of outside pointers (also a little clever impl optimization around storing p, if you're curious). The above should be read more as "even this well-written, simplified paper is not necessarily trivial to understand for practitioners". So it's more of a general point about academic obscurity.

> But yes, our theorems can be reworked to estimate the confidence/error rate

I think that’s useful for practical implications. Also, for practical use, how does one decide the tradeoff between delta and epsilon? Perhaps it’s covered elsewhere, but I have a hard time intuiting their relationship.


I fully agree with you and this is indeed one of my criticisms of modern academic writing -- we tend to write papers that are just very hard for anyone to read.

So delta refers to the confidence, i.e., how often are you willing to be wrong, and epsilon is tolerance with respect to the actual count.

We have found that, in general, setting delta=0.1 and epsilon=0.8 works fine for most practical applications.


> First, the contradiction thing. It's just.. An error/panic, why? Anyway, fine.

I suspect due to the details of the proof. This condition looks likely to me for very small set cardinality, making the algorithm inappropriate for all-weather use. See page 3 of the paper where Algorithm 2 is introduced. They show that in the failure condition, the likelihood of the algorithm returning a value outside of the epsilon bounds is higher.

> Then, there's the whole premise of 1..m: I'm still not sure if the size needs to be known upfront or not.

m sizes the threshold, if it is too small, the error bounds guaranteed by the algorithm will be larger than expected, and vice versa if m is too large.

> And btw, I didn't know about the Chernoff bounds and delta/epsilon are not explained at all in the paper, which added to the confusion a lot.

Papers typically do not spend words on basics and those are standard concepts.

> Perhaps there should be a method for estimating the confidence/error rate.

You can't resize the stream easily because p implicitly depends on m via the thresh cardinality condition and if you were to change m then your p updates would not correspond. As a result you may not be able to rely on the error bounds. Note though that stream doesn't mean infinite or very large: take it to mean one item at a time. The point of this algorithm is to bound space complexity to something small. Have a look at thresh and note that log2(1e50) is just 166: if you did have a very large stream of indeterminate length you could also just pick a very large m.


> "The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks."

If you're saying it's just for "undergraduates and textbooks", as opposed to just being simple enough for them to use but not limited to them, would you mind explaining what makes it useful for undergrads but not for professionals?


My interpretation from the paper is that this algorithm is simpler than other options but also worse. So in a professional context you'd use one of those instead


From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."


That still only speaks to it being simple enough for students, not whether it's too simple for any other use vs. useful enough that students who learn it will spend the rest of their lives using it.

For example word processor software is commonly described as simple enough for children to use at school, that doesn't mean that word processor software is of no use to adults.


Simplicity in an algorithm has limited direct impact on its usefulness in industry where libraries are prevalent.

Consider mapping structures with Θ(k) expected lookup times. The simplest implementation is a hash table with chaining. Open-addressing is a bit more complicated, but also more common. Tries, which have O(k) worst-case lookup times are often covered in Undergraduate courses and are definitely easier to analyze and implement than forms of open-addressed hash-tables that have O(k) guarantees (e.g. Cuckoo hashing).


The reason it's too simple for most real-world use is that HyperLogLog is the "good" version of this technique (but it is harder to prove that it works).


>The authors are explicit about the target audience and purpose for the algorithm: undergraduates and textbooks.

Doesn't seem like it. Seems like an algorithm (similar to other approximate cardinality estimation algorithms) with huge applicability.


From the abstract: "All the current state-of-the-art algorithms are, however, beyond the reach of an undergraduate textbook owing to their reliance on the usage of notions such as pairwise independence and universal hash functions. We present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with the knowledge of basic probability theory."


That just says that this is also a simpler and more accessible algorithm, suitable even for undergraduate textbooks.

Not that this is just useful for textbooks, a mere academic toy example, which would be something else entirely.

This is both accessible AND efficient.



Given the topic of the paper[0], the footnote is especially charming:

> the authors decided to forgo the old convention of alphabetical ordering of authors in favor of a randomized ordering, denoted by r⃝. The publicly verifiable record of the randomization is available at https://www.aeaweb.org/journals/policies/random-author-order...

[0]: https://arxiv.org/pdf/2301.10191

edit: formatting


Is it me or is the description of the algo wrong?

    > Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word;
If I follow this description of "check if exists in list -> delete":

    if hash_set.contains(word) {
        if !keep_a_word(round) {
            hash_set.remove(word);
            continue;
        }
    } else {
        hash_set.insert(word.to_string());
    }
The algorithm runs for ~20 iterations:

    Total word count: 31955 | limit: 1000
    End Round: 20, word count: 737
    Unique word count: 7233
    Estimated unique word count: 772800512
But if I save the word first and then delete the same word:

    hash_set.insert(word.to_string());

    if !keep_a_word(round) {
        hash_set.remove(word);
        continue;
    }
It gets the correct answer:

    Total word count: 31955 | 1000
    End Round: 3, word count: 905
    Unique word count: 7233
    Estimated unique word count: 7240


I got the same problem.

When implementing the exact method as described in quanta magazine (without looking at the arxiv paper), I always had estimates like 461746372167462146216468796214962164.

Then after reading the arxiv paper, I got the correct estimate, with this code (very close to mudiadamz's comment solution):

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = set()
    for k in L:
        if k in mem:
            mem.remove(k)
        if np.random.rand() < p:
            mem.add(k)
        if len(mem) == thresh:
            mem = {m for m in mem if np.random.rand() < 0.5}
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Or equivalently:

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = []
    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Now I found the problem with the Quanta Magazine formulation. By reading:

> Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

we want to write:

    for k in L:
        if k not in mem:
            mem += [k]
        else:
            if np.random.rand() > p:
                mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
whereas it should be (correct):

    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:    # without the else
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
Just this little "else" made it wrong!


Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them]

Your fix is indeed correct; we may want to have a while loop instead of "if len(mem) == thresh", as there is a very small (but non-zero) probability that the length of mem is still thresh after executing: mem = [m for m in mem if np.random.rand() < 0.5]

["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]


Quanta:

    Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

To:

    Round 1. Keep going through Hamlet, but now flipping a coin for each word. If it's tails, delete the word if it exists; heads, and add the word if it's not already on the list.

Old edit:

    Round 1. Keep going through Hamlet, adding words but now flipping a coin immediately after adding it. If it’s tails, delete the word; heads, and the word stays on the list.


> adding words but now flipping a coin immediately after adding it

Edit: I thought your formulation was correct but not really:

We flip the coin after adding, but we also flip the coin even if we didn't add the word (because it was already there). This is subtle!

wrong:

    if k not in mem:
        mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
wrong:

    if k not in mem:
        mem += [k]
    else:
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if np.random.rand() > p:
        mem.remove(k)


The following is also not correct.

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)
Your final solution is indeed correct, and I think more elegant than what we had in our paper [I am one of the authors].


Ah, I'm using a set instead of a list, so I just always add and then do the coin-toss removal.


Was just now solving it and came to see if others had the same issue. Yep, you are right.

    function generateRandomNumbers(c, n) {
      let randomNumbers = new Array(c);
      for (let i = 0; i < randomNumbers.length; i++) {
        let randomNumber = Math.floor(Math.random() * (n + 1));
        randomNumbers[i] = randomNumber;
      }
      return randomNumbers;
    }
    function run(w, wS, m, r) {
        function round(r) {
            while(wS.size < m) {
                const next = w.next()
                if (next.done) return true;
                wS.add(next.value)
                prune(next.value, r)
            }
            return false
        }
        function prune(v,r) {
            for (let i = 0; i < r; i++) {
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(v)
                }
            }
        }
        function purge(wS) {
            const copy = new Set(wS)
            copy.forEach(ith=>{
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(ith)
                }
            })
        }
        const done = round(r);
        if (!done) {
            purge(wS)
            return run(w, wS, r+1,m)
        }
        console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
    }
    const memory = 1000
    const words = generateRandomNumbers(3000000,15000)
    const w = words[Symbol.iterator]() // create an iterator
    const wS = new Set();
    run(w,wS, memory,0);


Noticed an error:

    return run(w, wS, r+1,m)
Should be changed to:

    return run(w, wS, m, r+1)


Estimating the amount of unique elements in a set and counting the amount of unique elements in a set are very different things. Cool method, bad headline.


They’re not very different things; the terms are used interchangeably in most contexts because in the real world all counting methods have some nonzero error rate.

We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

I kind of think the mythology of the ‘countless stones’ (https://en.wikipedia.org/wiki/Countless_stones) is a sort of folk-reminder that you can never be too certain that you counted something right. Even something as big and solid and static as a standing stone.

The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.


> the terms are used interchangeably in most contexts

Counting and estimating are not used interchangeably in most contexts.

> because in the real world all counting methods have some nonzero error rate.

The possibility that the counting process may be defective does not make it an estimation.

> We talk about ‘counting votes’ in elections, for example, yet when things are close we perform ‘recounts’ which we fully expect can produce slightly different numbers than the original count.

We talk about counting votes in elections because votes are counted. The fact that the process isn't perfect is a defect; this does not make it estimation.

> That means that vote counting is actually vote estimating, and recounting is just estimating with a tighter error bound.

No. Exit polling is estimation. Vote counting is counting. Vote recounting is also counting, and does not necessarily impose a tighter error bound, nor necessarily derive a different number.

> The situations where counting is not estimating are limited to the mathematical, where you can assure yourself of exhaustively never missing any item or ever mistaking one thing’s identity for another’s.

So like, computers? Regardless, this is wrong. Estimating something and counting it are not the same thing. Estimation has uncertainty, counting may have error.

This is like saying addition estimates a sum because you might get it wrong. It's just not true.


So, IEEE floating point doesn’t support ‘addition’ then.


IEEE 754 defines an exact binary result for the addition of any two floats.

That this bit-identical result is not the same operation as addition of real numbers is irrelevant, because floats aren't reals.

f1 + f2 is not an estimation. Even treating it as an approximation will get you into trouble. It's not that either, it's a floating-point result, and algorithms making heavy use of floating point had better understand precisely what f1 + f2 is going to give you if they want to obtain maximum precision and accuracy.


Cool, so next time I have numbers that aren't reals to perform math on, I can use floats.


Or if you have numbers that aren't integers to perform math on, you can use integers.

It's not a new problem, and it isn't specific to floats. Computers do discrete math. Always have, always will.


Come on. There is a fundamental difference between trying to get an exact answer and not trying to get an exactly correct answer.


It’s not a fundamental difference, it’s a fundamental constraint.

There are circumstances - and in real life those circumstances are very common - where you must accept that getting an exactly correct answer is not realistic. Yet nonetheless you want to ‘count’ things anyway.

We still call procedures for counting things under those circumstances ‘counting’.

The constraints on this problem (insufficient memory to remember all the unique items you encounter) are one such situation where even computerized counting isn’t going to be exact.


I agree with you, but we are talking theory here. The algorithm doesn't count, it estimates.

You can make an algorithm that counts, you can make an algorithm that estimates, this is the second.


Estimation is counting with error bars.

Frankly, most of what you consider counting in your comment needs error bars - ask anyone who operated an all-cash cash-register how frequently end-of-day reconciliation didn't match the actual cash in the drawer (to the nearest dollar.)

The following is a list, from my personal experience, of presumably precisely countable things that turned out not to be: the number of computers owned by a fairly large regional business, the number of (virtual) servers operated by a moderately sized team, the number of batteries sold in a financial year by a battery company.


Counting is a subset of estimation, not a synonym.

If I estimated the number of quarters in a stack by weighing them, that would be different from estimating the number of quarters in a stack by counting them. Both methods of estimation have error bars.

The list you provide is of categories that don't have clear definitions. If you have a sufficiently clear definition for a category given your population, it has a precise count (though your counting methodologies will still be estimates.) If your definition is too fuzzy, then you don't actually have a countable set.


It's close enough to counting for the purposes of a magazine article like uuids are close enough to being unique for the purposes of programming.


The algorithm's accuracy scales with the ratio of memory to set size, so you don't actually know if it is "close enough" without an estimate of the set size.

I think the headline is clickbaity, and the article makes no effort to justify its misuse of the word 'counting'. The subheadline is far more accurate and doesn't use that many more words.


I think I get your point completely, yet I'm not getting through.

Would you agree that 1+1=2? Or that pi is 3.14159...? These are mathematical truths, but they quickly crumble in the real world. One apple plus one apple doesn't just equate to double the apple; no two apples are ever the same to begin with, and there are no real perfect circles out there either. There is still value to those mathematical truths in that they make it evident that they are perfectly precise, and that it is real-world interaction which may bring error to the table.


Counting and estimation are different by definition. One is a full enumeration, the other an extrapolation from sampled data. In both cases 'accuracy' is a factor. Even if we are counting the number of stars, it is still a different technique from estimating the number of stars.

I could try to count fibers in a muscle or grains of sand on the beach; chances are accuracy would be low. One can be smart about technique for more accurate counts, e.g. get 10M sand counters and give them each 1kg of sand, whose grains they then count with tweezers and a microscope. That is counting. At the same time, we could find an average count of grains in 1kg of sand from a random 100 of our counters, and then estimate what an expected total would be. The estimate can be used to confirm the accuracy of the counts.


They are not really as far apart as you think. At small numbers, yes, they are distinct. At large enough numbers, they are in all practicality the same thing. E.g., what's the population of the US?


And by that definition this is a counting algorithm.


True - for (relatively) small numbers. For large (huge) numbers estimation is usually considered to be equivalent to counting, and the result is sometimes represented using the "scientific" notation (i.e. "floating-point") rather than as an integer. For example, the mole is an integer whose value is only known approximately (and no one cares about the exact value anyway).


As of May 2019, the mole has an exact value, and Carbon-12's molar mass is the empirically-determined value.


This doesn't justify estimation to be equivalent to counting even if some mathematicians consider them to be the same. Floating points are for estimation. Integers are for counting. The two are not the same, not even for large numbers.


"Equivalent" and "the same" are sometimes equivalent. (Or the same.)

It depends on what the meaning of the word 'is' is.

https://libquotes.com/bill-clinton/quote/lby0h7o


It's an approximation, not an estimation.


Actually, my understanding is that it is an estimation because in the given context we don't know or cannot compute the true answer due to some kind of constraint (here memory or the size of |X|). An approximation is when we use a simplified or rounded version of an exact number that we actually know.


Wikipedia is on your side:

"In mathematics, approximation describes the process of finding estimates in the form of upper or lower bounds for a quantity that cannot readily be evaluated precisely"

This process doesn't use upper and lower bounds.

However, it still seems more like approximation than estimation to me because of this:

“Of course,” Variyam said, “if the [memory] is so big that it fits all the words, then we can get 100% accuracy.

It seems that in estimation the answer should be unknowable without additional information, whereas in this case it's just a matter of resolution or granularity because of the memory size.

Anyhoo ...

EDIT: also the paper says "estimate" and the article says both "approximate" and "estimate" at different times so it seems everyone except me thinks it's either an estimation or that estimation and approximation are interchangeable.


Still very different things, no?


It's the same thing at different degrees of accuracy. The goal is the same.


Still, counting things and counting unique things are two different procedures.


For someone who's pretty well-versed in English, but not a math-oriented computer scientist, this seems like a distinction without a difference. Please remedy my ignorance.


My GP was wrong, but the words are different.

Estimation is a procedure that generates an estimate, which is a kind of approximation, while an approximation is a result value. They are different "types", as a computer scientist would say. An approximation is any value that is justifiably considered to be nearly exact. ("prox" means "near". See also "proximate" and "proxy".)

Estimation is one way to generate an approximation. An estimate is a subtype of an approximation. There are non-estimation ways to generate an approximation. For example, take an exact value and round it to the nearest multiple of 100. That generates an approximation, but does not use estimation.


I’m not sure the linguistic differences here are as cut and dried as you would like them to be. Estimate and approximate are both verbs, so you can derive nouns from them both for the process of doing the thing, and for the thing that results from such a process.

Estimation is the process of estimating. It produces an estimate.

Approximation is the process of approximating. It produces an approximation.

You can also derive adjectives from the verbs as well.

An estimate is an estimated value.

An approximation is an approximate value.

But you're right that the 'approximate' terms make claims about the result - that it is in some way near to the correct value - while the 'estimate'-derived terms all make a claim about the process that produced the result (i.e. that it was based on data that is known to be incomplete, uncertain, or approximate).


The authors of the article disagree with you.


The authors of the paper disagree with me, the authors of the article don't (they use both approximate and estimate, but the paper does say estimate).


I don't know a word or phrase for this, but I really enjoy any examples of "thinking outside the box" like this because it's something I struggle with in my professional career. Learning not only the right ways to solve problems, but figuring out the questions to ask that make solving the problems you have easier or even in some cases possible. In this case, it's hey, we don't need exact numbers if we can define a probabilistic range given defined parameters. Other problems are gonna have other questions. I guess my hope is that if I see enough examples I'll be able to eventually internalize the thought process and apply it correctly.


To be fair, this was a university research team. Literally, a team of folks who can, all day everyday, iterate over a single topic using the Scientific Method.

If you were paid by a big company to sit at a whiteboard all day with a team of equally intelligent engineers, I'm sure you'd come up with SOMETHING that would look like an "outside the box" solution to the rest of the world.

However, most of us are paid to work the JIRA factory line instead, which limits the amount of time we can spend experimenting on just one single problem.


I think it's generally thought of as "lateral thinking", Edward de Bono has written a few books about it you might find interesting.


And some more commonplace words like "creativity" (as in "creative solution") etc. would apply.


any particular one you'd recommend?


I think the classic is "Lateral Thinking: A Textbook of Creativity"


>What if you’re Facebook, and you want to count the number of distinct users who log in each day, even if some of them log in from multiple devices and at multiple times?

Seems like a bad example of when this algorithm could be useful in practice.

If you already know you will want this info when designing the login process, it's simple: keep track of the date of last login for each account, and increment the unique user counter when the stored value is different from the current one.

And even if not, it should still be possible to "replay" a stream of login events from the database later to do this analysis. Unless maybe you already had years worth of data?
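
For illustration, a minimal sketch of the exact bookkeeping I have in mind (all names hypothetical):

    from datetime import date

    last_login = {}          # account_id -> date of most recent login
    daily_unique_logins = 0  # reset this at the start of each day

    def record_login(account_id, today=None):
        # Count the account only the first time it logs in on a given day.
        global daily_unique_logins
        today = today or date.today()
        if last_login.get(account_id) != today:
            daily_unique_logins += 1
        last_login[account_id] = today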


No, what you propose needs to "keep track of the date of last login for each account", so you need memory of the size of your user population. The idea is to perform it with much smaller, fixed amount of memory.


On the topic of counting things, I would like to mention this efficient and easily-implemented algorithm for finding the top-k items in a stream, which I think is perhaps not as well known as it should be:

A Simple Algorithm for Finding Frequent Elements in Streams and Bags

Karp, Shenker & Papadimitriou

https://www.cs.umd.edu/~samir/498/karp.pdf


> the top-k items in a stream

Hmm, this is phrased in a way that sounds different (to my ears) than the abstract, which says:

> it is often desirable to identify from a very long sequence of symbols (or tuples, or packets) coming from a large alphabet those symbols whose frequency is above a given threshold

Your description suggests finding a fixed number of k items, with the guarantee that they will be the top ones. The abstract sounds like it determines an a priori unknown number of items that meet the criterion of having a particular value greater than a given threshold.

So "find the 100 oldest users" vs "find all users older than 30".

Am I misunderstanding you or the abstract? (English is not my first language)


It's been a while since I used this in anger, but my recollection is that it maintains a fixed-size set of items. The same algorithm is in section 3.2 of https://erikdemaine.org/papers/NetworkStats_TR2002/paper.pdf and might be clearer there.
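
From memory, the fixed-size-set version looks roughly like this (a sketch, not checked against either paper, so treat the details with suspicion):

    def frequent_elements(stream, k):
        # Keep at most k-1 counters. Any element occurring more than
        # len(stream)/k times is guaranteed to survive in `counts`; false
        # positives are possible, so a second pass can confirm the counts.
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1
            elif len(counts) < k - 1:
                counts[x] = 1
            else:
                # No free counter: decrement everything and drop zeros.
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        return counts

    print(frequent_elements("abacabadabacaba", k=3))  # 'a' occurs more than 15/3 times, so it survives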


Thank you, that is much easier to follow indeed! Very elegant, I agree :)


Computer scientists invent a memory-efficient way to estimate the size of a subset


It seems fast too, as you can use fewer rounds of flips and get an estimate, meaning you may not need to go through the entire “book” to get an estimate of distinct words


The subset is very important, it is the subset of unique elements.


It's not a "subset," it's "the set of equivalence classes."


I think it's by definition that the unique elements of a set form a subset. Nonetheless, the clarification is a good point.

To take a stab at a yet more correct statement, maybe something like the following captures it: in the algorithm noted, we are extrapolating the number of unique elements by looking at additional subsets, which are themselves equivalence classes.


This is implied by the word "set" in "subset". Otherwise it would've been a multiset.


We are very grateful for the interest, and I thought I would link to some relevant resources.

Paper: https://arxiv.org/pdf/2301.10191

Knuth's note: https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

Talk Slides: https://www.cs.toronto.edu/~meel/Slides/meel-distinct.pdf

Talk Video: https://www.youtube.com/watch?v=K_ugk7OW0bI

The talk also discusses the general settings where our algorithm resolved the open problem of estimation of the union of high dimensional rectangles.


When do we stop calling this counting and start calling it estimation?


Seems this is one of those things like UUIDs where we rely on it being very unlikely to be wrong, because statistics.

> the accuracy of this technique scales with the size of the memory.

I wonder if that's proportional to the number of distinct items to count, though.

> if the [memory] is so big that it fits all the words, then we can get 100% accuracy

Yes, but then the algorithm isn't being used any more, that's just normal counting.

They counted the distinct words in Hamlet with a memory size of 100 words, about 2.5% of the number to find, and got a result that was off by 2. If you do the same with the whole of Shakespeare, again using 2.5% of the memory needed to hold all the distinct words, is the accuracy better?

Anyway, this is limited to counting, and doesn't help list what the words are, though quickly counting them first is perhaps a way to speed up the task of actually finding them?


When it is actually impossible to count something, and when the error between an estimate and an exact answer is not significant, the pedantic distinction is not helpful.

The same thing happens with measurement. No measurement is ever exact. If I said I was measuring a count, someone would probably correct me to say that I am counting.

Common speech is like that.


As soon as people start reading past the headline.


tbh, the title (and introduction) did a lot to dissuade me from finishing the (really good) article. It was actually informative; why dress it up as SEO blogspam?


Presumably so it is optimised for search engines and people find it.

Publishers generally have good data about where their audience comes from. They wouldn't do this if it wasn't the best way they know of to maximise readership.


I've been following Quanta for some time, I'm sure they don't care about SEO and number of visitors, they care about the quality of texts. They write for the general audience, and even though they try to preserve the scientific accuracy, sometimes their explanations may seem oversimplified and even a bit confusing when you come from the same field. It's not clickbait, it's their popular science style.


So they can get paid for their work? Are you giving them anything?


It's Quanta. Their mission is to make laypeople like math (not understand math), so they drown the math in sugar.


Yes, even the subtitle states "a simple algorithm for estimating large numbers of distinct objects".


This doesn't excuse the title though.


Python implementation:

  def streaming_algorithm(A, epsilon, delta):
      # Initialize parameters
      p = 1
      X = set()
      thresh = math.ceil((12 / epsilon ** 2) * math.log(8 * len(A) / delta))

      # Process the stream
      for ai in A:
          if ai in X:
              X.remove(ai)
          if random.random() < p:
              X.add(ai)
          if len(X) == thresh:
              X = {x for x in X if random.random() >= 0.5}
              p /= 2
              if len(X) == thresh:
                  return '⊥'

      return len(X) / p


  # Example usage
  A = [1, 2, 3, 1, 2, 3]
  epsilon = 0.1
  delta = 0.01

  output = streaming_algorithm(A, epsilon, delta)
  print(output)


I don't think there is a single variable name or comment in this entire code block that conveys any information. Name stuff well! Especially if you want random strangers to gaze upon your code in wonder.


The names are literally taken from the paper.


Well, the paper also contains the code, so I doubt anyone who looked at the paper cares about this paste; for folks who did not read the paper, this is not very readable.


> Name stuff well

OP is following the same variable names as the article. I prefer that over changing the variable names and then having to figure out which variable name in the code maps to which in the article.


Speaking of, one of my favorite discoveries with Unicode is that there is a ton of code points acceptable for symbol identifiers in various languages that I just can't wait to abuse.

>>> ᚨ=3

>>> ᛒ=6

>>> ᚨ+ᛒ

9


You're even using a and b, very good.


Ah yes, programming like the vikings intended.


That's not streaming if you're already aware of the length of the iterable.


In Python, you can simply substitute `A` with an iterable or generator object, which can be of unknown length.


But for this algorithm, you need to know the total length ("m") to set the threshold for the register purges.

Does it still work if you update m as you go?


Besides the ideas from istjohn, empath-nirvana, and rcarmo, you can also just "flip the script": solve for epsilon and report that as 1-delta confidence interval for the worst case data distribution as here: https://news.ycombinator.com/item?id=40388878

Best case error is of course zero, but if you look at my output then you will see, as I did, that the worst case is a very conservative bound (i.e. 15X bigger than what might "tend to happen"). That matters a lot for "space usage" since the error =~ 1/sqrt(space), implying you need a lot more space for lower errors. 15^2 = 225X more space. Space optimization is usually well attended for this kind of problem. And, hey, maybe you know something about the input data distribution?

So, in addition to the worst case bound, average case errors under various distributional scenarios would be very interesting. Or even better, "measuring as you go" enough distributional metadata to get a tighter error bound. The latter starts to sound like it's one of Knuth's Hard Questions Which if You Solve He'll Sign Your PhD Thesis territory, though. Maybe a starting point would be some kind of online entropy(distribution) estimation, perhaps inspired by https://arxiv.org/abs/2105.07408 . And sure, maybe you need to bound the error ahead of time instead of inspecting it at any point in the stream.


You would want to calculate the threshold by choosing your target epsilon and delta and an 'm' equal to the largest conceivable size of the stream. Fortunately, the threshold increases with log(m), so it's inexpensive to anticipate several orders of magnitude more data than necessary. If you wanted, you could work backwards to calculate the actual 'epsilon' and 'delta' values for the actual 'm' of the stream after the fact.
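
Concretely, that just means inverting the threshold formula from the Python snippet above and solving for epsilon with delta held fixed (my algebra, so double-check it):

    import math

    def epsilon_after_the_fact(thresh, m, delta):
        # Invert thresh = ceil((12 / eps**2) * log(8 * m / delta)) for eps,
        # ignoring the ceiling; m is the actual length of the stream.
        return math.sqrt(12 * math.log(8 * m / delta) / thresh)

With the toy values from the example above (m=6, delta=0.01), a thresh of about 10172 gets you back to epsilon ≈ 0.1, modulo the ceiling.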


You actually don't need to do that part in the algorithm. If you don't know the length of the list, you can just choose a threshold that seems reasonable and calculate the margin of error after you're done processing (or, I guess, at whatever checkpoints you want if it's continuous).

In this example, they have the length of the list and choose the threshold to give them a desired margin of error.




  return '⊥'
what's this?


An error condition. I decided to do away with it and take a small hit on the error by assuming the chances of the trimmed set being equal to the threshold are very small and that the error condition is effectively doing nothing.

I also changed the logic from == to >= to trigger unfailingly, and pass in the "window"/threshold to allow my code to work without internal awareness of the length of the iterable:

    from random import random

    def estimate_uniques(iterable, window_size=100):
        p = 1
        seen = set()

        for i in iterable:
            if i not in seen:
                seen.add(i)
            if random() > p:
                seen.remove(i)
            if len(seen) >= window_size:
                seen = {s for s in seen if random() < 0.5}
                p /= 2
        return int(len(seen) / p)
I also didn't like the possible "set thrashing" when an item is removed and re-added for high values of p, so I inverted the logic. This should work fine for any iterable.
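
For example, feeding it a generator of unknown length (made-up numbers, and with a window of only 100 the estimate is rough):

    from random import randrange

    # 100,000 draws from 5,000 possible values, supplied as a generator so the
    # length is never known up front; nearly all 5,000 values should appear.
    stream = (randrange(5000) for _ in range(100_000))
    print(estimate_uniques(stream, window_size=100))  # roughly in the vicinity of 5000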


In some symbolic logic classes, that character, "bottom", represents "false", and the flipped version, "top", means true.

Don't know what they're getting at in the code, though.


>In some symbolic logic classes, that character "bottom" represents "false"

That's unfortunate, because in the study of computer programming languages, it means "undefined" (raise an error).


Not always. It is also the uninhabited bottom type.


My point is that there is a difference between a Python function's returning false and the function's raising an error, and sometimes the difference really matters, so it would be regrettable if logic teachers actually did use ⊥ to mean false because programming-language theorists use it to mean something whose only reasonable translation in the domain of practical programming is to raise an error.

I have no idea what your point is.


Once again proving the need for comments in code. Especially for comments that are more useful than "initialize parameters"


An easy way to identify who copies code without understanding it.


You can just replace it with something like: print ('Invalid thresh or something')


This however looks scary so an innocent copy/paste programmer wouldn't touch it.


New leetcode hard question just dropped.


Is this ChatGPT? Also, I feel like this would be more useful if it included import statements.


  import math
  import random


HyperLogLog uses additions; it keeps sums. Thus, you can subtract one HLL's sums from another's. This is useful if the stream supports deletion. Streams with deletions can be found in log-structured merge trees, for one example, so one can estimate the count of distinct elements in all of the LSM tree hierarchy.

The algorithm in the paper does not allow for deletions.

Also, if one counts statistics of the stream of large elements (say, SHA-512 hashes, 64 bytes per hash), this algorithm requires some storage for elements from this stream, so memory requirement is O(table size * element size).


Huh, that's a clever twist on reservoir sampling. Neat.


From the paper [0]:

> We state the following well-known concentration bound, Chernoff bound, for completeness.

Which variant of the Chernoff bound is this? This is almost the (looser variant of the) multiplicative form, but it's not quite right (per the use of 1+delta instead of a single parameter beta). In particular, that bound is only guaranteed to hold for delta >= 0 (not beta = 1 + delta > 0 as asserted in the paper)

[0] https://arxiv.org/pdf/2301.10191

[1] https://en.wikipedia.org/wiki/Chernoff_bound#Multiplicative_...

edit: to be clear: I'm not sure at all whether this breaks the proof of correctness, although I'm having a bit of difficulty following the actual details (I think I'd need to work through the intermediate steps on paper).


The algorithm is simple, which is why it is also suitable for textbooks. However, it is far from representing the state of the art. In case you are interested in the latest development, I have recently published a new distinct counting algorithm where the estimate does not depend on the processing order, which can be merged and requires only 3.5kB of memory to guarantee a standard error of 1.13% for counts up to exa-scale: https://arxiv.org/abs/2402.13726


I jumped on the code generation bandwagon pretty soon after ChatGPT was released. I like to think of it as similar to a catalyst in a chemical reaction: it lowers the activation barrier to writing code, especially in contexts that are new to you. It makes it easier to just get started. However, it struggles with solving actual problems or even putting together the right context. If you know how to break the problem down into simpler steps, it can help with those. It won't write a new kernel driver for you.


Wrong thread?


IDK if my explanation is correct, but I do believe it is. It goes as follows.

Imagine that you have a container of potentially limitless capacity. The container starts with a small capacity, equal to the real limited capacity that the real algorithm uses. As you add elements, when the container is full, its capacity is doubled, but all elements are then placed in a random position.

When you're done, you're told the occupancy of the subset of the large container corresponding to the initial size, and how many times the container size was doubled. Multiplying that occupancy by two to the power of the number of doublings gives you an approximation of the real size.

The small catch is that in the actual algorithm, due to the discarding, the final number of elements, the occupancy, is somewhat erroneous.

EDIT

Another way to say this: you've got a container of limited capacity S. When full, you "virtually" double its size and then randomly move elements over the full "imaginary" size of the virtual container. So after the first filling, you end up with about 1/2 of the elements. After the second filling, 1/4, etc. Also, since your "virtual" container is now larger, when you add a new element, there is only a 1/2^n chance it will be placed inside your limited-capacity view of the entire virtual container.

At the end, the approximate real count is the number of elements you've got multiplied by 2 to the power of the number of size doublings.

Again, it is as if you have a small window into a limitless container.


The algorithm uses less memory, but more CPU time because of rather frequent deletions, so it's a tradeoff, not just a generally better algorithm, as the article may suggest.


the list is small so the cost of deletions should be small.


"Computer scientists invent an efficient new way to approximate counting large unique entries" - fixed the title


Ah, so Thanos was just conducting a census.


I see what you did there.


Does anyone else notice that quanta consistently has good illustrations and diagrams for their articles? Good visualizations and graphics can do extraordinary things for the understandability of a topic -- people like 3Blue1Brown understand this, and I'm glad quanta puts effort into it.


This is actually a really sad, redundant example of the regression of education in the United States.

This "sure to be lauded" technique is pretty simple and has been exploited by simpleton gamblers for a very long time.

It's basically simple math disguised as "mentalism".

But since you, "very new to " "the real world" people figure out your folly. I promise I wont laugh. But I am laughing at how incredibly hollowed out the education system of the USA is when a simple compund method for an old trick, substitutes itself for math.

The "math" involved is purely low IQ points on the inventors side.


After skimming Knuth's paper — does the algorithm work if values are hashed, that is, the "uniform deviate" is selected deterministically for each unique value of the stream?


Not sure which Knuth paper you're referring to but skimming through the article my understanding is this algorithm works /only/ if the values are hashable. IOW how else does one define unique/distinct values ?



https://cs.stanford.edu/~knuth/papers/cvm-note.pdf

note how "u"s are selected every time value is not in a list. I don't read it as being a hash.


I think the analysis relies on independent random "u"s, even for the same key.


In the algorithm depicted in the paper, if no elements manage to be eliminated from the set, why not just retry rather than return ⊥?


Retrying would cause errors in the math. Personally, I think a better modification would be to go into the next round right away.


The counting/estimation technique is rather like a floating point number: an integer exponent k and a mantissa of a population.


Whipped up a quick PHP version for fun:

https://github.com/jbroadway/distinctelements


New interview question: Tell me how you would estimate the number of unique words in Hamlet using only 100 elements.


I wish this was a joke...


> The trick, he said, is to rely on randomization.

> When the space is full, press pause and flip a coin for each word. Heads, and the word stays on the list; tails, and you delete it.

I wasn't expecting it to go that far: randomization. How can you verify whether the answer is good? Only by approximation, maybe...


The proof is pretty straightforward and is included in the paper https://arxiv.org/abs/2301.10191


Yes, the result is an estimation.


Thanks. Just wanted to be sure I didn't misunderstand.


Did you read the (entire) article?


While the algorithm is interesting and I commend the authors for the algorithm, it should be noted that this is a guesstimate only. So, it's not really an efficient way to count distinct values but an efficient way to get an approximate count of distinct values.


Feels like reservoir sampling to me


I took a crack at implementing this in Go. For anyone curious I settled for algorithm 2 as I can just use a map as the base set structure.

https://github.com/tristanisham/f0


Does finding the number of unique elements in a set actually require comparison of each element with everything else? Can't you use a hashtable? For every element, add it to the table (ignore if already exists), and finally, take a count of keys.


> Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.


Imagine 1PB of data where you expect 30% of it to be unique. That needs 300TB of RAM to store the unique elements. Keep in mind the values in a hash table are the elements themselves, so a hashtable of perhaps 300TB. Doing that without that much RAM, even swapping to disk, can be tough.


Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.


Using a hashtable is effective because you only compare elements within their hash buckets, not the entire set. However, they can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.


This algorithm still generates a lot of random numbers. I would guess that this is much less overhead than hashing, but it still seems like it could be significant.


That is fine when you have say 1 million values and only 1000 are unique. But when you have 1 million values and about 900 thousand are unique you are putting more or less the whole data set into memory.


Imagine a million elements. How big must your hashtable be? The article explains it very well, did you miss it? It's a way to save memory.

But to be honest, I implemented it, ran it on Hamlet, and it's very wrong; it's barely useful, but maybe fine if you just need a vague idea...


How big was your thresh? I found it pretty accurate: https://news.ycombinator.com/item?id=40388878


This actually comes at a good time. I'm currently refactoring a system that counts visits using inserts into a table, with a tuple of date and IP. I was planning to replace it with a HLL approach, but this is really interesting.


It is quite amazing that after we had so many talented researchers working for decades on this problem, the authors were able to come up with such an elegant and simple solution. Another proof that research is very necessary and useful.


I found myself wondering if this approach could be applied to CFD (computational fluid dynamics) methods to reduce the total volume of data points while still getting approximately close to the correct final answer?


Can some native speaker tell me: is "count" the new "guess"?


In the post-GPT world? Yes. Wait, maybe actually no? Depends on context. https://www.reddit.com/r/ChatGPT/comments/16jvl4x/wait_actua...


It's estimation, not counting.



Am I correct in thinking the accuracy of this method has more to do with the distribution in the sample than the algo itself? Meaning, given a 100% unique list of items, how far off would this estimate be?


It doesn't appear to be heavily dependent on the distribution. It mostly depends on how large your buffer is compared to the number of unique items in the list. If the buffer is the same size or larger, it would be exact. If it's half the size needed to hold all the unique items, you drop O(1 bit) of precision; at one quarter, O(2 bits), etc.


I don't think so. I think the point of the removal in

  X <- X \ {a_i}
  With probability p, X <- X U {a_i}
is that afterwards a_i is in X with probability p regardless of how many times it occurs in the stream.
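
A quick simulation (not a proof, and it holds p fixed to isolate just this update step) makes that easy to check: a value repeated many times should end up in X just as often as a value seen once.

    import random

    def membership_rate(stream, p, trials=10_000):
        # Run only the discard-then-readd update (no halving) many times and
        # report how often each distinct value is in X at the end.
        hits = {v: 0 for v in set(stream)}
        for _ in range(trials):
            X = set()
            for a in stream:
                X.discard(a)
                if random.random() < p:
                    X.add(a)
            for v in X:
                hits[v] += 1
        return {v: hits[v] / trials for v in hits}

    # 7 appears 100 times, 3 appears once; both rates come out near 0.5.
    print(membership_rate([7] * 100 + [3], p=0.5))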


I wonder how the error changes if you start a new round just before finishing the stream, compared to if the last word in the stream just fills the buffer and ends a round.


very interesting solution. A perfect example of how even some very complicated problems can sometimes have simple solutions. Will definitely try to write an implementation of this.

The only minor downside to this is that it's obviously not deterministic since it depends on chance. But for many applications where the dataset is so big it doesn't fit in memory, that's probably just a tiny rounding error anyway.


With other count-distinct algorithms, you can do unions and intersections on the "sets" (theta sketches, even bloom filters).

Can you do this with CVM?


Unfortunately not, and that's an interesting open problem, as other count-distinct algorithms don't work for "structured sets", while this one does.

https://dl.acm.org/doi/10.1145/3452021.3458333


I wish we had a theory connecting randomness with time/space complexity.

Intuitively I think the key to many computational limitations is making use of randomness.


Your intuition is correct. See https://en.wikipedia.org/wiki/BPP_(complexity)


Which books are in the background? I can't read their titles because the picture is blurred too much. Help please! :)


Can LLM’s invent a new simple algorithm like this which it has never seen before? I tried to induce ChatGPT to invent this algorithm, giving it hints along the way, but it came up with something else that I’m unsure is correct. Then again, most humans wouldn’t be able to do that either. But how intelligent is AI if it can’t invent anything truly novel and significant?


parallelism is an important consideration on modern systems. While the memory clearing step parallelizes, the inner loop seems hard to parallelize?

also, how do we combine two sketches? (another form of parallelism)


Do they assume any particular distribution of the items?

Otherwise Nassim Taleb would like a word with them.


No, and Taleb is not relevant for this.


It is if the distribution is extreme.


If Hamlet were 100000000 times the word Hamlet and 1 time the word pancake, it would still give a good estimate, even if pancake gets 0.


"In a recent paper" - really? The paper first appeared in ESA 2022. The revised version (with some errors fixed) is from May 2023. To be fair, compared to the rate of progress in this area between Flajolet & Martin (1985) and the previous SOTA (Blasiok, 2018), a couple of years of sitting on this news is not a lot.


Sounds like this qualifies as "recent" to me.


This Nim program might clarify some things for someone.

    import std/[random, math, sets, times]; type F = float
    
    template uniqCEcvm*(it; xfm; k=100, alpha=0.05): untyped =
      ## Return (unbiased estim. cardinality of `it`, 1-alpha CI ebound*(1 +- e)).
      var x1 = initHashSet[type(xfm)](k); var x  = x1.addr
      var x2 = initHashSet[type(xfm)](k); var y  = x2.addr
      var p  = 1.0          # p always 2^-i means could ditch FP
      var n  = 0            #..RNG & arith for bit shifts/masks.
      let t0 = epochTime()
      for e in it:
        inc n   #; let e = xfm # to compute randState.next here
        x[].excl e          # 'd be nice to reduce to 1 hash by
        if rand(1.0) < p:   #..excl w/prob 1 - p but then cannot
          x[].incl e        #..add new keys.  p -> 0 anyway.
        while x[].len == k: # KnuthLoop not CVMIf guards against
          for e in x[]:     #..unlikely case of freeing nothing.
            if rand(1.0) < 0.5: y[].incl e
          p *= 0.5
          swap x, y; y[].clear # ↓ e.bound conservative by ~15X
      (int(x[].len.F/p), sqrt(12.0*ln(8.0*n.F/alpha)/k.F),
       (epochTime() - t0)*1e9/n.F)
    
    when isMainModule:  # Takes nItem/trial dupMask space nTrial
      when defined danger: randomize()
      import std/[strutils, os, strformat]
      let n   =  if paramCount()>0: parseInt(paramStr(1)) else: 100000
      let msk = (if paramCount()>1: parseInt(paramStr(2)) else: 0xFFFF).uint64
      let k   =  if paramCount()>2: parseInt(paramStr(3)) else: 512 # 32KiB
      let m   =  if paramCount()>3: parseInt(paramStr(4)) else: 35
      var ex  = initHashSet[uint64]()
      var xs  = newSeq[uint64](n)
      for j in 1..m:
        xs.setLen 0; ex.clear
        for i in 1..n: xs.add randState.next and msk
        for e in xs: ex.incl e
        let (apx, err, t) = uniqCEcvm(xs, uint64, k)
        let ap = apx.F; let an = ex.len.F
        echo ex.len," ",apx,&" eb: {err:.4f} |x-a|/x: {abs(ap-an)/an:.4f} {t:.1f}ns"
First few lines of laptop output:

    51160 49152 eb: 0.6235 |x-a|/x: 0.0392 10.5ns
    51300 51840 eb: 0.6235 |x-a|/x: 0.0105 10.7ns
    51250 55168 eb: 0.6235 |x-a|/x: 0.0764 10.9ns
    51388 52352 eb: 0.6235 |x-a|/x: 0.0188 10.5ns
Ditching the floating point RNG & arithmetic might be able to lower that time per element cost to more like 5 ns or maybe 2GiB/s for 8B ints. The toggle back & forth between old & new HashSet is just to re-use the same L1 CPU cache memory.

Note also that the error bound in the CVM paper seems very conservative. My guess is that it could maybe be tightened by 15+X (at the scale of those example numbers), but that might also require some kind of special function evaluation. Anyone have any ideas on that?

Also note that since this is just estimating unique identities, you can always just count unique 64-bit hashes of keys (or maybe even 32-bit hashes depending on your accuracy needs and cardinalities) if you care about the key space/key compares are slow.


Frankly, who can read this!? I am not sure what's worse: the multi-line comments spanning multiple lines of code, having multiple instructions on a single line, or the apparent disconnect from the pseudo-code in the article.


I would blame the majority of your criticism on the fact that HN is not the best place to read code. Also, syntax highlighting & basic familiarity with Nim helps.

His code is doing a few more things than necessary. The actual algorithm is inside the `uniqCEcvm` template. The `it` it receives is anything you can iterate over (a collection or an iterator). Multiple things in one line really only appear where they directly relate to the part at the beginning of the line.

The `when isMainModule` is Nim's way of Python's `if __name__ == "__main__"`. The entire part below that is really just a mini CL interface to bench different (random) examples. Final thing to note maybe, the last expression of a block (e.g. of the template here) will be returned.

And well, the style of comments is just personal preference of course. Whether you prefer to stay strictly below 80 cols or not, shrug.

I grant you that the usage of 2 sets + pointer access to them + swapping makes it harder to follow than necessary. But I assume the point of it was not on "how to write the simplest looking implementation of the algorithm as it appears in the paper". But rather to showcase a full implementation of a reasonably optimized version.

Here's a version (only the algorithm) following the paper directly:

    proc estimate[T](A: seq[T], ε, δ: float): float =
      let m = A.len
      var p = 1.0
      var thr = ceil(12.0 / ε^2 * log2(8*m.float / δ))
      var χ = initHashSet[T](thr.round.int)
      for i in 0 ..< m:
        χ.excl A[i]
        if rand(1.0) < p:
          χ.incl A[i]
        if χ.card.float >= thr:
          for el in toSeq(χ): # clean out ~half probabilistically
            if rand(1.0) < 0.5:
              χ.excl el
          p /= 2.0
          if χ.card.float >= thr:
            return -1.0
      result = χ.card.float / p


Thank you very much for having taken the time.. Your comments and function are both very helpful!


Could you pick 1000 randomly and repeat the process endlessly?


The title is misleading. This is about probabilistic counting, therefore about estimation. It is not clear if the efficiency benefit extends to exact counting. Counting and estimation are not the same concepts.



Obviously it wouldn’t


Which books are in background?


There is a big practicality problem I see with this algorithm. The thresh defined in the paper relies on the length of the stream. It seems to me that in a scenario where you have a big enough data set to want a solution that doesn't just store every unique value, you don't know the length. I did not make it through all the proofs, but I believe they use the fact that the defined threshold has the length in it to prove error bounds. If I were to use this in a scenario where I need to know error bounds, I would probably ballpark the length of my stream to estimate error bounds and then use the algorithm with a ballpark threshold depending on my system's memory.

Another practical thing is the "exception" if nothing is removed on line 6 in the original algorithm. This also seems needed for the proof, but you would not want it in production, though the chance of hitting it should be vanishingly small, so maybe worth the gamble?

Here is my faithful interpretation of the algorithm. And then a re-interpretation with some "practical" improvements that almost certainly make the provability of the correctness impossible.

    func CountUnique(scanner *bufio.Scanner, epsilon float64, delta float64, m int) int {

    X := make(map[string]bool)
    p := 1.0
    thresh := int(math.Ceil((12 / (epsilon * epsilon)) * math.Log(8*float64(m)/delta)))

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) == thresh {
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
            if len(X) == thresh {
                panic("Error")
            }
        }
    }

    return int(float64(len(X)) / p)
}

  func CountUnique2(scanner *bufio.Scanner, thresh int) int {

     //threshold passed in, based on system memory / estimates
    X := make(map[string]bool)
    p := 1.0

    for scanner.Scan() {
        a := scanner.Text()
        delete(X, a)
        if rand.Float64() < p {
            X[a] = true
        }

        if len(X) >= thresh {  // >= instead of == and remove the panic below
            for key := range X {
                if rand.Float64() < 0.5 {
                    delete(X, key)
                }
            }
            p /= 2
        }
    }

    return int(float64(len(X)) / p)
}

I tested it with Shakespeare's work. The actual unique word count is 71,595. With the second algorithm it is interesting to play with the threshold. Here are some examples.

threshold 1000 Mean Absolute Error: 2150.44 Root Mean Squared Error: 2758.33 Standard Deviation: 2732.61

threshold 2000 Mean Absolute Error: 1723.72 Root Mean Squared Error: 2212.74 Standard Deviation: 2199.39

threshold 10000 Mean Absolute Error: 442.76 Root Mean Squared Error: 556.74 Standard Deviation: 555.53

threshold 50000 Mean Absolute Error: 217.28 Root Mean Squared Error: 267.39 Standard Deviation: 262.84


If you don't know the length of the stream in advance, you can just calculate the margin of error when you're done, no?


You just solve for epsilon. That's what I did: https://news.ycombinator.com/user?id=cb321


At first I found this paragraph confusing:

> Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

Why would you delete a word that's already on the list by flipping coins? Doesn't this reduce the accuracy by counting fewer words than expected? And will the word be added to the list later?

After thinking about it for a while and reading the paper, I've finally developed a good mental model for how this algorithm works as below, which should convince even a high schooler why the algorithm works:

1. You are given a streaming set of elements from [n] (a finite set of n distinct elements). Now let's give each element a random uniform real number in [0,1]. This helps us choose the elements we want: if we choose all elements with the number below 0.5, we get about half of the elements.

2. Assume for a moment we have unbounded memory. Now we maintain a table of already seen elements, and for each element we keep the last real number attached with that element: so if an element A appears three times with the number 0.3, 0.7, 0.6, we keep A with the number 0.6.

3. Our memory is bounded! So we keep only the elements below a threshold 2^-r, that is, 1, 0.5, 0.25, etc. So at first we will keep all elements, but when we don't have enough memory, we will need to filter the existing elements according to the threshold. It's easy to see that when we decrease the threshold, only elements already in memory will meet the criterion and continue to exist in memory. No other elements in the stream will be below the threshold. Also note that whether an element is in memory depends only on its last occurrence. It can exist in memory for a while, get dropped because a later occurrence does not meet the threshold, and get back in.

4. We don't actually have a real number attached to every element! But we can pretend as if there is one. For each new element X from the stream, we replace the number in memory with its attached number, and we only care if its attached number is below 2^-r or not. If it is, it should be in our memory, and if it's not, it should be out. Once the number is in memory, it's a random number in [0,2^-r] and we care no further.

When increasing r, we only care about whether the number is in [0,2^{-r-1}]. This has exactly 1/2 probability. So each number has 1/2 probability of getting out, and 1/2 probability of being kept in memory.

5. Now it's easy to see that whether an element is in the memory depends solely on the real number attached to its last occurence. That is, each distinct element has exactly 2^-r probability of being in memory. 2^r multiplied by the number of elements in memory gives a good estimate of number of distinct elements.

They criticized previous approaches as relying on hash functions. A simplified version of the approach is as follows:

1. Make a hash function that maps distinct elements to different uniform real numbers. So all As compute to 0.3, all Bs compute to 0.7, etc.

2. Use that hash function to transform all elements. The same element maps to the same real number, and different elements map to different real numbers.

3. Keep a set of N smallest distinct real numbers. This works by inserting numbers from the stream to the set. It's easy to see that adding the same element multiple times has exactly no effect; it's either too big to be in the set or the same as an existing number in the set.

4. This gives the Nth smallest real number in the set as K. The number of distinct elements is approximated as N/K.

This algorithm is remarkable, but not in that it's efficient or new. Both algorithms using hash functions and algorithms without hash functions have been around and are optimal. I assume this algorithm is the first optimal algorithm without hash functions that is very easy to understand, even by undergraduates.
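
For concreteness, a tiny sketch of that hash-based (k-minimum-values) scheme, with hashlib standing in for a proper uniform hash (my own simplification, not the exact construction from any particular paper):

    import hashlib

    def kmv_estimate(stream, n=256):
        # Hash every element to a pseudo-uniform number in [0, 1), keep the n
        # smallest distinct values; if the n-th smallest is K, estimate n / K.
        def to_unit(x):
            digest = hashlib.blake2b(repr(x).encode(), digest_size=8).digest()
            return int.from_bytes(digest, "big") / 2**64

        smallest = set()
        for x in stream:
            h = to_unit(x)
            if h in smallest or (len(smallest) >= n and h >= max(smallest)):
                continue
            smallest.add(h)
            if len(smallest) > n:
                smallest.remove(max(smallest))
        if len(smallest) < n:
            return len(smallest)   # fewer than n distinct values seen: exact
        return int(n / max(smallest))

    # e.g. kmv_estimate((i % 50_000 for i in range(1_000_000))) comes out roughly near 50,000.

The repeated max() scans are just to keep the sketch short; a heap or sorted container is the obvious optimization.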


I had the same problem with the same paragraph and still don’t quite get it.

Unfortunately I struggle to follow the detailed explanation you gave… since you seem to understand it… can you confirm that they really do mean to throw away the word in the list they just found?

Eg

ACT I SCENE Elsinore A platform before the castle FRANCISCO at his post. Enter to him

If the buffer max is 16, am I supposed to randomly halve it first?

Act Scene Elisnore Platform Before The His Post

Now what? The next words are “Bernado Bernado who’s there”


Yes, you're supposed to throw it away.

The key insight is that words should appear in your scratch space with equal probability, no matter how often they appear in the source text. If you have a scratch space of size one, then the sequence of "apple" x 16 + "banana" x 1 should have equal chances of the scratch space containing [apple] or [banana] at the end of the sequence, at least averaged over all 17 permutations.

One way to achieve this result is to make the scratch space memory-free. Rather than think of it as "remove the word with probability x", think of it as "always remove the word, then re-add it with probability (1-x)".


Aaaah, remove and then re-add (maybe) makes a bit more sense...

So if it is not in the list you just add it, right? Actually is that right? Won’t the list fill up to the max again at some point like this?

So if so, I add Bernardo. Now the very next word is Bernardo so I remove the last Bernardo and maybe re-add it based on a 50% chance.


The way I'd describe it is:

- You have a buffer. You initially try to fit every unique word on there.

- If the buffer gets full, you know that you can't fit all the unique words in there. So you decide to keep only a fraction, _p1_, of them. Run through the buffer, keep each value with probability _p1_.

- Keep adding new words, again only with probability _p1_.

- If the buffer gets full again, _p1_ was too large, so you choose a lower fraction, _p2_, retain existing elements only with probability _p2_/_p1_, and continue adding new words with probability _p2_.

- Every time the buffer gets full, you lower the fraction of words you try to retain.

The choice of using _pn_ = (1/2)^n is just easy for a computer, it only needs entire random bits at each step.

What I _don't_ get is how this is correct for counting unique words. If I have a text of 16384 unique words, then I can accept that each will be in the final list with the same probability. But if I take that list and repeat the last word 30000 times, then it becomes overwhelmingly plausible that that word is in the final list, even though I haven't changed the number of unique words. Maybe there is some statistical property that evens things out, but I couldn't see it from the article.


And I got it. When the algorithm sees a word that is already in the list, it discards the word in the list first. Then it adds the word again with the same probability as any other word. This ensures that only the last occurrence of each word can occur in the final list, so the final occurrence of each word is in the final list with the same probability, and prior occurrences are always removed, if not earlier then when the next occurrence is seen.

If the input is known to be large, there is no reason to start by adding every element. It can treat the first round like any other, and start out with a _p0_ that is smaller than 1.


Now we can finally count how many unique UUIDs there really are! My memory was never enough for this task before.


nice read!


Knuth talks about this paper here - "The CVM Algorithm for Estimating Distinct Elements in Streams" - https://cs.stanford.edu/~knuth/papers/cvm-note.pdf


I have questions for Quanta Magazine -- this is probably a shot in the dark, but if anyone can ask them a few questions, here would be three of mine:

1. Have independent researchers evaluated the effectiveness of Quanta Magazine at their mission? [1]

2. In the case of this article, did the editors consider including visualization? If not, why not?

3. Does Quanta have interactive designers and/or data scientists on staff? How many? Why not more?

## Background & Biases

I'm trying to overcome my disappointment in Quanta by being curious about their mission, organization, and constraints. I am glad they exist, but sometimes your "closest allies" can be one's harshest critics.

I'm greatly troubled by the level of scientific, mathematical, and rational understanding among the denizens of the world. But for some reason, the state of science writing bothers me more. It would seem that I hold out hope that science writers could do better. This may be unfair, I admit.

Anyhow, rather than just bash Quanta for being, say, not as good as the best math blogs or YouTube channels (such as 3Blue1Brown), I really want to figure out (1) if I'm missing something; or (2) if they are actively trying to improve; and (3) what we can all learn from their experience.

[1] From https://www.quantamagazine.org/about/ : "Quanta Magazine is an editorially independent online publication launched by the Simons Foundation in 2012 to enhance public understanding of science. Why Quanta? Albert Einstein called photons “quanta of light.” Our goal is to “illuminate science.”"


This is not a complaint. For people that downvoted, may I ask why? To be clear, I'm not asking people to _explain_ the downvotes; I am only asking the people that actually downvoted.

Here are some guesses:

- Are my questions too harsh or unfair? Off-topic?

- Do you not like that I openly admit my priors?

- You hold it against me that I don't think science journalism is very good?

- Do you think Quanta is of high quality?

- ... unique? Not worth challenging?

- ... doing all it can?

- As a point of comparison, have you looked at how many IX people and data scientists other news organizations have?

- Am I expecting too much?

I'm genuinely curious. Persuade me that I'm missing something. I'm open to it.

For what it is worth, I've attended many events associated with the Simons Institute; I've never found an organization like it. But I don't hold any organization up as above criticism. I have a feeling that Simons could be doing much more with Quanta.

To me, the downvotes without comments (above) are likely informative, but not for the reasons you might think. They suggest people want to express disapproval, but think a response is not warranted or would take too much time. But why?

My hunch is that I've struck a nerve. When people are uncomfortable, they often downvote. Sacred cows. Confirmation bias. Both are strong, even when you know that these things are cognitive errors. All in all, the downvotes may well suggest that people have huge irrational responses and/or blind spots around this.


> Imagine that you’re sent to a pristine rainforest to carry out a wildlife census.

alt-f4


CS guys always wanting to throw away a good system and start from scratch.


Which good system are you referring to?


Which good system are we talking about ?


Tell me you didn't read the article without telling me you didn't read the article.

...but actually, I think you didn't even read the headline...?


Yeah, if only we have stuck with those stone wheels from 20,000 B.C.


Well, they are easier to work on.



