
10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.

It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.

As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.

The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.
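A single-machine sketch of the idea in Python (illustrative only; the real version was spread across MapReduce mappers and reducers, and the names here are made up):

    from collections import Counter

    CAP_MS = 999  # everything at or above 999 ms lands in the last bin

    def build_histogram(timings_ms):
        # One pass: count occurrences of each capped integer-millisecond value.
        counts = Counter()
        for t in timings_ms:
            counts[min(int(t), CAP_MS)] += 1
        return counts

    def median_from_histogram(counts):
        # Walk the bins in order until the cumulative count reaches the middle.
        total = sum(counts.values())
        middle = (total + 1) // 2  # position of the (lower) median
        seen = 0
        for value in range(CAP_MS + 1):
            seen += counts[value]
            if seen >= middle:
                return value
        return CAP_MS

The histogram has at most 1000 bins no matter how many raw values feed it, so the final walk is trivial once the single distributed pass is done.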



Did you actually need to find the true median of billions of values? Or would finding a value between the 49.9th and 50.1st percentiles suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.

(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)
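For concreteness, something like this (the function name and the ~95% figure below are my own back-of-the-envelope, not anything rigorous):

    import random
    import statistics

    def approximate_median(values, sample_size=10_000, seed=None):
        # Uniform sample without replacement, then the exact median of the sample.
        rng = random.Random(seed)
        sample = rng.sample(values, min(sample_size, len(values)))
        return statistics.median(sample)

By a normal approximation to the binomial, with 10,000 samples the estimate falls between roughly the 49th and 51st percentiles of the full data about 95% of the time. If the data only arrives as a stream, reservoir sampling gives the same kind of uniform sample in one pass.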


The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.

Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler than trying to understand whether we were talking about 49-51% or 40-50%. And if the margin was too big, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.


Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"
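For example, a hand-rolled single-pass fold (Python here just to show the shape; any language with iterators can do the same):

    def min_max(xs):
        # One pass over any iterable, tracking both extremes as we go.
        it = iter(xs)
        try:
            lo = hi = next(it)
        except StopIteration:
            raise ValueError("min_max() arg is an empty iterable")
        for x in it:
            if x < lo:
                lo = x
            elif x > hi:
                hi = x
        return lo, hi

The point isn't that this is hard to write by hand; it's that the idiomatic .min()/.max() style quietly costs a second pass, while the SQL query expresses the fused version directly.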


What's wrong with doing it in two passes? N iterations each doing 2 operations is exactly the same cost as 2*N iterations each doing 1 operation. Because multiplication is commutative.


This is true, but a big advantage of the single pass method is data reuse. Instead of loading each element twice, you load each element once. Per bit, reading/writing to external memory is massively more energy intensive than any compute op. Orders of magnitude more.


It’s more like 2NM where M is loading the data from disk/memory. One pass is 2N+M.

Why go to the store and back twice to buy two things instead of buying two things in one trip? ;p


I think you meant 2N+2M vs 2N+M, but yes, that’s the point: not reading twice is cheaper because unlike in traditional big-O analysis compute is cheap but data access is very expensive.


Yep, thx.


Does C# have the plumbing for this built in? It's been 7 years since using it so I might not be remembering correctly.


IEnumerable<T>.Max and .Min are the same as they were, just significantly faster through the use of SIMD: https://github.com/dotnet/runtime/blob/ebbebaca1184940f06df6...

You could implement a similar (and simpler) fast path for types with contiguous memory by performing min and max per iteration instead.


I was thinking LINQ was optimized to do a single enumeration for cases such as this. Can LINQ be easily, naturally, and idiomatically used with `IEnumerable<T>.Max` and `.Min` to avoid looping twice, or is it more of a rigorous and intentional optimization the engineer would have to seek out?


what's preventing you from doing it in one pass?


Actually, seeking the bias numbers can be quite illuminating.


You're absolutely right! Something we learned later, not unrelated to what you're saying: don't just look at metrics (in the case I've described above, it was timings of operations in a large system), look at histograms of them. You should be able to explain why they have the shape they do. The distributions are so often multimodal, and understanding the modes helps you understand a lot more nuance about your system.

We were an infra-focused team building up a scalable data handling / computation platform in preparation to apply ML at a non-trivial scale. When we finally hired some people with deep stats knowledge 10 or 12 years ago, there was a lot of great learning for everyone about how each of our areas of expertise helped the others. I wish we'd found the right stats people earlier to help us understand things like this more deeply, more quickly.


Even some regular engineers with experience dealing with many servers will have built up an intuition to exploit.

In fact, this is often where decisions on what metrics to have in the first place come from. Ask why. You can go far without deep stats knowledge!


> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

I don't see how that could possibly be true... For example, if your original points are sampled from two Gaussians with centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humungous number of samples to get anywhere close to it.

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.


> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.
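A rough sketch of that doubling idea, with made-up names and thresholds:

    import random
    import statistics

    def adaptive_median(values, start=1_000, tolerance=1.0, seed=0):
        # Keep doubling the sample size until the estimate stops moving
        # by more than `tolerance`; falls back to the full data if needed.
        rng = random.Random(seed)
        n = start
        previous = statistics.median(rng.sample(values, min(n, len(values))))
        while n < len(values):
            n *= 2
            current = statistics.median(rng.sample(values, min(n, len(values))))
            if abs(current - previous) <= tolerance:
                return current
            previous = current
        return previous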


Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.

[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.


Yet, depending on the distribution, the required sample size can be arbitrarily large. Unless I've missed something, I don't see the relevance.

If you want the n-9s rate of failure (e.g., n=5: 99.999%) for a system with a power-law performance distribution, you could be waiting far more than a billion samples before you see a failure.

E.g., there are about 3E10 ms (30 bn samples) in a year; at a 5-9s failure rate that's about 3E5 ms of failure (300 thousand samples) -- and the number actually observed will be lower still under a sampling strategy.


Relevance: from the cited paper, the variance of the median estimator is proportional to 1/(n * f^2), where n is the sample size and f is the density at the median.

Two observations:
1. With sufficiently large n, you can control your variance to an acceptable level.
2. The factor f is outside of your control. If your problem has a small density around the median, then you'll need to throw more samples to compensate for it.

I think your concern is about #2: you can always construct a pathological distribution to make the sampling-based approach unattainable. The paper provided guidance on what to expect when you apply the sampling-based approach.
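If you want to see that scaling without the math, a small simulation does it (sketch only; standard normal data, so the density at the median is 1/sqrt(2*pi), and the full asymptotic result is Var ≈ 1/(4 n f(m)^2)):

    import math
    import random
    import statistics

    def sample_median_variance(n, trials=2000, seed=1):
        # Empirical variance of the sample median over many repeated draws.
        rng = random.Random(seed)
        medians = [statistics.median(rng.gauss(0.0, 1.0) for _ in range(n))
                   for _ in range(trials)]
        return statistics.pvariance(medians)

    f_at_median = 1.0 / math.sqrt(2.0 * math.pi)  # N(0,1) density at 0
    for n in (100, 400, 1600):
        predicted = 1.0 / (4.0 * n * f_at_median ** 2)
        print(n, round(sample_median_variance(n), 5), round(predicted, 5))

Quadrupling n cuts the variance by about 4x, and a flatter density at the median (smaller f) inflates it, which is exactly the pathological case being discussed above.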


The key word is “uniformly”. If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you’ll be able to reliably find an approximation of the median much faster than you would find the actual median.

https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...


I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.

That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.


I was thinking the same thing.

In all use-cases I've seen a close estimate of the median was enough.


You also can use the fact that for any distribution, the median is never further than 1SD away from the mean.
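(For the curious, the standard one-line argument, with m the median, mu the mean, and sigma the standard deviation: the first and last inequalities are Jensen, and the middle one uses the fact that the median minimizes E|X - c|.)

    |\mu - m| = |\mathbb{E}[X - m]| \le \mathbb{E}|X - m| \le \mathbb{E}|X - \mu| \le \sqrt{\mathbb{E}[(X - \mu)^2]} = \sigma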


I am not sure, but from the outside it looks like what Prometheus does behind the scenes. It seems that way to me because, on some systems I worked on, Prometheus had a latency limit of around 10s. When we had requests above that limit, they all showed up as 10s, even though the real latency could have been higher. Interesting.


Was this by any chance for generating availability metrics, and were you an intern at the time? The system sounds, ah, very familiar.


The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)


I’m not sure why you use a dictionary with keys 0…999, instead of an array indexed 0…999.


I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.
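In that framing, the timing piece looks roughly like this (names invented for illustration; the real code was tied to our framework and did much more per entry):

    CAP_MS = 999

    def map_log_entry(entry):
        # Mapper: parse one log entry and emit a (capped_timing_ms, 1) pair,
        # alongside whatever other key/value pairs the entry contributes.
        timing_ms = min(int(entry["timing_ms"]), CAP_MS)
        yield (timing_ms, 1)

    def reduce_timing_counts(timing_ms, counts):
        # Reducer: the framework groups values by key; sum the counts for one bin.
        yield (timing_ms, sum(counts))

The output of the reduce stage is exactly the histogram described above, as a stream of at most 1000 (timing, count) tuples.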


That's just a dict/map with less flexibility on the keys :D


Where were you working? Sounds like you got lucky to work on some fun problems!


Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.

But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).


> I'm trying to keep this account relatively anonymous

I totally respect that—don't let me pressure you any further.

Yeah, sounds like you had the ideal work setup: cool problem requiring cool technical solutions with good management. That's wonderful!



