Hacker News | mattb314's comments

Super rough summary of the first half: in order to pick out random vectors with a given shape (where the "shape" is determined by the covariance matrix), MASS::mvrnorm() computes some eigenvectors, and eigenvectors are only well defined up to a sign flip. This means tiny floating-point differences between machines can result in one machine choosing v_1, v_2, v_3, ... as eigenvectors, while another machine chooses -v_1, v_2, -v_3, ... The resulting stream of random numbers is totally different with the sign flips (but still "correct", because we only care about the overall distribution--these are random numbers after all). The section around "Q1 / Q2" is the core of the article.
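To make the sign-flip issue concrete, here's a toy sketch in plain Python (not MASS's actual code; the 2x2 covariance [[2, 1], [1, 2]] and its eigenvectors are hard-coded so everything is visible):

```python
import math

# Toy 2x2 covariance [[2, 1], [1, 2]]: eigenvalues 3 and 1, with eigenvectors
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2). If u is an eigenvector, so is -u.
s = 1.0 / math.sqrt(2.0)

def sample(z, flip=1):
    # x = V * sqrt(Lambda) * z; `flip` negates the first eigenvector, which is
    # exactly the kind of choice two machines can silently make differently.
    a = math.sqrt(3.0) * z[0]   # component along the first eigenvector
    b = math.sqrt(1.0) * z[1]   # component along the second eigenvector
    return (flip * a * s + b * s, flip * a * s - b * s)

z = (0.3, -1.2)            # the same standard-normal draws on both "machines"
x1 = sample(z, flip=1)     # machine A picks  v_1
x2 = sample(z, flip=-1)    # machine B picks -v_1
# x1 != x2, yet both are valid draws from the same N(0, Sigma) distribution.
```

Same input randomness, different output vectors--and both machines are "right".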

There's a lot of other stuff here too: mvtnorm::rmvnorm() can also use an eigendecomposition to generate your numbers, but it does some extra work to eliminate the effect of the sign flips, so you don't see this reproducibility issue. mvtnorm::rmvnorm() also supports a second method (Cholesky decomposition) that is uniquely defined and avoids eigenvectors entirely, so it's more stable. And there's some stuff on condition numbers not really mattering for this problem--it turns out you can't describe all the possible floating-point problems a matrix could have with a single number.
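For contrast, here's a toy sketch of the Cholesky route (again not mvtnorm's actual code, just a hand-rolled 2x2 factorization): the lower-triangular factor with a positive diagonal is unique, so there's no sign choice for machines to disagree on.

```python
import math

def cholesky2(s11, s12, s22):
    """Unique lower-triangular L with positive diagonal such that
    L @ L.T = [[s11, s12], [s12, s22]] -- no sign ambiguity to flip."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return (l11, l21, l22)

def chol_sample(z, L):
    # x = L z turns iid standard normals z into draws with covariance Sigma.
    l11, l21, l22 = L
    return (l11 * z[0], l21 * z[0] + l22 * z[1])

L = cholesky2(2.0, 1.0, 2.0)       # the same [[2, 1], [1, 2]] covariance
x = chol_sample((0.3, -1.2), L)    # every machine maps z to the same x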


Thanks! So machine differences in their FP units drives the entropy, and the code doesn't handle that when picking eigenvectors?


It doesn't have to be their FP units; it could be that they run the operations in different orders, or that different floating-point modes were set. I don't think the blog post goes into detail as to why, but it does explain how this "cascades" into very different results coming out of the actual operation being performed.


Implementations of IEEE-754 can differ between machines, and the order of operations in a library/compiled function can also differ. It's not entropy; it's the nature of floating-point arithmetic.


Thanks. So sounds like the same machine should run the same calculations and get the same results each time, but differences may appear between machines.


The idea is that memory-only data systems like HyPer are able to make design decisions that make them significantly faster than disk-based systems (eg Postgres), even when the working set fits entirely within cache for the disk-based system. Umbra attempts to act like an in-memory DBMS when the working set fits in memory while degrading gracefully as the working set grows beyond memory. Agree the title doesn’t have enough detail to see this though.


Wonder if this has anything to do with the sliding window:

> Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don’t know enough about the internals to be sure. Not sure if this actually hurts search results in practice, seems like an ok trade off for help docs at least.


It's definitely a great trade-off to make for efficiency, but it makes it inherently unusable for most of Elasticsearch's use cases.

Looking at it from a practical example such as log search (almost everyone I know has used Kibana/Logstash/Elasticsearch at some point): you'd be able to search for things like tracingId/requestId, but adding more filters such as logLevel, requestType, or serviceName would be impossible.

It has its niche, but calling it an Elasticsearch alternative really is a stretch.


Also the ability to weight fields when fetching results to boost relevancy, which is needed for a lot of my use cases.


I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use-cases :(
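A hypothetical sketch of that score-based retention (names and API entirely made up--this has nothing to do with Sonic's or Redis's actual internals): keep a min-heap of (score, doc) per term, evicting the lowest-scored entry once the window is full.

```python
import heapq

class TopNIndex:
    """Toy index that keeps only the N highest-scoring doc ids per term,
    instead of the N most recently pushed ones."""

    def __init__(self, n):
        self.n = n
        self.heaps = {}  # term -> min-heap of (score, doc_id)

    def push(self, term, doc_id, score):
        heap = self.heaps.setdefault(term, [])
        if len(heap) < self.n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Full window: evict the current lowest-scoring doc.
            heapq.heapreplace(heap, (score, doc_id))

    def lookup(self, term):
        # Return retained docs, highest score first.
        return [d for _, d in sorted(self.heaps.get(term, []), reverse=True)]

idx = TopNIndex(2)
idx.push("error", "doc1", score=10)
idx.push("error", "doc2", score=50)
idx.push("error", "doc3", score=30)
# doc1 (lowest score) has been evicted; doc2 and doc3 survive.
```

The caveat from the parent still applies: this helps sort-by-popularity, but does nothing for intersections of common terms.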


> Joules-Thomson effect is what allows for refrigeration

I don't think this is true. The "simple" model of refrigeration taught in high school is just a Carnot cycle running backwards, and this can be modeled with an ideal gas. The author of the post covers this in the section on "the Thermodynamics 101 Answer" [1], where all you need to drop the temperature of a gas is to let it do work on the piston.

That's not to say that JT is not useful, just that we can explain a theoretical refrigerator without it.

[1] https://mattferraro.dev/posts/joule-thomson#the-thermodynami...


Yes, you can explain refrigeration with an ideal gas. But...

You need a non-ideal gas (attraction) to get temperature inversion. Then you just need a compressor and voila--look at P-T charts to find what temperature range you need. With an ideal-gas reverse-Carnot refrigerator, your refrigeration effect is bounded on the low-temperature side by the available low-temperature source.

So yes, you can refrigerate with an ideal gas, but it's not very helpful in warm areas or if you need to get something super cold.


Heads up: this weights all your scores towards 0. If you want to avoid this, an equally simple approach is to use (x+3)/(y+5) to weight towards 3/5, or any (x+a)/(y+b) to weight towards a/b. It turns out that this seemingly simple method has some (sorta) basis in mathematical rigor: you can model x and y as the successes and total attempts of a Bernoulli random variable, a and b as the parameters of a Beta prior distribution, and the final score as the mean of the updated posterior distribution: https://en.wikipedia.org/wiki/Beta_distribution#Bayesian_inference

(I first saw this covered in Murphy's Machine Learning: A Probabilistic Perspective, which I'd recommend if you're interested in this stuff)
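A tiny illustration of the smoothing (the default a=3, b=5 here is just an arbitrary prior of 3-out-of-5, not a recommendation):

```python
def smoothed_score(x, y, a=3, b=5):
    """Score x successes out of y attempts, pulled towards a/b instead of 0.
    Equivalently, the posterior mean under a Beta(a, b - a) prior."""
    return (x + a) / (y + b)

smoothed_score(0, 0)     # no data: falls back to the prior, 3/5 = 0.6
smoothed_score(1, 1)     # one success barely moves it: 4/6
smoothed_score(90, 100)  # lots of data dominates the prior: 93/105
```

Compare with the raw ratio: 1/1 = 1.0 would rank a single lucky success above 90/100, while the smoothed version does not.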


Free Fillable Forms is hard to use for taxes in general, but this page is specifically for people who want the CTC without having filed taxes (presumably you otherwise would have gotten the CTC when you filed taxes). You can see the screenshot says "non-filer sign up tool", whereas the main page of the website (for people who actually want to use it to file taxes) has no such warning: https://www.freefilefillableforms.com/#/fd


I doubt it? At least, the number of times the "last updated" column appears in SQL Server's stats [1] leads me to believe it collects stats asynchronously with updates to the table.

The only system I've heard of that relies on up-to-date statistics for correctness is Snowflake (long but interesting talk here [2]), where having accurate maxes/mins for each micro-partition is really helpful for cutting down the amount of data in the large range-scan queries common in BI. I'd guess that being a BI system, Snowflake can get away with higher row-update latency too.

[1] https://www.sqlshack.com/sql-server-statistics-and-how-to-pe...

[2] https://www.youtube.com/watch?v=CPWn1SZUZqE


I think I generally agree with the majority of the comments here that burn-in can serve a useful purpose (especially if you can't find a high probability density point to start from), but I also wonder: if burn-in vs no burn-in makes a large difference in your outcome, aren't you likely just not running your chain long enough? Sure, if you choose a bad starting point, your initial samples might not be representative of the overall distribution, but if a handful of non-representative points can massively impact your result, then I'm not sure how stable your result was to begin with (how do you know there isn't some other set of low-probability high-impact points that your sampler just missed through luck?). People tend to have a cognitive bias towards distributions looking pretty (eg not having random chains off to the side as in the article), but I'm not sure it makes a real difference.

That said, I do think burn-in is a pretty reasonable way to find a good starting point if you don't have existing knowledge about the distribution. From a practical standpoint, has anyone actually seen a massive difference between runs with/without burn-in? Kinda curious how often it really matters.


> Sure, if you choose a bad starting point, your initial samples might not be representative of the overall distribution, but if a handful of non-representative points can massively impact your result, then I'm not sure how stable your result was to begin with (how do you know there isn't some other set of low-probability high-impact points that your sampler just missed through luck?).

You're right, and most comments I've seen over the years on the post conveniently miss that he addresses that:

> This unbiasedness argument is rubbish. If you start at x and I start at x then your MCMC run is no better than mine. If you used burn-in and I didn't, then you are entitled to woof about approximate unbiasedness and I am not. But that woof does not make your estimator any better.

My interpretation has always been this, and I think it's correct: You need a good starting point. There's no reason to think burn-in gives you a good starting point. Instead, use something that's actually intended to give a good starting point, like the mode.


For difficult problems, the mode may (a) be as hard to find as doing an MCMC sampling run, or harder, and (b) be completely unrepresentative of the overall distribution.


I agree, but his argument is that in general doing a burn-in is still not going to be a substitute for good starting values, and if anything it's even easier to get a bad starting value using burn-in on a difficult problem.


If you have some nice idea of how to find a good starting value, then you should certainly use it, not just rely on burn-in.

But having used your good starting value, you should still discard some burn-in iterations. This is certainly true if you're running more than one chain, since including them all with this same starting value will bias the results (in a real, not just theoretical sense, though the magnitude of the bias will of course vary with your problem). Even if you're running just one chain, you should discard at least some burn-in (say 5%) even if you have no evidence that it is necessary, because you really don't know that your supposed good starting point is actually representative. (That is, you don't know this for difficult problems, which are the ones I'm discussing.)
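For anyone who wants to poke at this themselves, here's a minimal random-walk Metropolis sketch (toy code, standard normal target, nothing tuned) where you can watch a deliberately terrible starting point get absorbed by the burn-in:

```python
import math
import random

def metropolis(log_density, x0, n_samples, burn_in, step=1.0, seed=0):
    """Random-walk Metropolis: keep n_samples draws after discarding burn_in."""
    rng = random.Random(seed)
    x = x0
    logp = log_density(x)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        if i >= burn_in:
            samples.append(x)
    return samples

# Standard normal target, deliberately awful start far out in the tail.
samples = metropolis(lambda x: -0.5 * x * x, x0=50.0, n_samples=5000, burn_in=500)
```

With burn_in=0 the early samples around x=50 drag the estimated mean upward; discarding them (or just starting near the mode) removes that transient.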


I don't understand how the mode can be unrepresentative of the overall distribution. It seems like it's one of the finest representatives.


This can happen easily in Bayesian hierarchical models, where there is a hyperparameter that controls the variance of many lower-level parameters. When the variance is small, the probability density for these parameters is high (their distribution is sharply peaked); when the variance is large, the density is smaller (maybe many, many orders of magnitude smaller). So the mode will be where the variance is small, even if the data make this a much less probable region of the parameter space. (Note: the probability of a region is the product of its volume and its probability density--the total probability can be low even if the density is extremely high.)

You'll also typically get an unrepresentative mode for a neural network or other ML-type model, since the mode will be a highly-overfitted point.
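A one-dimensional toy version of the density-vs-mass point (a made-up mixture, not a real hierarchical model): put 1% of the mass in a very narrow spike and 99% in a broad bump, and the global mode lands on the spike even though almost all the probability is elsewhere.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def density(x):
    # 1% of the mass in a narrow spike at 0, 99% in a broad bump at 5.
    return 0.01 * normal_pdf(x, 0.0, 0.001) + 0.99 * normal_pdf(x, 5.0, 1.0)

density(0.0)  # ~3.99: the global mode sits on the spike...
density(5.0)  # ~0.39: ...even though 99% of the probability mass is near 5
```

A sampler started at the mode (x=0) would begin in a region containing only 1% of the posterior mass.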


If you're new to differential privacy and looking for an introduction, I highly recommend the Dwork and Roth book, especially the first three chapters: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

Frank McSherry also has some good resources if you enjoy his writing style: https://github.com/frankmcsherry/blog/blob/master/posts/2016...

In particular, I think it's important to keep in mind that differential privacy is as much focused on establishing a framework for measuring information leakage as on coming up with clever algorithms to preserve privacy (although there are a lot of clever algorithms). I think of it as more analogous to big-O notation (a way of measuring) than to dynamic programming (an implementation technique).


I found the 6-part series at https://desfontain.es/privacy/differential-privacy-awesomene... a very good introduction as well.


I think the primary focus of differential privacy is that "the spice must flow". They need to keep collecting and using this data.


I guess. It's more like: figure out how to measure how much spice is flowing. The resulting knowledge will be a new tool: powerful and morally indifferent, as all tools. You choose how to use it.


Personally I'm not super bullish on differential privacy outside a couple specific use cases, but correlation attacks and cross referencing against external data are exactly the vectors that differential privacy is intended to protect against: it requires that the results of any query or set of queries would be identical with some probability even if a specific person wasn't present in the dataset.

It's possible I'm misreading, but your paper seems to focus on the very anonymization techniques diff privacy was invented to improve on, specifically because these kinds of attacks exist. While I agree it's no silver bullet, the reason is because it's too strong (it's hard to get useful results while providing such powerful guarantees) rather than not strong enough.

I've found the introduction to this textbook on it to be useful and very approachable if others are interested: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
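For a flavor of what the guarantee looks like in code, here's the textbook Laplace mechanism sketched in plain Python (inverse-CDF sampling rather than a library call, so the noise is easy to inspect; a counting query has sensitivity 1, and adding Laplace(1/epsilon) noise makes releasing it epsilon-differentially private):

```python
import math

def laplace_inv_cdf(u, scale):
    # Inverse CDF of the Laplace(0, scale) distribution, for u in (0, 1).
    # Feed it a uniform random u to draw Laplace noise.
    return -scale * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))

def private_count(true_count, epsilon, u):
    """Laplace mechanism for a count query (sensitivity 1): the released
    value is epsilon-differentially private."""
    return true_count + laplace_inv_cdf(u, 1.0 / epsilon)

# Smaller epsilon = stronger privacy = larger noise scale 1/epsilon.
noisy = private_count(100, epsilon=0.1, u=0.75)
```

In practice u would come from a cryptographically sound RNG; it's a parameter here only so the sketch is deterministic.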

