Website data leaks pose greater risks than most people realize (seas.harvard.edu)
194 points by tonicb on Feb 5, 2020 | 31 comments



Most companies still don’t know what anonymization means and confuse anonymized with pseudonymized or masked data.

Part of the problem is that there are still no good criteria available to define anonymity. Concepts like differential privacy are a step in the right direction but they still provide room for error, and in many cases they are either too restrictive (transformed data is not useful anymore) or too lax (transformed data is useful but can be easily re-identified).


It's not that most of them don't know what anonymization is or are confused about it.

Society is a tapestry of bullshit, and low-level swindling is generally tolerated or quickly forgotten about. Thus, there's nothing to prod the unprincipled in charge to do the right thing. As long as something seems to be good (anonymized, in this case), and problems can be hidden behind the corporate veil long enough, the unwritten rule is to half-ass security solutions because, well, security is boring and there are other things to devote company time and resources to (things that will advance upper management).

Security measures, especially those that protect the users, don't make money. At best, they're insurance against the fallout that might occur when it's revealed that your company has been silently screwing people over. Like most human beings, businesses often put off serious consideration of the future in order to enjoy quick and immediate gain.

I wouldn't put it past most companies to screw up an approach like differential privacy. Not enough people actually care that much.


> Security measures, especially those that protect the users, don't make money.

This is why the government has to make regulations with teeth in this space (of course, the government could be the "unprincipled in charge" you referred to).


> of course, the government could be the "unprincipled in charge" you referred to

Not specifically, but I suppose I wouldn't say that politicians are more or less principled than corporate executives. I know some would argue otherwise, but I'm too black pilled at this point to have faith in any "public servant".

Nevertheless, government regulation is probably the way to actually address these issues. Government may lack competence or will, but at least it provides us some leverage, little as it may be.


And even the ones who do practice decent anonymization are generally contributing to the problem just by holding a lot of data.

Lots of companies are content to stop at "our data can't be linked back to a person's identity", which doesn't prevent building a uniquely identifying user profile (e.g. via browser fingerprinting, plus enough metadata to associate a user's computer and phone accounts). Even if they do better than that, it's typically "our data is not uniquely identifying in isolation", which still isn't enough. If your differential privacy model says that these four pieces of data have a specificity of 10,000 possible individuals, that's a good start. But if someone with an individual's PII and three of those keys comes looking, they can still narrow down information about the fourth value from your aggregates.
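
A toy illustration of that narrowing attack (the data and field names here are made up, purely to show the mechanics):

    # Published aggregate counts that are "not uniquely identifying" on their own.
    # Keys: (region, age_band, device, visited_pharmacy) -> number of users.
    aggregate = {
        ("north", "30-39", "ios",     True):  12,
        ("north", "30-39", "ios",     False):  3,
        ("north", "30-39", "android", True):  40,
        ("south", "30-39", "ios",     True):  25,
    }

    # The attacker already knows the target's region, age band and device
    # from PII plus browser/device fingerprinting.
    known = ("north", "30-39", "ios")

    matches = {k[3]: v for k, v in aggregate.items() if k[:3] == known}
    total = sum(matches.values())
    for value, count in matches.items():
        print(f"visited_pharmacy={value}: {count / total:.0%}")
    # -> True: 80%, False: 20% -- the "anonymous" aggregate leaks the fourth value.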

And even if no one screws up, what happens when someone queries a half dozen differential datasets for different subsets of a uniquely identifying key? It's something like the file-drawer problem, where one researcher hiding bad data is malicious, but a dozen studies failing to coordinate produces the same result innocently. If outright failures to anonymize become rarer, cross-dataset approaches become more rewarding.


As one step toward raising awareness about the differences, I really like this overview:

https://fpf.org/wp-content/uploads/2017/06/FPF_Visual-Guide-...


Having read about anonymization techniques, I have come to believe that the definitions of anonymity and pseudo-anonymity are well settled by now, but the criteria that contribute to the invariants for performing data transformations are not, so those criteria fail to guide the implementations of the transformations.

You keep data because data is economically valuable, but even when you care enough to implement techniques that depend on those invariants, you still fail to achieve something better, because of scale and because you don't want to refine the techniques. This also means that somehow, somebody may have a technique that, given enough pieces of data, can reverse your transformation.


Differential privacy provides a system that can allow the sharing of databases without allowing an external observer to determine if a particular individual was included.

If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.

Today this might seem far-fetched, but it could come to pass in the future, when people raised in this environment and able to understand the implications and technical aspects come to political power.

https://www.cis.upenn.edu/~aaroth/privacybook.html

https://en.wikipedia.org/wiki/Differential_privacy
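
For reference, the formal guarantee behind those links is the standard ε-differential-privacy inequality (sketched here, not quoted from either source): a randomized mechanism M is ε-differentially private if, for any two datasets D and D' differing in a single individual's record and any set S of outputs,

    % epsilon-differential privacy
    \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]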


Differential privacy provides a lot less protection than you would think (or want to believe). A few months ago I saw a talk by E. Kornaropoulos, about his paper "Attacks on Encrypted Databases Beyond the Uniform Query Distribution"[0].

The main take-away from the talk - and in fact all the talks I saw on the same day - was that while DP is touted as a silver bullet and the new hotness, in reality it cannot protect against the battery of information-theoretic attacks advertisers have been aware of for a couple of decades, and intelligence agencies must have been using for a lot longer. Hiding information is really hard. Cross-correlating data across different sets, even if each set in itself contains nothing but weak proxies, remains a powerful deanonymisation technique.

After all, if you have a huge pool of people and dozens or even hundreds of unique subgroups, the Venn-diagram-like intersection of just a handful will carve out a small and very specific population.

0: https://eprint.iacr.org/2019/441
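
A back-of-the-envelope version of that intersection effect (my numbers, not the paper's):

    # Each subgroup is a weak proxy covering ~10% of a 10-million-person pool,
    # yet intersecting a handful of them carves out almost a single person.
    pool = 10_000_000
    subgroup_fraction = 0.1

    remaining = pool
    for k in range(1, 8):
        remaining *= subgroup_fraction
        print(f"after intersecting {k} subgroups: ~{remaining:,.0f} candidates")
    # after 7 roughly independent subgroups: ~1 candidate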


The Australian government released "anonymised" healthcare data to researchers. Within months, a good chunk of it had been deanonymised, including celebrities and some politicians themselves.

There's a lot of privacy snakeoil out there and even large govt departments fall for it.

https://pursuit.unimelb.edu.au/articles/the-simple-process-o...


This has happened with NIH data in the US as well. There is a preprint available.


Personally I'm not super bullish on differential privacy outside a couple specific use cases, but correlation attacks and cross referencing against external data are exactly the vectors that differential privacy is intended to protect against: it requires that the results of any query or set of queries would be identical with some probability even if a specific person wasn't present in the dataset.

It's possible I'm misreading, but your paper seems to focus on the very anonymization techniques diff privacy was invented to improve on, specifically because these kinds of attacks exist. While I agree it's no silver bullet, the reason is because it's too strong (it's hard to get useful results while providing such powerful guarantees) rather than not strong enough.

I've found the introduction to this textbook on it to be useful and very approachable if others are interested: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf


We're building an analytics system that is based on differential privacy / randomization of data. It's possible, but there are many limitations and caveats, at least if you really care about privacy and don't just apply differential privacy as a PR move. Most systems that implement differential privacy use it for simple aggregation queries, for which it works well. It doesn't work well for more complex queries or high-dimensional data though, at least not if you choose a reasonably secure epsilon: either the data will not be useful anymore, or the individual that the data belongs to won't be reliably protected from identification or inference.

After spending three years working on privacy technologies, I'm convinced that anonymization of high-dimensional datasets (say more than 1000 bits of information entropy per individual) is simply not possible for information-theoretic reasons; the best we can do for such data is pseudonymization or deletion.
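
A rough sketch of why dimensionality is the problem, using plain sequential composition (the budget and attribute count are my assumptions, not a description of the parent's system):

    import math

    # Under basic sequential composition, d independently released attributes
    # must share the total privacy budget eps_total.
    eps_total = 1.0   # a "reasonably secure" overall epsilon
    d = 1000          # roughly 1000 bits of entropy per individual

    eps_per_attr = eps_total / d
    # Randomized response at that epsilon: report the true bit with probability p,
    # otherwise report its negation, where p = e^eps / (1 + e^eps).
    p = math.exp(eps_per_attr) / (1 + math.exp(eps_per_attr))
    print(f"per-attribute epsilon: {eps_per_attr}")            # 0.001
    print(f"probability a reported bit is truthful: {p:.5f}")  # ~0.50025
    # Each released bit is essentially a fair coin flip: private, but useless.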


Database sharing has been (in theory) illegal inside the government for decades for this very reason. Why would private companies be allowed to do it?


You posted your reply while I was writing my own. Do you happen to have pointers to any really good research results and/or papers?

I want to be better equipped to respond to this slowly emerging "DP is a silver bullet" meme and your response implies that you'd have actual research to back the position up.


We don't publish research papers, but here's a good article from another privacy-tech startup (not ours) that discusses some of the shortcomings of differential privacy:

https://medium.com/@francis_49362/dear-differential-privacy-...

Here's my simple take: Imagine you want to protect individuals by using differential privacy when collecting their data, and that you want to publish datapoints that each contain only 1 bit of information (i.e. each datapoint says "this individual is a member of this group"). To protect the individual, you introduce strong randomization: in 90 % of cases you return a random value (0 or 1 with 50 % probability), and only in 10 % of cases you return the true value. This is differentially private, and for a single datapoint it protects the individual very well, because he/she has very good plausible deniability. If you want a physical analogy, you can think of this as adding a 5 Volt signal on top of a 95 Volt noise background: for a single individual, no meaningful information can be extracted from such data, but if you combine the data of many individuals you can average out the noise and gain some real information.

However, averaging out the noise also works if you can combine multiple datapoints from the same individual, provided those datapoints describe the same information or are strongly correlated. An adversary who knows the values of some of the datapoints as context information can therefore infer whether an individual is in the dataset (which might already be a breach of privacy). If the adversary knows which datapoints represent the same information or are correlated, he can also infer the value of some attributes of the individual (e.g. learn whether the individual is part of a given group). How many datapoints an adversary needs for such an attack varies based on the nature of the data.
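
A minimal sketch of that one-bit scheme (the 90 % random / 10 % truthful split described above; the function name is mine):

    import random

    def randomize_bit(true_bit: int, p_truth: float = 0.1) -> int:
        """Report the true bit with probability p_truth, otherwise a fair coin flip."""
        if random.random() < p_truth:
            return true_bit
        return random.randint(0, 1)

    # For a true bit of 1: P(report 1) = 0.1 * 1 + 0.9 * 0.5 = 0.55
    reports = [randomize_bit(1) for _ in range(100_000)]
    print(sum(reports) / len(reports))  # ~0.55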

Example: Let's assume you randomize a bit by only publishing the real value in 10 % of cases and publishing a random (50/50) value otherwise. If the true value of the bit is 1, the probability of publishing a 1 is 55 %. This is a small difference, but if you publish this value 100 times (say you publish the data once per day for each individual), the standard deviation of the averaged value of the randomized bits is just under 5 %, so an adversary who observes the individual randomized bits can already infer the true value of the bit with high probability.

You can defend against this by increasing the randomization (a value of 99 % would require 10,000 bits for the standard deviation to equal the probability difference), but this of course reduces the utility of the data for you as well. You can also use techniques like "sticky noise" (i.e. always producing the same noise value for a given individual and bit), though in that case the anonymity depends on the secrecy of the seed information for generating that noise. Or you can try to avoid publishing the same information multiple times; this can be surprisingly difficult though, as individual bits tend to be highly correlated in many analytics use cases (e.g. due to repeating patterns or behaviors).
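
And a quick self-contained simulation of that repeated-publication attack (trial counts are mine; same 10 % truth-telling mechanism as in the sketch above):

    import random

    def randomize_bit(true_bit: int, p_truth: float = 0.1) -> int:
        # Truthful 10 % of the time, a fair coin flip otherwise.
        return true_bit if random.random() < p_truth else random.randint(0, 1)

    def infer_bit(observations: list) -> int:
        # The mean is ~0.55 for a true 1 and ~0.45 for a true 0, so threshold at 0.5.
        return 1 if sum(observations) / len(observations) > 0.5 else 0

    trials, publications = 10_000, 100
    correct = 0
    for _ in range(trials):
        true_bit = random.randint(0, 1)
        published = [randomize_bit(true_bit) for _ in range(publications)]
        correct += infer_bit(published) == true_bit

    print(correct / trials)  # ~0.84 -- far better than the 0.5 of random guessing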

That said, differential privacy & randomization are still much more secure than other naive anonymization techniques like pure aggregation using k-anonymity.

We have a simple Jupyter notebook that shows how randomization works for the one-bit example btw:

https://github.com/KIProtect/data-privacy-for-data-scientist...


Thank you. That was a good read, and gave a few things to think about.


It's not far-fetched. Differential privacy is going to be used for the US census this year. Here's a report on it: https://arxiv.org/abs/1809.02201

Also, it's not a magical solution. Here's one of the issues from the linked paper (edited for clarity):

"The proponents of differential privacy have always maintained that the setting of the [trade-off between privacy loss (ε) and accuracy] is a policy question, not a technical one. [...] To date, the Census committee has set the values of ε far higher than those envisioned by the creators of differential privacy. (In their contemporaneous writings, differential privacy’s creators clearly imply that they expected values of ε that were “much less than one.”)


“In addition, the historical reasons for having invariants may no longer be consistent with the Census Bureau’s confidentiality mandate.”

US census, RIP


> If companies were required to aggregate information in this way and throw away their logs, perhaps leaks would be much less risky for their users.

One of the leaks they talk about was from Experian, a credit reporting agency. Not only would this approach work poorly for them, it wouldn't be legal (they need to be able to back up any claims they make about people, which requires going back to the source data).


I've considered how I would like e.g. GPS / driving apps to anonymize data.

For freeways: lots of small segments, and fuzzing of timestamps to co-mingle users. Where there's a stoplight, snap the intersection crossing time to the (guessed) green light for anyone in the queue.

The anonymity would come from breaking up both requests and observed telemetry to fragments too small to tie back to a single user or session (and thus form a pattern; I hope).

Do NOT record end-times, only an intended route. Do NOT associate that movement with any particular user or persistent session (ideally in memory on the mobile device only, not saved, though it could save favorite routes locally). Packages of transition times between various freeway exits would generally help add to anonymity.

That would also be part of generally improving the UI for the user. The application on the device should be making most of the decisions, by asking about the traffic in a given region on a grid. I also want it to show me (the driver) the data (heatmap) on the rejected routes so I know what isn't a good option.
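
A rough sketch of the fragment-plus-fuzzed-timestamp idea (segment names, bucket size, and structure are all my own assumptions, not an existing app's scheme):

    import random

    TIME_BUCKET_S = 300  # snap segment-crossing times into 5-minute buckets

    def anonymized_fragments(segment_ids, crossing_times):
        """Break one trip into per-segment fragments with coarsened times and
        no user/session identifier, so fragments from many drivers co-mingle."""
        fragments = [
            {"segment": seg, "time": int(t // TIME_BUCKET_S) * TIME_BUCKET_S}
            for seg, t in zip(segment_ids, crossing_times)
        ]
        random.shuffle(fragments)  # don't even preserve the order of the trip
        return fragments

    # Example: a trip across three freeway segments between exits.
    print(anonymized_fragments(
        ["I90-exit12-13", "I90-exit13-14", "I90-exit14-15"],
        [1580900000, 1580900140, 1580900310],
    ))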


Largely true, but there are HHS rules and guidelines that are accepted in the US healthcare space:

https://www.hhs.gov/hipaa/for-professionals/privacy/special-...


HIPAA data is not immune to a data leak... not even the organization that wrote those guidelines is immune:

https://www.deccanchronicle.com/technology/in-other-news/201...

There's tons of PHI on the internet: your local hospital's online medical chart, your insurance company's bill-pay, etc.


The title refers to claims by marketing companies that they have appropriately anonymised the data, and is not an attack on the concept of anonymisation itself.


What does "computer science concentrator" or "statistics concentrator" mean? It's a first time I see such a title (?)


Harvard calls their fields of study "concentrations", not majors [0]. Thus, a CS concentrator is an undergraduate student who is majoring in CS.

[0]: https://en.wikipedia.org/wiki/Academic_major


Students have found that data enrichment techniques exist and can be effectively applied to breach datasets. Good for them.


Yeah, I was a bit surprised when I read that this was a project for a first-year course, Privacy and Technology (CS 105). I don't see it being reported anywhere other than Harvard's own website.


I think this should be sent to the government officials that they were able to find in their research; it might get them to wake up and stop treating it so lightly.


Relevant XKCD: https://xkcd.com/792/


Is it just data leaks? How about Google's reports on how busy a certain area is (restaurants, malls)? That is pretty much telling a potential terrorist the optimal time to target an area. We leak data everywhere, and all we need is a single bad actor to utilize it for a catastrophe to occur.



