> when strictly speaking the raw data has indeed been deleted after being used to create a derivative work that can for all important purposes be used to recreate the original?
To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.
What they have done is distill some insights about people from this data. It's arguable whether they should be allowed to keep those insights, but there's no privacy risk there really.
It's honestly kind of disingenuous to describe dimensionality reduction the way they do here. It is like reducing the resolution of a photo, but it would be better described as reducing the resolution to, say, the 20 most representative pixels. There's no real sense in which the photo still exists.
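To put a rough number on how much gets thrown away, here's a minimal sketch with made-up data and scikit-learn's PCA (nothing to do with their actual pipeline, just the general shape of the operation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for a profile dataset: 2,000 "people",
# each described by 1,000 features (likes, posts, etc.).
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 1_000))

# Keep only 20 components: the "20 most representative pixels".
pca = PCA(n_components=20).fit(X)
X_back = pca.inverse_transform(pca.transform(X))

# Only a small fraction of the variance survives; the rest is gone for good.
print(pca.explained_variance_ratio_.sum())
print(np.mean((X - X_back) ** 2))
```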
That's only accurate in the sense that because an LSTM's hidden layer is much smaller in dimension than the data on which it is trained, there is less information in it.
However, it concisely represents a manifold in a much larger dimensional space and effectively captures most of the information in it.
It may be (and is) lossy, but don't underestimate the expressive power of a deep neural network.
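For a sense of the scale mismatch being described, a tiny PyTorch sketch with purely illustrative sizes:

```python
import torch

# Illustrative sizes only: each timestep is a 1,000-dimensional feature vector,
# but the recurrent state that summarizes the whole sequence is 64 numbers.
lstm = torch.nn.LSTM(input_size=1000, hidden_size=64, batch_first=True)

x = torch.randn(1, 500, 1000)       # one "person": 500 timesteps x 1,000 features
_, (h_n, _) = lstm(x)

print(x.numel())                    # 500,000 input values
print(h_n.numel())                  # 64 values summarizing them
```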
You're throwing out buzzwords instead of addressing the response.
It's dimensionality reduction. You cannot recover the original object. It's like using a shadow to reconstruct the face of the person casting the shadow.
Note this has nothing to do with the expressive power of a deep neural network. You are by definition trying to throw away noisy aspects of the data and generalize a lower dimensional manifold from a high dimensional space. If it's not lossy, it won't generalize.
You're right that it's really just a form of dimensionality reduction. My point was just that it's a more powerful form of dimensionality reduction than PCA or NMDS.
[Edit: and that the salient characteristics are likely contained in the model.]
Precisely because it's more powerful, it doesn't encode the identifying information of the original data. Something like PCA likely would retain identifying characteristics (depending on how many of the components you drop).
Setting aside the fact that they have identities for all of the people whose data they acquired: yes, it would be harder to reconstruct individual people from it than from PCA, because PCA's output maps more directly back to interpretable data.
They claim to have deleted that data. If they haven't deleted the data, then of course it's still an invasion of privacy. But the ML model really has nothing to do with it.
I think the ML model has a lot to do with it in this case.
One of the arguments I expect to see is that "Oh, no! We removed all the data. It's gone. I mean, that was only a few hundred megabytes per person anyway, but we just calculate a few thousand numbers from it and save in our system, then delete the data. That's less data per person than is needed to show a short cute cat GIF. What harm could we possibly do with that?"
My point isn't that there is no harm here in them storing this model. It's also not that the data in their model is worthless. It's specifically that the way this article is talking about the issue is incorrect. The analogy they use would lead you to draw false conclusions about what's going on, and how to understand it.
There is a real issue here of whether or not they should be allowed to keep a model trained from ill-gotten data. But the way I would think about it is: If you steal a million dollars and invest it in the stock market, and make a 10% return, what happens to that 10% return if you then return the original million? That's a much better analogy for what's going on here. They stole an asset, and made something from it, and it's unclear who owns that thing or what to do with it.
The ML model might know more about me than I’m willing to admit about myself. I only find some — but not much — comfort in the proposition that it can’t conjure my PII.
Regardless, I still think having the most relevant features already extracted is all they need to ask many of the questions they might want to. The point is that that’s still quite bad.
I don't think you understand what dimensionality reduction means.
SSN is a lookup key into the raw data. Dimensionality reduction is by definition lossy since it's used in scenarios where:
n (rows of data) << m (number of features)
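A quick illustration of why that regime is necessarily lossy (toy numbers): with n rows you get at most n informative directions, no matter how many features you start with.

```python
import numpy as np

# Toy n << m setting: 300 rows, 10,000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10_000))

# The matrix has at most min(n, m) = 300 nonzero singular values, so any
# reduced representation keeping fewer than its rank discards information.
s = np.linalg.svd(X, compute_uv=False)
print(s.shape)                      # (300,)
```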
Only if the "true" data actually lives on a lower-dimensional manifold and the data can accurately encode it with low noise. I doubt anyone can tell who you will vote for based on which cat videos you liked, no matter how magic your regressor.
I do think that the most significant components of a personality will likely be targetable with a relatively low nuclear norm.
And, for example, where someone falls on the exploration/exploitation spectrum, if you will (i.e., how strongly they respond to fear-based messaging), is probably quite predictable from a spectrum of likes.
Cat pictures may be less informative, but not all of these people clicked exclusively on feline fuzzy photos.
I'm not an expert on personality so won't disagree (except to say that I am a little sceptical of a static personality profile actually existing and I think people who always vote a certain way would be the easiest to regress and also the most useless to target).
As I said in another post, it really depends on what part of your privacy you are trying to protect. It is also a mistake to think of anything on the internet as a private forum.
I think it is falsifiable. More precisely, the claim is that the most significant features for psychoanalytic purposes will be contained in the model after training even with low nuclear norm.
It’s possible for the most salient features for this purpose to not be in the model.
It was unclear the way I first said it.
I think a better analogy might be "If I take a few hundred thousand MP3s and come up with a clever way to reduce each to a short representation of its genre, mood, tempo, etc. that can be used to identify similar music, then throw away all the original MP3s, am I still in possession of the original music?" The whole point is to turn the individual data into broad, general categorisations that are easier to handle because they contain much less information. Remember, they're using this for ad targeting, and the reason they're doing it is so they can target broad groups of people rather than having to manually go through and target ads at each individual one by one.
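For the curious, that reduction could look something like this sketch, assuming librosa as the feature extractor (the function and the fields it keeps are my own invention, not anyone's actual system):

```python
import librosa

def summarize(path):
    """Reduce one track to a few coarse descriptors, then forget the audio."""
    y, sr = librosa.load(path, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    timbre = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    return {"tempo": float(tempo), "timbre": timbre.tolist()}

# summaries = [summarize(p) for p in mp3_paths]   # keep these...
# ...then delete the MP3s. Nothing here lets you play the music back.
```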
I like that analogy. I'll make it more tenuous with: "I took a copy of your album collection without your permission, ripped it to MP3, and played the rips so much that everyone is sick of them. But you've still got all the original CDs you don't even use, so no problem, right?"
On this tangent, IP ownership for deep learning models is interesting - how do you prove (in court) that someone has or hasn't copied a model or stolen a training set? If you fed someone else's training data or model into your system, how easy is it to prove? Will we see the equivalent of a map's "trap streets" in trained CNN models?
Except Facebook let them take a copy of the album collection, albeit for a different use case, but it was allowed nonetheless. That doesn't absolve CA in any way, but should make us wary of people we willingly give our "album collections" - they will use them to make money, and what they allow people do with them can easily be things we don't agree with, but didn't have the imagination to think of when we signed the EULA.
Especially when the dataset probably isn't that high in entropy. Something like PCA can drop the dimensionality by significant amounts as long as the data has enough clear signals in it.
" cannot use this data to recreate anything remotely resembling the original dataset."
This by itself may be mostly true, and many of the comments get into ways of working with this dataset to make it better; I don't have experience with those methods. But
what I have not seen anyone mention: even if all you have is this dumbed-down dataset and the original is gone, you can still combine it with other data sets that are either public or previously created, and likely fine-tune from there;
dumbed-down set + public voter records + public arrest records + whatever previous records - sort, match, see what's left over;
and pretty much recreate what you needed from the original. Maybe not 100%, but I would guess you could get really close.
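For what it's worth, the "sort, match" step is trivial once you have anything join-able; a toy pandas sketch with entirely made-up names and columns:

```python
import pandas as pd

# Entirely fictional records, purely to show the mechanics of the join.
reduced = pd.DataFrame({                    # the "dumbed down" derived data
    "name": ["A. Smith", "B. Jones"],
    "zip_code": ["12345", "67890"],
    "fear_response_score": [0.91, 0.12],
})
voters = pd.DataFrame({                     # a public voter file
    "name": ["A. Smith", "B. Jones"],
    "zip_code": ["12345", "67890"],
    "party": ["Green", "Independent"],
})

# The "anonymous" derived scores are attached to identifiable people again.
print(reduced.merge(voters, on=["name", "zip_code"]))
```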
Or "better" (for some value of better), join the data to something identifiable (e.g., the public records you listed) before developing models for everything you want to retain, then discard the original data once your deep enough net has effectively auto-encoded whatever you wanted to retain in an identifiable manner.
So if I compress a BMP by 100:1 using "lossy" techniques then it's not equivalent to the original? I'd say that depends on how recognizable the result of reconstruction is, and not on the amount of reduction. MPAA would be very unhappy with your argument.
To be more extreme, there are many compression/decompression methods that can perfectly reconstruct the original data with very high compression ratios. GIF and PNG can reproduce many images exactly. Surely those copies are still derivative works?
This compression analogy is really a bad one. A much better analogy would be: If you read 10 novels by a particular author, and are now able to recognize his/her style, because you have a compressed representation in your mind of the way they write.
> It's arguable whether they should be allowed to keep those insights, but there's no privacy risk there really.
So if Google has distilled someone's emails over the years into "closeted homosexual with a deeply repressed leather fetish", that's not an invasion of their privacy as long as they throw away the source materials?
As long as they retain no data which could specifically identify the original person, yes. There is nothing wrong with building segmentation models as long as they aren't specific enough to identify a specific person.
My concern would be, how granular is too granular? What if we added "and live in zip code 12355 and is registered Green Party"? This now gets eerily specific, and might be sufficient to identify an individual.
Why would they ever discard that? Why would there be a granularity where ML suddenly stops working? Why would you even stop at one model per person, instead of one model per mood, or modes of thought at different stress points?
In fact they would desire that granularity most of all, so as to reconcile the past- and future-state psychographic profiles for an individual. Then they could attempt to isolate the cause of a state change; basically, they need to identify the moment an individual's profile reflects the change from Democrat to Republican or vice versa, or religious to atheist, etc.
Since the source data was deleted - according to current standards and policies - their hands are probably technically clean. But there may be another angle of attack.
In the US, you're not allowed to benefit directly from a crime you committed. For example, if you rob a bank, you can't buy your mother a car with the money and say "sorry, it's gone!" when the police come knocking.
With that line of reasoning and if there was a legal, privacy, or at least a TOS breach in collecting the data, the derivative machine learning models may be tainted also. Then again, it's likely impossible to prove exactly what data went into the model, so hard to establish which models might be tainted.
If they kept information like that, then yes that would be an invasion of privacy. But that sort of information is almost certainly not encoded in an ML model trained on 50 million people's data.
Let's say I take age and income of everyone in a city and train a regression model that predicts income from age. The model has slope and intercept that "encode" the information from all the people.
It would not be possible to make inferences about the income of any particular person from the slope and intercept, so it would be ok to share those values in, say, a journal article, even though disclosing income of a particular person would not be ok.
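In code, assuming scikit-learn and made-up numbers, that whole comment fits in a few lines:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: age and income for 100,000 people in a city.
rng = np.random.default_rng(0)
age = rng.uniform(18, 70, size=100_000)
income = 1_200 * age + 15_000 + rng.normal(0, 10_000, size=100_000)

model = LinearRegression().fit(age.reshape(-1, 1), income)

# Everything retained about 100,000 people is two numbers.
print(model.coef_[0], model.intercept_)
```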
I know what they trained on because it's been reported on. They got around 50 million people's FB profiles, and a smaller subset's (300k, I think) personality test results.
I use ML models every day in my work, and I understand how they function. It is true that individuals' information is probabilistically encoded into the parameters of the model. However, if the model is any good, the information of the people it was trained on is encoded only a bit more strongly than that of the entire population.
There is sort of a privacy issue in the following sense: The models they've built have learned relationships between preferences and personalities that they wouldn't otherwise have been able to learn. But these relationships are abstract. They are not tethered to any particular, identifiable individual.
A reasonable argument can be made that those learned relationships are, in a sense, stolen property. And I think arguments along those lines are interesting things that we'll have to explore as this sort of thing becomes more common. But the idea that this model invades individuals' privacy just isn't really true.
Is there a reason that people are only talking about the privacy angle?
People very much don't want these models to exist. They don't want a predictive model that can guess their affiliation just from unrelated activity breadcrumbs.
That's why I assumed this whole issue has exploded recently.
But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?
Edit: is it that the model is then applied to only strictly public data about the person? If so I guess the interesting question then becomes whether the model is definitely not anything near overfitting (i.e. containing enough information to match a person's public data directly since it was trained on it (amongst other data))? (I'm not an ML developer.)
Edit 2: also, going with your comparison with the "20 most representative pixels", it seems interesting then that 'this much' (although not exactly sure how much) information can be inferred from a public profile when just also knowing enough about the whole Facebook population. OK, so perhaps a human would be able to infer about as much, but doesn't scale, and that's why the model becomes valuable?
> But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?
I don't know exactly what they were modeling, but from the published reports, it sounds like they were trying to predict big 5 personality characteristics (conscientiousness, neuroticism, openness, extraversion, agreeableness) from FB profile data (e.g. likes, dislikes, bio, post content, etc.). So in that case, the model would contain weights that measure the strength of relationship between characteristics like "likes punk rock music" and "openness". That description really only literally applies to a linear model - but nonlinear models are, for these purposes, the same.
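A stripped-down sketch of that kind of model, with invented data and page names (linear, via scikit-learn, purely to show what the retained weights look like; the real models were presumably fancier):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Invented example: rows are people, columns are binary "liked page X" features.
pages = ["punk_rock", "hunting_club", "yoga_studio", "crime_dramas"]
rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(300_000, len(pages)))                 # profile side
openness = likes @ [0.8, -0.3, 0.5, 0.1] + rng.normal(0, 1, 300_000)   # survey side

model = Ridge().fit(likes, openness)

# What survives training: one weight per page, e.g. "likes punk rock" -> openness.
print(dict(zip(pages, model.coef_.round(2))))
```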
That's a ridiculous response. If they managed to infer this characteristic from emails, what they would keep is a tool which, given that set of emails again, would infer the same characteristics (and, theoretically, would do so for a similar set of emails). They would by no means be allowed to keep the kind of information you described.
What is more relevant is a model which, given characteristics such as "closeted homosexual with a deeply repressed leather fetish", they would be able to infer other characteristics, such as support of particular political candidates, responsiveness towards targeted political or commercial ad campaigns, etc. That's what's relevant here.
> To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.
First off, I think that's wrong. The idea is, after all, to keep the information that will result in the smallest error compared to the original on the dimensions one cares about. Within what the model emphasizes, a reconstruction can not only "remotely resemble the original dataset" but resemble it as closely as the capacity of the representation allows.
Next, I'm really not talking only about the particular method described in the post. It's definitely possible to choose to make a light enough reduction to preserve the aspects of the information one is interested in, and to optimize for recall rather than generalization.
A more realistic scenario is that some information about the affected individuals is still exposed or kept (maybe in a compact derived form), which would in many cases give excellent possibilities to restore information accurately enough that claims to have removed the data are effectively deceptive.
Even for cases where the models are in good faith created only to "distill some insights", I'm skeptical that they really are useless for recovering individual information. I'm by no means an expert in differential privacy, but I do listen when it comes up, and a lot of what we see from that field seems to come down to a trade-off between keeping the data useful and how many pieces of additional information (or assumptions and brute force) are needed to break the integrity protections. With surprises that tend to be on the side of "Oops. Turns out this clever trick can recover the originals more easily than we thought."
> It's honestly kind of disingenuous to describe dimensionality reduction in the way that they do here. It is like reducing the resolution of a photo, but it'd best be described as reducing that resolution to say, the 20 most representative pixels. There's no real sense in which the photo still exists.
In my honest opinion the original analogy does an excellent job of intuitively explaining that most of the informative aspects of the data are kept (we can still see just fine what's in the image) while irrelevant details are discarded, and that is probably what was intended.
If anything comes off as disingenuous in that context, it's your framing of it as a strong reduction in the pixel domain (where it does indeed destroy a lot of the information). What can be done is much more like running the picture through a high-performance ImageNet classifier, keeping the 20 (or 2048, or however many are needed) most informative values at a level that corresponds strongly to the semantic content of the picture, and holding on to the model.
We could probably generate images that people would have a hard time distinguishing from the original with that.
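Roughly what I mean, sketched with a pretrained torchvision ResNet-50 standing in for the classifier (assumes a recent torchvision; the file name is hypothetical):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained ImageNet classifier with its final layer removed, leaving the
# 2048-dimensional pooled feature vector as the "kept" representation.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("photo.jpg").convert("RGB")       # hypothetical input
with torch.no_grad():
    features = backbone(preprocess(img).unsqueeze(0))

print(features.shape)   # torch.Size([1, 2048]), semantic features rather than "20 pixels"
```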
You're making lots of arguments by analogy here, and they're all just not correct. I'm not sure how better to explain it. Yes, it is theoretically possible to do a style of dimensionality reduction that would not destroy very much information. But nobody uses models like that to make predictions. The models people actually use to make predictions destroy enormous quantities of information, and reduce dimensionality in the extreme. It is not like compressing a JPEG. It is like looking at a photograph of a person and remembering that someone with brown hair was in it.
The analogy does not work here and is misleading. You cannot do much, if anything, with the 20 most representative pixels (if there is such a thing), but you can infer highly valuable characteristics about the person. Yes, you cannot recreate the original data, but what you end up with is potentially much worse (more sensitive/private) than the original data.
Unless the data is completely random it's not crazy to say that the data can be reconstructed from a reduced version.
If you have a million points that largely fall along a line in 3-dimensional space and you project them into 2 dimensions, you can easily recover the lost dimension with losses proportional to the deviation from that line. And that loss may not even matter depending on the kinds of data and margins of error you're working with.
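A quick toy version of that, with synthetic points near a line in 3D and scikit-learn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A million points lying close to a single line in 3D, plus small noise.
t = rng.uniform(-10, 10, size=1_000_000)
direction = np.array([1.0, 2.0, 3.0]) / np.sqrt(14.0)
X = np.outer(t, direction) + rng.normal(0, 0.05, size=(1_000_000, 3))

# Project down to 2 dimensions, then reconstruct the third.
pca = PCA(n_components=2).fit(X)
X_back = pca.inverse_transform(pca.transform(X))

print(np.sqrt(np.mean((X - X_back) ** 2)))   # tiny: roughly the off-line noise
```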
This is actually a nice illustration of the central problem with this argument: the more personally identifiable a piece of information is, the less recoverable it'll be, and vice-versa. If all of the points of data are on some n-dimensional line, then obviously all of them can easily be recovered, but knowing all those things about a person doesn't actually tell you any more about them than knowing just one of those things. Conversely, if the points of data are very random then it'll only require a handful of points to uniquely identify a person and find the entry in the original data set with all their other information, but dimensionality reduction will have to throw that data away - you simply won't be able to recover that information from the model. (We actually know from the literature on de-anonymization that a lot of data falls into the second category.)
How many dimensions were they working with and how much variance and correlation was there in the features? What's the margin of error for the end product?