> To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.
First off, I think that's wrong. The whole point is, after all, to keep the information that yields the smallest error relative to the original along the dimensions one cares about. Within what the model emphasizes, a reconstruction can be not only "remotely resembling the original dataset" but as close to the original dataset as the capacity of the representation allows.
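To make that concrete, here's a minimal toy sketch of what I mean (my own illustration using scikit-learn's PCA, not the method from the post): the reconstruction error concentrates on the dimensions the representation doesn't care about, while the emphasized dimensions come back nearly intact.

```python
# Toy illustration: PCA keeps the directions with the most variance, so the
# reconstruction is accurate exactly where the representation "cares".
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))      # stand-in for a dataset
X[:, :5] *= 10                       # a handful of dimensions carry most of the signal

pca = PCA(n_components=5).fit(X)
X_compact = pca.transform(X)         # the reduced representation
X_restored = pca.inverse_transform(X_compact)

per_dim_mse = ((X - X_restored) ** 2).mean(axis=0)
print("emphasized dims:", per_dim_mse[:5].mean())   # small relative to their variance (~100)
print("discarded dims: ", per_dim_mse[5:].mean())   # roughly their full variance (~1)
```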
Next, I'm really not talking only about the particular method described in the post. It's entirely possible to choose a reduction light enough to preserve the aspects of the information one is interested in, and to optimize for recall rather than generalization.
A more realistic scenario is that some information about the affected individuals is still exposed or retained (perhaps in a compact derived form), which in many cases makes it quite feasible to restore information accurately enough that claims to have removed the data are effectively deceptive.
Even in cases where the models are created in good faith only to "distill some insights", I'm skeptical that they really are useless for recovering individual information. I'm by no means an expert in differential privacy, but I do pay attention when it comes up, and much of what comes out of that field amounts to a trade-off between keeping the data useful and how many pieces of additional information (or assumptions and brute force) are needed to break the privacy protections. The surprises tend to fall on the side of 'Oops, turns out this clever trick can recover the originals more easily than we thought.'
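As a sketch of the sort of "additional information" attack I mean (everything below is invented for the example; the column names and values are hypothetical): if a couple of quasi-identifiers survive in the released, derived data, a plain join against an auxiliary dataset re-attaches identities.

```python
# Hypothetical linkage example: a "derived" release plus a public auxiliary
# dataset is enough to re-identify individuals via surviving quasi-identifiers.
import pandas as pd

released = pd.DataFrame({            # supposedly de-identified, "distilled" output
    "zip": ["90210", "10001"],
    "birth_year": [1980, 1975],
    "risk_score": [0.91, 0.12],
})
auxiliary = pd.DataFrame({           # e.g. a public register the attacker already holds
    "zip": ["90210", "10001"],
    "birth_year": [1980, 1975],
    "name": ["Alice", "Bob"],
})

# A plain join on the surviving attributes is all it takes.
print(released.merge(auxiliary, on=["zip", "birth_year"]))
```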
> It's honestly kind of disingenuous to describe dimensionality reduction in the way that they do here. It is like reducing the resolution of a photo, but it'd best be described as reducing that resolution to say, the 20 most representative pixels. There's no real sense in which the photo still exists.
In my honest opinion the original analogy does an excellent job of intuitively explaining that most of the informative aspects of the data are kept (we can still see just fine what's in the image) while irrelevant details are discarded, and that is probably what was intended.
If anything comes off as disingenuous in that context, it's your characterization of it as a drastic reduction in the pixel domain (where it would indeed destroy a lot of the information). What can be done is much more like running the picture through a high-performance ImageNet classifier, keeping the 20 (or 2048, or however many are needed) most informative values at a level that corresponds closely to the semantic content of the picture, and holding on to the model.
With that, we could probably generate images that people would have a hard time distinguishing from the original.
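Roughly what I have in mind, as a sketch (assumes torchvision is available; "photo.jpg" is just a stand-in path): keep the 2048-dimensional penultimate activations of a pretrained ImageNet classifier rather than 20 raw pixels. The vector is tiny next to the pixel data but tracks the semantic content closely.

```python
# Sketch: extract the 2048-dim penultimate features of a pretrained ResNet-50.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()        # drop the classification head, keep the embedding
model.eval()

preprocess = weights.transforms()     # the matching ImageNet preprocessing
img = Image.open("photo.jpg").convert("RGB")   # stand-in image path

with torch.no_grad():
    embedding = model(preprocess(img).unsqueeze(0))
print(embedding.shape)                # torch.Size([1, 2048])
```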
You're making lots of arguments by analogy here, and they're all just not correct. I'm not sure how better to explain it. Yes, it is theoretically possible to do a style of dimensionality reduction that would not destroy very much information. But nobody uses models like that to make predictions. The models people actually use to make predictions destroy enormous quantities of information, and reduce dimensionality in the extreme. It is not like compressing a JPEG. It is like looking at a photograph of a person and remembering that someone with brown hair was in it.