> To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.
First off, I think that's wrong. The whole point is, after all, to keep the information that yields the smallest error relative to the original along the dimensions one cares about. Within what the model emphasizes, a reconstruction can be not only "remotely resembling the original dataset" but as close to the original dataset as the capacity of the representation allows.
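To make that concrete, here's a minimal toy sketch of what I mean (my own illustration using scikit-learn's PCA, not the method from the post): the reconstruction error concentrates on the dimensions the representation doesn't care about, while the emphasized dimensions come back nearly intact.

```python
# Toy illustration: PCA keeps the directions with the most variance, so the
# reconstruction is accurate exactly where the representation "cares".
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))      # stand-in for a dataset
X[:, :5] *= 10                       # a handful of dimensions carry most of the signal

pca = PCA(n_components=5).fit(X)
X_compact = pca.transform(X)         # the reduced representation
X_restored = pca.inverse_transform(X_compact)

per_dim_mse = ((X - X_restored) ** 2).mean(axis=0)
print("emphasized dims:", per_dim_mse[:5].mean())   # small relative to their variance (~100)
print("discarded dims: ", per_dim_mse[5:].mean())   # roughly their full variance (~1)
```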
Next, I'm really not talking only about the particular method described in the post. It's entirely possible to choose a reduction light enough to preserve the aspects of the information one is interested in, and to optimize for recall rather than generalization.
A more realistic scenario is that some information about the affected individuals is still exposed or retained (perhaps in a compact derived form), which in many cases makes it quite feasible to restore information accurately enough that claims to have removed the data are effectively deceptive.
Even in cases where the models are created in good faith only to "distill some insights", I'm skeptical that they really are useless for recovering individual information. I'm by no means an expert in differential privacy, but I do pay attention when it comes up, and much of what comes out of that field amounts to a trade-off between keeping the data useful and how many pieces of additional information (or assumptions and brute force) are needed to break the privacy protections. The surprises tend to fall on the side of 'Oops, turns out this clever trick can recover the originals more easily than we thought.'
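As a sketch of the sort of "additional information" attack I mean (everything below is invented for the example; the column names and values are hypothetical): if a couple of quasi-identifiers survive in the released, derived data, a plain join against an auxiliary dataset re-attaches identities.

```python
# Hypothetical linkage example: a "derived" release plus a public auxiliary
# dataset is enough to re-identify individuals via surviving quasi-identifiers.
import pandas as pd

released = pd.DataFrame({            # supposedly de-identified, "distilled" output
    "zip": ["90210", "10001"],
    "birth_year": [1980, 1975],
    "risk_score": [0.91, 0.12],
})
auxiliary = pd.DataFrame({           # e.g. a public register the attacker already holds
    "zip": ["90210", "10001"],
    "birth_year": [1980, 1975],
    "name": ["Alice", "Bob"],
})

# A plain join on the surviving attributes is all it takes.
print(released.merge(auxiliary, on=["zip", "birth_year"]))
```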
> It's honestly kind of disingenuous to describe dimensionality reduction in the way that they do here. It is like reducing the resolution of a photo, but it'd best be described as reducing that resolution to say, the 20 most representative pixels. There's no real sense in which the photo still exists.
In my honest opinion the original analogy does an excellent job of intuitively explaining that most of the informative aspects of the data are kept (we can still see just fine what's in the image) while irrelevant details are discarded, and that is probably what was intended.
If anything comes off as disingenuous in that context, it's your characterization of it as a drastic reduction in the pixel domain (where it would indeed destroy a lot of the information). What can be done is much more like running the picture through a high-performance ImageNet classifier, keeping the 20 (or 2048, or however many are needed) most informative values at a level that corresponds closely to the semantic content of the picture, and holding on to the model.
With that, we could probably generate images that people would have a hard time distinguishing from the original.
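Roughly what I have in mind, as a sketch (assumes torchvision is available; "photo.jpg" is just a stand-in path): keep the 2048-dimensional penultimate activations of a pretrained ImageNet classifier rather than 20 raw pixels. The vector is tiny next to the pixel data but tracks the semantic content closely.

```python
# Sketch: extract the 2048-dim penultimate features of a pretrained ResNet-50.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()        # drop the classification head, keep the embedding
model.eval()

preprocess = weights.transforms()     # the matching ImageNet preprocessing
img = Image.open("photo.jpg").convert("RGB")   # stand-in image path

with torch.no_grad():
    embedding = model(preprocess(img).unsqueeze(0))
print(embedding.shape)                # torch.Size([1, 2048])
```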
You're making lots of arguments by analogy here, and they're all just not correct. I'm not sure how better to explain it. Yes, it is theoretically possible to do a style of dimensionality reduction that would not destroy very much information. But nobody uses models like that to make predictions. The models people actually use to make predictions destroy enormous quantities of information, and reduce dimensionality in the extreme. It is not like compressing a JPEG. It is like looking at a photograph of a person and remembering that someone with brown hair was in it.