
That's only accurate in the sense that an LSTM's hidden layer is much smaller in dimension than the data it was trained on, so it necessarily contains less information.

However, it concisely represents a manifold embedded in a much higher-dimensional space, and it effectively captures most of the information in it.

It may be (and is) lossy, but don't underestimate the expressive power of a deep neural network.



You're throwing out buzzwords instead of addressing the response.

It's dimensionality reduction. You cannot recover the original object. It's like using a shadow to reconstruct the face of the person casting the shadow.

Note this has nothing to do with the expressive power of a deep neural network. You are by definition trying to throw away noisy aspects of the data and generalize a lower dimensional manifold from a high dimensional space. If it's not lossy, it won't generalize.
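
To make the shadow analogy concrete, here's a minimal sketch (toy data, plain numpy; none of this is from the article):

```python
# A lossy projection: 3-D points cast a 2-D "shadow"; depth is gone for good.
import numpy as np

rng = np.random.default_rng(0)
points_3d = rng.normal(size=(100, 3))   # the original high-dimensional data

# Project onto the xy-plane by dropping the third coordinate.
shadow_2d = points_3d[:, :2]

# Any "inverse" must guess the lost coordinate; zero is as good a guess as any.
reconstruction = np.column_stack([shadow_2d, np.zeros(len(shadow_2d))])

print("reconstruction MSE:", np.mean((points_3d - reconstruction) ** 2))
# Nonzero error no matter what you do: the discarded information is unrecoverable.
```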


You're right that it's really just a form of dimensionality reduction. My point was just that it's a more powerful form of dimensionality reduction than PCA or NMDS.

[Edit: and that the salient characteristics are likely contained in the model.]


Precisely because it's more powerful, it doesn't encode the identifying information of the original data. Something like PCA likely would retain identifying characteristics (depending on how many low-variance components you drop).
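
As a rough illustration of that dependence, on toy Gaussian data with scikit-learn's PCA (real data would behave differently, but the tradeoff is the same):

```python
# How much detail survives PCA depends on how many components you keep.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))   # stand-in for per-user feature rows

for k in (2, 10, 40):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    print(f"components kept: {k:2d}  reconstruction MSE: "
          f"{np.mean((X - X_hat) ** 2):.3f}")  # shrinks as k grows
```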


Setting aside the fact that they have identities for all of the people whose data they acquired: yes, it would be harder to reconstruct individual people from this than from PCA, because PCA's output maps directly back onto interpretable features of the data.


They claim to have deleted that data. If they haven't deleted the data, then of course it's still an invasion of privacy. But the ML model really has nothing to do with it.


I think the ML model has a lot to do with it in this case. One of the arguments I expect to see is that "Oh, no! We removed all the data. It's gone. I mean, that was only a few hundred megabytes per person anyway, but we just calculate a few thousand numbers from it and save in our system, then delete the data. That's less data per person than is needed to show a short cute cat GIF. What harm could we possibly do with that?"


My point isn't that there is no harm here in them storing this model. It's also not that the data in their model is worthless. It's specifically that the way this article is talking about the issue is incorrect. The analogy they use would lead you to draw false conclusions about what's going on, and how to understand it.

There is a real issue here of whether or not they should be allowed to keep a model trained from ill-gotten data. But the way I would think about it is: If you steal a million dollars and invest it in the stock market, and make a 10% return, what happens to that 10% return if you then return the original million? That's a much better analogy for what's going on here. They stole an asset, and made something from it, and it's unclear who owns that thing or what to do with it.


The ML model might know more about me than I’m willing to admit about myself. I only find some — but not much — comfort in the proposition that it can’t conjure my PII.


Is this basically a choice between .mp3 and .ogg, or png vs. jpg vs. gif?


It’s kind of comparable.

Regardless, I still think having the most relevant features already extracted is all they need to answer many of the questions they might want to ask. The point is that that's still quite bad.


Right, I was just trying to confirm an analogy. It seems like this stuff is essentially a lossy codec for traits.


If you can still run Java applets, this is a nice intro: http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html


"It's dimensionality reduction. You cannot recover the original object."

Makes me think of the Simulacrum[1]. "The map is not the territory."[2]

1. https://en.wikipedia.org/wiki/Simulacra_and_Simulation

2. https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation


Your SSN is "dimensionality reduction" over your data. It's still your private data. Same for your race, sexual orientation, hobbies, etc.


I don't think you understand what dimensionality reduction means.

An SSN is a lookup key into the raw data. Dimensionality reduction is by definition lossy, since it's used in scenarios where the number of rows n is much smaller than the number of features m (n <<< m).
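
A small sketch of that distinction (the SSN and like-vector below are made up): a key is a lossless pointer back to the full record, while a projection collapses many distinct inputs to the same output.

```python
# Lookup key vs. dimensionality reduction.
import numpy as np

records = {
    "123-45-6789": {"name": "A. Person", "likes": [1, 0, 1, 1, 0]},  # toy record
}

# A key retrieves the complete original record: nothing is lost.
print(records["123-45-6789"])

# A projection squashes the 5-dimensional like-vector down to one number.
likes = np.array([1, 0, 1, 1, 0])
direction = np.ones(5) / np.sqrt(5)   # an arbitrary 1-D projection axis
print(likes @ direction)              # many different vectors yield this score,
                                      # so the original cannot be recovered
```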


Only if the "true" data actually lives on a lower-dimensional manifold and the data can accurately encode it with low noise. I doubt anyone can tell who you will vote for based on which cat videos you liked, no matter how magic your regressor.


I do think that the most significant components of a personality will likely be targetable with a relatively low nuclear norm.

And, for example, where someone falls on the exploration/exploitation spectrum, if you will (i.e., how strongly they respond to fear-based messaging), is probably quite predictable from a spectrum of likes.

Cat pictures may be less informative, but not all of these people clicked exclusively on feline fuzzy photos.
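
For what it's worth, a sketch of the low-rank intuition behind that nuclear-norm claim, on synthetic data where a few latent traits really do drive the likes (real like-matrices are messier):

```python
# If a handful of hidden traits drive likes, the spectrum of the like-matrix
# is dominated by a correspondingly small number of singular values.
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, n_traits = 500, 200, 3

traits = rng.normal(size=(n_users, n_traits))         # hidden per-user traits
loadings = rng.normal(size=(n_traits, n_items))       # how items reflect traits
noise = rng.normal(scale=2.0, size=(n_users, n_items))
likes = (traits @ loadings + noise > 0).astype(float) # observed binary likes

s = np.linalg.svd(likes - likes.mean(axis=0), compute_uv=False)
print("top 5 singular values:", np.round(s[:5], 1))
print("spectral energy in top 3:", (s[:3] ** 2).sum() / (s ** 2).sum())
```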


I'm not an expert on personality, so I won't disagree (except to say that I am a little sceptical that a static personality profile actually exists, and I think people who always vote a certain way would be the easiest to regress and also the most useless to target). As I said in another post, it really depends on which part of your privacy you are trying to protect. It is also a mistake to think of anything on the internet as a private forum.


"the most significant components of a personality will likely be targetable with a relatively low nuclear norm."

Is this falsifiable? It reads like a tautology to me.


I think it is falsifiable. More precisely, the claim is that, even with a low nuclear norm, the features most significant for psychoanalytic purposes will be contained in the model after training. It's possible for the most salient features for this purpose not to end up in the model. It was unclear the way I first said it.


Alternatively, "If I take a FLAC you own, make a 320kbps MP3 from it, and store it on my laptop, am I still in possession of any IP belonging to you?"


I think a better analogy might be: "If I take a few hundred thousand MP3s, come up with a clever way to reduce each to a short representation of its genre, mood, tempo, etc. that can be used to identify similar music, then throw away all the original MP3s, am I still in possession of the original music?" The whole point is to turn the individual data into broad, general categorisations that are easier to handle because they contain much less information. Remember, they're using this for ad targeting, and the reason they're doing it is so they can target broad groups of people rather than having to manually target ads at each individual one by one.
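
A rough sketch of that workflow, with random vectors standing in for the extracted features (no real audio library involved): the compact summaries answer similarity queries just fine, but nothing in them can regenerate the audio.

```python
# Keep only a short per-track summary and use it for similarity search.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n_tracks = 10_000
features = rng.normal(size=(n_tracks, 8))  # pretend: [tempo, energy, mood, ...]

index = NearestNeighbors(n_neighbors=5).fit(features)

# "Find tracks like this one" works on the summaries alone...
_, similar = index.kneighbors(features[:1])
print("tracks similar to track 0:", similar[0])
# ...but the MP3s are gone, and 8 numbers per track can't bring them back.
```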


I like that analogy. I'll make it more tenuous: "I took a copy of your album collection without your permission, ripped the albums to MP3, and played them so much that everyone is sick of them. But you've still got all the original CDs you don't even use, so no problem, right?"

On this tangent, IP ownership for deep learning models is interesting: how do you prove (in court) that someone has or hasn't copied a model or stolen a training set? If you fed someone else's training set or model into your system, how easy is that to prove? Will we see the equivalent of map "trap streets" in trained CNN models?

Which led me to: https://medium.com/@dtunkelang/the-end-of-intellectual-prope...


Except Facebook let them take a copy of the album collection, albeit for a different use case; it was allowed nonetheless. That doesn't absolve CA in any way, but it should make us wary of the people we willingly give our "album collections" to: they will use them to make money, and what they allow others to do with them can easily be things we don't agree with but didn't have the imagination to think of when we signed the EULA.


Where do neural nets come into this? The Kosinski-Stillwell-Graepel paper talks about using the reduced-dimensionality data with logistic regression.
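
For reference, a minimal sketch of that kind of pipeline on synthetic data, using TruncatedSVD as the reduction step and a toy binary trait as the target (my choices, not the paper's exact setup):

```python
# Reduced-dimensionality likes + plain logistic regression; no neural net.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
likes = (rng.random((1000, 300)) < 0.1).astype(float)  # sparse binary likes
# Toy trait correlated with a subset of the likes.
trait = (likes[:, :30].sum(axis=1) + rng.normal(size=1000) > 3).astype(int)

components = TruncatedSVD(n_components=50, random_state=0).fit_transform(likes)
X_tr, X_te, y_tr, y_te = train_test_split(components, trait, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```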


Especially when the dataset probably isn't that high in entropy. Something like PCA can drop the dimensionality by significant amounts as long as the data has enough clear signals in it.
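
To put a number on that, a quick sketch with toy correlated data (real datasets vary): when a few strong signals drive the features, the leading principal components carry nearly all the variance.

```python
# A few strong latent signals => steep explained-variance curve.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
latent = rng.normal(size=(1000, 4))            # 4 underlying signals
mixing = rng.normal(size=(4, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 100))  # weak noise

pca = PCA().fit(X)
print("variance explained by first 4 components:",
      pca.explained_variance_ratio_[:4].sum())  # close to 1.0
```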



