
Unless I'm missing something, this seems like an incredibly long-winded way to check the user's IP location?

For example, connecting to a VPN and checking https://cloudflare.com/cdn-cgi/trace gives me `colo:CPH` (Copenhagen) which is far from my nearest CF datacenter (geographically), closer to the IP location from my VPN provider (Oslo) but still not particularly close?
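For anyone who wants to repeat the check, a minimal Python sketch follows. It assumes the trace endpoint returns plain-text `key=value` lines, one per line; `parse_trace` and `fetch_colo` are illustrative names of my own, not anything from the write-up.

```python
from typing import Dict, Optional
from urllib.request import urlopen

TRACE_URL = "https://cloudflare.com/cdn-cgi/trace"


def parse_trace(body: str) -> Dict[str, str]:
    """Parse the trace body into a dict (assumes key=value lines)."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip anything that is not a key=value line
            fields[key.strip()] = value.strip()
    return fields


def fetch_colo(url: str = TRACE_URL, timeout: float = 10.0) -> Optional[str]:
    """Fetch the trace page and return the reported colo code, if present."""
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return parse_trace(body).get("colo")
```

Calling `fetch_colo()` with and without the VPN connected should show whether the reported colo actually tracks your exit location.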

If I don't use a VPN, I don't even get the capital city of my country (which I'm in right now), I get a colo approx 250 miles north. So I also dispute that Cloudflare always returns the "nearest available datacenter".

Don't get me wrong, the write-up is cool and certainly interesting - just not convinced on the real world applications here...




> Unless I'm missing something, this seems like an incredibly long-winded way to check the user's IP location?

It's less accurate than that. IP geolocation can be accurate down to the city level in many cases. This is _maybe_ the nearest Cloudflare data center.


>just not convinced on the real world applications here...

As a piece of data alone, the results are probably not of significant use.

The real-world application (and potential danger) is when this data is combined with other data. De-anonymization techniques using sparse datasets have been an active area of research for at least 15 years and it is often surprising to people how much can be gleaned from a few pieces of seemingly unconnected data.


> The real-world application (and potential danger) is when this data is combined with other data.

That's exactly the point. In this case it's only really possible to de-anonymize people who take long-distance trips. But based on two data points it might be possible to know which flight or train a person travelled on.

With three different data points the combination might be quite unique. For example, you might find out somebody travelled from Italy to Norway on Monday evening and then to France on Wednesday morning. There are probably not many people who made a trip like that; it might come down to only one person (or a handful of people) who fits this itinerary. With other data sources it might be possible to uniquely identify this person.
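A toy sketch of that narrowing, using entirely made-up data: each observation is a coarse (country, day) pair, and `travel_log` is a hypothetical side dataset (think booking or passenger records) that an attacker would have to obtain from somewhere else.

```python
# Hypothetical side dataset: person -> set of (country, day) sightings.
travel_log = {
    "alice": {("IT", "Mon"), ("NO", "Mon"), ("FR", "Wed")},
    "bob": {("IT", "Mon"), ("NO", "Mon"), ("NO", "Wed")},
    "carol": {("IT", "Mon"), ("DE", "Tue"), ("FR", "Wed")},
    "dave": {("NO", "Mon"), ("FR", "Wed")},
}


def candidates(log, observations):
    """Return the people whose sightings include every observation so far."""
    wanted = set(observations)
    return sorted(person for person, seen in log.items() if wanted <= seen)


# Coarse observations of the target, in the order they were collected.
observed = [("IT", "Mon"), ("NO", "Mon"), ("FR", "Wed")]
for n in range(1, len(observed) + 1):
    print(n, candidates(travel_log, observed[:n]))
```

In this made-up log, each extra observation shrinks the candidate list, from three people to two to one. Real data would be far noisier, so treat it as an illustration of the idea rather than a working attack.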


>The real-world application (and potential danger) is when this data is combined with other data. De-anonymization techniques using sparse datasets have been an active area of research for at least 15 years and it is often surprising to people how much can be gleaned from a few pieces of seemingly unconnected data.

Seems pretty handwavy. Can you describe concretely how this would work?


>Seems pretty handwavy.

It has a whole Wikipedia article and everything.

https://en.wikipedia.org/wiki/De-anonymization#Re-identifica...

>Can you describe concretely how this would work?

Here's one of the earlier papers I remember off-hand, demonstrating one methodology. New statistical techniques (and improvements to existing ones) have been developed in the ~18 years since this was published. Not to mention there is significantly more data to work with now.

https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

From the Wiki I linked:

"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."

Point being that operational security is hard, and it takes a lot less to "slip up" and accidentally reveal yourself than most people think. Obtaining a location within 250 miles (or whatever) can be a key piece of information that leads to other dots being connected.

Other examples (albeit with less explanation) include police takedowns of prolific CSAM producers by gathering bits and pieces of information over time, culminating in enough to make an identification.


>"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

> [...]

>"Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them." [...] "A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."

The only reason the two attacks work is that you have access to a bunch of uncorrelated data points. That is, ratings for various shows and their dates, and cellphone movement patterns. It's unclear how you could extend this to some guy you're trying to dox on Signal. The geo info is relatively coarse and stays static, so trying to single out one person is going to be difficult. To put it another way, "guy was vaguely near New York on these dates" doesn't narrow down the search parameters by much. That's going to be true for millions of people.


>To put it another way, "guy was vaguely near New York on these dates" doesn't narrow down the search parameters by much.

That's why I said that this data alone is probably worthless, but can gain value when combined with other data. ("As a piece of data alone, the results are probably not of significant use")

The combining of data is the important bit and the entire emphasis of both of my other comments.

Two pieces of otherwise anonymous data can, when combined, lead to re-identification.


>Two pieces of otherwise anonymous data can, when combined, lead to re-identification.

How are you going to get more anonymous data? Practically speaking if your target has such poor opsec that he's hemorrhaging bits of data, you probably don't need this attack to deanonymize them.


>How are you going to get more anonymous data?

All over the place? Your comment history here (and mine!) is full of data. Each piece alone isn't identifying, but there's a good chance that in aggregate it is.

If you share that username on discord/twitter/reddit/steam/whatever, that's even more data. If you reference old accounts anywhere, you guessed it, even more.

>you probably don't need this attack to deanonymize them

My comment wasn't necessarily specific to this attack, just noting that this attack can be an additional piece of data in the chain of re-identification.

You've gone from "not convinced on the real world applications here" to "how are you going to get more anonymous data". If we assume that you can get some data somewhere (a small list of example sources above), can we agree that there is, possibly, a real world application?


Do you not buy that a user's IP location needs to be protected?

There is a reason applications go to so much effort to proxy requests to resources such as images. It's not free to do this.


Having your IP address not revealed to people that can message you on Signal seems like a pretty reasonable privacy expectation.


Your IP isn't revealed though, only your vague geographic area.


That's marginally better, but can still be a problem. Just consider e.g. a whistleblower working for a company with a very small satellite office in a given country.


Did you even read it? There's no IP leak. And if you're a high-value target, then using some kind of proxy is literally the first step you take. The attack is nothing but an exaggeration and has no merit in the real world.


Yes, I read it. Information about your IP address is leaked, as that's how Cloudflare routes you to a given datacenter.

And I strongly disagree that being able to uncover somebody's rough geographic location is not a privacy problem.

I wouldn't be surprised if this, for example, lets you deduce if somebody is currently home, at work, or commuting (as all three ISPs might be hitting different Cloudflare datacenters). That's not information everybody is comfortable broadcasting to the world.


If you aren't comfortable broadcasting it, then maybe take measures so that it doesn't get to that point. Privacy is not by default, ever


To quote Signal themselves:

> Privacy isn’t an optional mode — it’s just the way that Signal works. Every message, every call, every time [1]

While I don't consider this a critical bug requiring an immediate technical remediation from Signal, this should definitely be either fixed or called out in the documentation at some point.

[1] https://support.signal.org/hc/en-us/articles/360007320391-Is...


The sentence before the one you quoted gives the essential context:

> Signal conversations are always end-to-end encrypted, which means that they can only be read or heard by your intended recipients.

They're not saying that it is an anonymisation proxy, they're saying the messages and calls are encrypted for the recipient rather than to the server


They also use AWS so good luck using it on your actual IP


Privacy by default is Signal's entire brand


Weather predictions are the weather channel's entire brand, but people understand the concept well enough to know that this doesn't mean it's infallible. There is a limit to how many warning stickers we need in the world. If you want to rely on a particular feature, maybe check that the product supports said feature. Signal does encryption, not onion routing


I guess it can be useful for tracking fugitive political dissidents, terrorists, etc. If you can narrow their location down to 250 miles, it's already very useful information. And without raising any suspicions.


It's not really narrowing it down to 250 miles; it's narrowing it down to a circle whose radius is at least 250 miles, i.e. an area of ~196,000 mi^2.
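The arithmetic is just the area of a circle, shown here as a quick Python check (the 250-mile radius is the figure quoted upthread; `circle_area_sq_miles` is only an illustrative helper name):

```python
import math


def circle_area_sq_miles(radius_miles: float) -> float:
    """Area of a circle with the given radius, in square miles."""
    return math.pi * radius_miles ** 2


print(round(circle_area_sq_miles(250)))  # 196350
```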

My closest Cloudflare CDN is just listed as "DFW". The DFW metro area is about 8,700mi^2, and I imagine I could be even further than the "metro area" and still get the "DFW" Cloudflare datacenter.

In their little video animation, the area inside the overlap of those two circles encompasses several states. The edges of the two circles go from Washington to Florida and almost include Chicago. The target could have been in Denver or St Louis or Las Vegas or Phoenix or San Diego or San Francisco or Amarillo or El Paso.


I think it's still useful. Going from "we don't know where Osama bin Laden is at all" to "he's somewhere in Pakistan".


If only we knew OBL's Discord handle then we would have known he was about where we figured he was all along...

And then this whole thing gets thrown off if one uses a VPN with an endpoint somewhere other than where you are. Click a button, suddenly my datacenter is AMS. Click it again, suddenly it's OTP...


>If only we knew OBL's Discord handle then we would have known he was about where we figured he was all along...

Discord is just an example, this can apparently work with many apps that store user attachments on Cloudflare.

>Click a button, suddenly my datacenter is AMS. Click it again, suddenly it's OTP...

Well, if the location keeps changing, it's obvious it's not their real location. But if it’s always the same, no matter what, that’s a huge clue. Of course, this works best when you’ve got some other data to back it up. It’s kind of like playing Akinator - the more answers you get, the closer you get to figuring out the target. One answer might not tell you much, but three or four?
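As a rough sketch of the "always the same" test in Python: given a list of colo codes observed over time, check whether one of them dominates. The function name and the 0.8 threshold are arbitrary assumptions of mine, not anything from the write-up.

```python
from collections import Counter


def dominant_colo(observations, min_share=0.8):
    """Return (colo, share) if one colo dominates the observations, else None."""
    if not observations:
        return None
    colo, count = Counter(observations).most_common(1)[0]
    share = count / len(observations)
    return (colo, share) if share >= min_share else None
```

A target who hops between VPN exits would return `None` here, while a stable result is only a hint that still needs other data to back it up.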


In their example, the target pinged two datacenters, one in Dallas and one in San Francisco. Their requests might bounce between datacenters even if they aren't on a VPN.


This assumes that Osama bin Laden has poor enough opsec that he's using (e.g.) Discord without a proxy. State actors have much more sophisticated techniques available.

(It's still an interesting vector, though! But it's true that the headline and writeup are a bit sensationalized.)



