I feel inclined to say "... well yeah, obviously".
Not in the "obvious in retrospect" way, but because browsers have been progressively blocking history-sniffing tactics for years precisely because advertisers were using it to identify visitors.
Did this research... establish better numbers around it or something?
> Did this research... establish better numbers around it or something?
>> However, this time around, since the data was collected from Firefox itself and not through a web page performing a time-lengthy CSS test, the data was much more accurate and reliable. Furthermore, the data Mozilla researchers collected is also about the same type of data that today's online analytics companies also collect about users — either through data partnerships, mobile apps, online ads, or other mechanisms.
Clearly not actually anonymous browsing data, though... which is why we should always take claims that telemetry data is anonymized with a grain of salt.
HN is on the internet, no? Also, HN definitely has a particularly bad headline-only problem, or maybe it just shows worse than some other places because people here have a tendency to ask really basic questions that the article clearly answers.
I think the problem here is that HN has higher standards (and we should keep it that way). Reddit is far worse, but I don't want to deal with all the stupid there.
Does HN have higher standards? Maybe different standards, but I don't know that when it comes to reading the article that HN is much better than Reddit.
I regularly visit maybe 5 or 6? The rest tend to be random links from reddit or HN, I wonder if visiting a site like that once and never again is enough to help with that identification.
Another thought: I think it's obvious that if it's a site you log into and the URL contains an identifier of some type, then it's easy to identify you, which is why schemes to hide the URL could also be a privacy issue.
Consider, for example, that many pages use remotely loaded resources.
I would think things like Facebook/Twitter like buttons or Google Fonts might make it possible to assemble this history.
Sites like FB are said to maintain "Shadow Profiles" of people, even when those people aren't using their service directly.
I suppose in theory any sufficiently shared infrastructure such as AWS/Cloudflare could do so as well, but they are disincentivized from doing so.
Would using Firefox's 'Containers' help prevent this? As far as I understand they quarantine the Facebook pages so they can't get data from other websites you visit.
I think only indirectly, but if they control the endpoint they can ping you back, subtract the RTT from the initial request's response time, and the difference can tell them whether the hostname in the initial request was already cached in DNS or not.
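As a rough Python sketch of that timing side channel (the threshold here is invented; real code would have to calibrate it against the victim's resolver):

```python
import socket
import time

def resolve_time_ms(hostname):
    """Time a DNS lookup; a cached answer returns much faster than a cold one."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror:
        pass  # even a failed lookup exercises the resolver cache
    return (time.perf_counter() - start) * 1000

# Illustrative threshold: a warm cache typically answers in well under a few
# milliseconds; a cold recursive lookup takes tens to hundreds.
CACHE_THRESHOLD_MS = 5

def probably_cached(hostname):
    return resolve_time_ms(hostname) < CACHE_THRESHOLD_MS
```

In practice the attacker runs the measurement from the page (or infers it server-side, as described above), not from Python, but the arithmetic is the same.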
Just so I understand correctly, does that mean you then need to control the end point of every site you want to use as part of fingerprinting?
If so, wouldn’t that drastically reduce the effectiveness of using DNS resolve times as a work around for Firefox containers?
Not trying to be argumentative here, just trying to understand how effective the sandboxing is, or whether I need to design more layers of indirection. :)
Have there been any indications that AWS broadly captures connection data between AWS tenants and their respective users for illegitimate purposes?
Some AWS services (such as TLS-terminating load balancers) do have access to sensitive cross-site information that could be fed into the adtech panopticon but I wonder if it would be cost-effective for AWS to gather.
I doubt it would be cost effective for AWS to do broad captures across all of its services, however. There's probably not much value in slurping up the IP and SNI data for every HTTPS request to every EC2 instance.
Malicious extensions are a likely culprit. This is the ultimate irony of the whole WebExtensions debacle; browser vendors wanted to stop the extensions from interacting with the browser because maintaining that interface is work, so now the most trivial extensions will request full access to all websites so they can inject scripts. To bring back "backspace navigates back" I have an extension that needs just that.
Needing JavaScript that embeds in every page for basic mouse and keyboard behaviour is insane. No clue why they decided it should be the only viable option.
Fine, XUL had to go. But where is the replacement? How many more years should we expect Mozilla to need to implement configurable bindings? It doesn't even need to be extension-accessible; just give users a tab in the preferences menu, like damn near every other application has done since the dawn of GUIs.
I am very much used to Alt+Left Arrow to navigate back; mentioning it in the unlikely case you were not aware of this shortcut and would like to drop this extension for whatever reason.
Hmm this is a bit of an interesting question. The original study (2012) exploited a security bug, which let anyone see which sites you had visited. (Basically, by checking the color of links with JS to see if :visited styling had been applied.) That bug doesn't exist anymore, and the new survey just uses opt-in data to "confirm" it.
So, I don't actually think this research is particularly relevant anymore? It can't really be exploited (and when it can, there's much better ways to track the person).
Anyone with a widely distributed analytics package or tracking beacon can track your hits on pages with that beacon. How many pages DON'T use Google Analytics or a Facebook 'like' button?
Does Mozilla Pocket get browsing history if it isn't disabled? Last I heard it's not using E2E encryption and Mozilla still hasn't open sourced the server side of it.
My understanding of pocket recommendations is that it gets a list of articles from a server every day and uses a local algorithm to match them to your browsing history so your history never leaves the device. Idk if any metadata about which decisions it made is leaked though.
I'm under the impression that syncing browsing history between instances of Firefox is a feature Mozilla provides through Pocket, but admittedly I don't have first-hand knowledge of this.
I think syncing is done with a Firefox account (Firefox Sync), and I can't find implementation details, but I did find:
"Firefox Accounts uses your password to encrypt your data (such as bookmarks and passwords) for extra security. When you forget your password and have to reset it, this data could be erased. To prevent this from happening, generate your recovery key before having to reset your password."[1]
So it appears they may be encrypting data locally and syncing encrypted data without having keys.
I think you are right though, there are more website saving features available through pocket other than recommendations and I'm not sure how any of that works.
That's hardly surprising. I mean browsers hand out willingly plenty of information that could be used for pretty accurate identifications. Just scrolling through my scores on amiunique[1], many of the parameters put me in the 0.01% category.
Using Mac OSX stock audio input and output devices is already supposedly "unique".
Having an AZERTY keyboard supposedly puts you in the 0.04% category, even though all French speakers have the same setting, which means that 0.04% already represents 70M+ users. So, far from "unique".
Congratulations on never actually bothering to block JS and find out - you know, facts. From actually doing so over many years, and so from actual experience, I'd say completely non-functional sites are about 25%.
I’d put the number quite a bit lower than that, probably comfortably under 10% of sites I interact with, though the trend is definitely upwards, drastically so among interactive things (which are probably worse than 50% broken these days).
To address this, a friend and I started sketching a VPN/HTTP proxy that would have a set of, say, 100 outgoing IPs, look at the domains being connected to, and distribute request destinations over those IPs.
So e.g. Google would always see the same IP, which would be different from the one Facebook sees.
While cross-referencing access times and identification is still theoretically possible, it should be an entirely different game.
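Roughly the pinning we had in mind, as a toy sketch (the egress pool and the hash-based mapping are placeholders, not a finished design):

```python
import hashlib

# Hypothetical pool of outgoing addresses owned by the proxy operator.
EGRESS_IPS = [f"198.51.100.{i}" for i in range(1, 101)]

def egress_ip_for(domain):
    """Deterministically pin each destination domain to one egress IP,
    so google.com always sees the same address, but a different one
    from the address facebook.com sees."""
    digest = hashlib.sha256(domain.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(EGRESS_IPS)
    return EGRESS_IPS[index]
```

The determinism matters: a given site always sees a stable IP (so you don't look like a bot hopping addresses), but no two unrelated sites can join their logs on it.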
Would anyone else reading this be interested in working on this or joining in? I'm not thinking to make it a startup or business per se, but 1) reliable IPs are a bit too expensive to make sense for just one person, 2) there's anonymity in numbers.
I'm thinking the ideal would be something FOSS and easy to self-host and replicate, so you can pool together a group of friends for a shared VPN among semi-trusted parties (at least, the user should trust the operator not to index requests and sell the data, and the operator should trust users not to run botnets).
I think an easier approach is that once you have good IPv6 connectivity you could do something like a unique address per day per host. Every device could have 100M ip addresses and it wouldn't touch the IPv6 address space (10 billion humans * 100 devices = 0.000005% of the IPv6 address space).
Edit: My math is wrong. I thought IPv6 was 2^64, but it's actually 2^128, so that percentage is another factor of 2^64 (roughly 2×10^19) smaller.
You get a range from your ISP (e.g. a /64). Everything within that range would be tied to "you" (rather, your connection, but something like user agent would tell its your wife's iPhone or your MacBook Pro).
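A quick sketch of deriving a per-day address inside such a /64 (the prefix and secret are made up; the real-world analogue is RFC 4941-style temporary addresses, but this is just the idea):

```python
import hashlib
import ipaddress
from datetime import date

def daily_address(prefix, host_secret, day=None):
    """Derive a fresh 64-bit interface identifier inside the ISP-assigned
    /64 from a per-host secret and the current date."""
    day = day or date.today().isoformat()
    digest = hashlib.sha256(f"{host_secret}:{day}".encode()).digest()
    iid = int.from_bytes(digest[:8], "big")  # low 64 bits of the address
    net = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(net.network_address) | iid)

# Same host, different days -> different addresses within the same /64.
monday = daily_address("2001:db8:abcd:12::/64", "laptop-secret", "2020-08-24")
tuesday = daily_address("2001:db8:abcd:12::/64", "laptop-secret", "2020-08-25")
```

Which also illustrates the parent's caveat: all the generated addresses still share the /64 prefix, so anyone keying on the prefix rather than the full address ties them back to one connection.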
Yes, but the primary identifier is IP address. The detailed profiles built with fingerprinting and other data are attached to the small set of IP addresses a person uses over the course of her lifetime. Most internet users have limited choice when it comes to internet access. A user cannot change her IP address with the same ease as she can change her software fingerprints. If a company is trying to sell online ad services, then having a database of browser fingerprints purportedly representing real people is not very valuable unless the company can link those fingerprints to real physical locations.
Given the prevalence of CGN, especially in mobile/cellular internet, and the reality that mobile is the primary internet access for a large number of users, the use of IPs as a primary key feels less likely these days than a decade ago.
> A user cannot change her IP address with the same ease as she can change her software fingerprints.
I dunno. It’s a lot easier for my less techie friends to reboot their router and get a new IP than it is to talk them through installing some privacy-enforcing software that requires regular maintenance or results in weird and wonderful breakage of their favourite websites.
Sounds like you are describing two different scenarios: 1. connecting to cellular networks when away from home/office and 2. connecting to internet routers at home/office.
Don't take my word for it, read the work cited in the article. Note how much they still rely on (static) IP addresses. If we removed the IP address as a reliable item of available data, based on observed practices (not theory), that would likely be significant.
"Mishra et al. demonstrated that IP addresses can be static for a month at a time [42] which, as we will show, is more than enough time to build reidentifiable browsing profiles."
"Secondly, ground truth was established based on reidentifying visitors with a combination of IP Address and UserAgent, perhaps biasing the baseline data to under-represent users accessing the web from multiple locations."
"Even if traditional stateful tracking is addressed, IP address tracking and fingerprinting are a real concern as ongoing privacy threats that can work in concert with browser history tracking. We point readers to Mishra et al.'s [42] discussion on IP address tracking and possible mitigations. They observed IP addresses to be static for as long as a month at a time, and while not a perfect tracker, IP addresses are trivial to collect."
How static/deterministic are the CGNAT translations though? It is conceivable that when client A connects to a Facebook service with IP X and port P that the source IP and port observed by Facebook is always the same.
In any case your ISP is probably logging all your DNS queries and all their dynamic NAT translations to a database, so couple REMOTE_ADDR with REMOTE_PORT and a timestamp and you can almost certainly be identified.
As to how the translations are occurring, I've never actually managed a CGN platform myself, but based on my knowledge of other hardware, I suspect you're closer to the reality than I was, and it's likely that a SRCIP always results in the same TRANSLATED SRCIP, as that can then be installed in hardware trivially and no longer needs to traverse the punt/cpu path to lookup what the translation needs to be.
That does leave the system open to abuse though, depending on how quickly entries age out, as a single customer could easily open up 65k sockets in a very short span of time, effectively DoS'ing any other customers who are using the same TRANSLATED SRCIP if there are no free TRANSLATED SRCPORTs left that their translation can bind to. Then again, the risk of this could be perceived to be low, with a AUP that can handle this if it turns out to be a social rather than technical problem, so it could still be happening in the wild anyway.
This remains a good reminder for me to avoid speculating about topics I haven't thought too deeply about!
Oh absolutely, it's not a silver bullet - just an attempt at alleviating that single dimension, which I still think is significant enough to take seriously.
IMO there will never be a complete solution but that means we have to tackle each issue or dimension individually within the larger context, not just throw our hands in the air and give up.
Maybe I should have been clearer about the scope and ambition in the original comment, but I can't edit it anymore.
It sounds to me like your threat model here is that two sites (A and B) would like to share identity, and you want to stop them? Perhaps A has identity (for example, you log in), A gives you links to B, and B runs some third-party JavaScript served from A. For example, A could be FB/Google/etc and B could be a news site. Site A can add any query parameter it wants to the outgoing link, and then parse it on B in their third-party JavaScript. If they always used the same params (ex: fbclid/gclid) it would be easy to detect and block, but if they were trying to get around the blocking it would be easy to rotate these parameters as often as they wanted, because the same entity (A) controls both the producer and the consumer. Now your two identity bubbles have been joined.
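A toy sketch of that rotation trick, with an invented shared secret and naming scheme (both producer and consumer are controlled by A, so they can agree on anything):

```python
import hashlib
from urllib.parse import urlencode, urlparse, parse_qs

SHARED_SECRET = "a-to-b-secret"  # known to site A and to A's script on B

def rotating_param(epoch):
    """Both ends derive today's parameter name, so a static blocklist
    of known names (fbclid, gclid, ...) never catches it."""
    return "p" + hashlib.sha256(f"{SHARED_SECRET}:{epoch}".encode()).hexdigest()[:8]

def decorate_link(url, user_id, epoch):
    # Site A: attach the identity under the rotating name.
    return url + "?" + urlencode({rotating_param(epoch): user_id})

def extract_id(url, epoch):
    # A's script running on site B: recover the identity.
    qs = parse_qs(urlparse(url).query)
    return qs.get(rotating_param(epoch), [None])[0]

link = decorate_link("https://news.example/article", "user-123", "2020-08-25")
```

A real implementation would obviously encrypt the value too; the point is only that blocking by parameter name is a losing game once both ends cooperate.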
(Disclosure: I work on ads at Google, speaking only for myself)
It's definitely an arms race, and IMO it makes sense to push back on areas where one can. Are you implying that it's not a worthwhile effort and that the battle is lost?
I'm thinking 1) DNS control with block lists 2) browser extensions (restricting canvas, removing tracking parts of URLs) 3) being restrictive about disclosing PII 4) IP obfuscation along the lines I laid out above should make it a lot less deterministic and decrease confidence in merging of datasets.
Rule lists obviously have to be continuously updated.
Only a Sith deals in absolutes, but from your perspective, am I missing something?
Here in the UK, date of birth and postcode are enough to identify something like 95% of people. Anonymised data sets are not really possible once you have more than a few variables. Most people don't know this.
My local area published "anonymised" datasets of public transport usage, but they gave everyone a unique ID. It was found that if you knew 2 trips the person took, you could uniquely identify the person in the dataset and see all of their trips.
Of course that case will fail, but in almost all cases, if you know something like which bus they took to work and that a week later they took one to the mall, it's now possible to find all of their trips. For someone you know somewhat well, it's not hard to find 2 trips they took and then be able to pull up all of their trips.
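With a toy version of such a dataset (made-up rider IDs and trips), the unmasking is just a filter:

```python
# Toy "anonymised" dataset: (rider_id, trip) pairs.
trips = [
    ("u1", ("bus42", "mon-08:00")), ("u1", ("bus7", "sat-14:00")),
    ("u1", ("bus42", "fri-17:30")),
    ("u2", ("bus42", "mon-08:00")), ("u2", ("bus3", "sun-11:00")),
]

def unmask(known_trips):
    """Return the riders whose history contains every trip you know about."""
    by_rider = {}
    for rider, trip in trips:
        by_rider.setdefault(rider, set()).add(trip)
    return {r: t for r, t in by_rider.items() if known_trips <= t}

# Knowing just two trips pins down u1 and exposes the rest of their history.
matches = unmask({("bus42", "mon-08:00"), ("bus7", "sat-14:00")})
```

One trip (the Monday bus 42) matches two riders; adding a second trip collapses it to one, and the "anonymous" ID then hands over everything else they did.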
Ok, so two trips, but not any two trips. It requires a lot more knowledge of the person. For someone you know well, why would you even need to look at the data?
The thing is, if you know enough trips to unmask them, then you can find out about all their trips in that dataset.
As an employer, maybe I can find out an employee wasn't home sick when they said, but took the bus to a station that only serves a competitor's business. Etc.
It's almost any two trips. It's the exception that two people take the same trip together. I can think of a handful of people I could eventually find 2 trips for who wouldn't want me to have their entire travelling history.
First, postcode is something you give out pretty willingly. If you put your postcode and DOB into an insurance quote website, they would no longer be insuring you based on a pool of people like you. They'd literally just see how many claims you had. And also whatever ethnicity, sexuality and 50 other personal, irrelevant criteria they want.
The second is that postcode is only a narrow or broad measure depending on what you're using it for. If you want to do a study on asthma rates vs road traffic, postcode is just right; anything more general and you're comparing side streets and motorways. So it makes sense for that data to be available. But wait, as the data user, I only need one more data set (say voter registration, already available) and I can literally look up your medical history before deciding whether to hire you.
This is the issue here: data HAS to be specific to be useful. But if it's specific, it's dangerous. AND data is much more specific than you realise, because a few innocent-sounding data points are unique to you when combined.
In Ireland, we use Eircode with one house per postcode. This is very handy because you don't need to type in your full address on a lot of websites, just the Eircode.
The first three characters of an Eircode are more like a traditional postcode in that they indicate your area/town, but the next four are randomised for each address.
When you consider the birthday paradox, and consider demographics, it probably doesn't.
If you assume that DOBs are evenly and randomly distributed over the last 100 years (1 of ~36,500 values), then the probability of none of 100 people sharing a DOB is only ~87%. If you tuned for demographics, the true stats would be much worse.
That said, you probably have anti-clustering aspects - parents obviously can't share birth dates with their children, and siblings can't either (unless twins). But! couples tend to be of a similar age...so, tricky.
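The ~87% figure is easy to check numerically, under the same idealised uniform-DOB assumption (this is the standard birthday-problem product, nothing more):

```python
def p_no_shared_dob(n, days=36500):
    """Probability that n people all have distinct dates of birth,
    assuming DOBs uniform over ~100 years (~36,500 possible dates)."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

print(round(p_no_shared_dob(100), 3))  # prints 0.873
```

So even before demographic clustering, there's already a ~13% chance of at least one shared DOB among 100 people.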
50-100 houses is not that much. You can find someone in those 100 houses with 1 DOB and a postcode. Maybe in very crowded areas you'll need one more variable. Worst case, you'll end up with 2-3 people.
Intuitively there are tons of things we do on our computers that uniquely identify us. I am sure the adware companies know a ton more that is not public, too. Strict privacy-preserving tech is needed across the whole stack.
By looking at all the data available to untrusted sites (as seen in https://amiunique.org/fp) you can tell that Web is many many years away from being privacy conscious. List of fonts, canvas fingerprinting, timezone, OS, user agent... the list goes on and on. Those of us who are tech-literate know better than to create tech like this today, but there's just too much momentum (and shady interests) to hot-swap Web for something else.
Wasn't it shown by AOL researchers ~20 years ago that search histories are uniquely identifying? If so, this seems hardly surprising, as browser history should be a superset of search history.
"TrackMeNot runs as a low-priority background process that periodically issues randomized search-queries to popular search engines, e.g., AOL, Yahoo!, Google, and Bing. It hides users' actual search trails in a cloud of 'ghost' queries, significantly increasing the difficulty of aggregating such data into accurate or identifying user profiles."
I use it as far as I can, but it's stopped working in Pale Moon. The queries it produces aren't very intelligent when you see them, and it wouldn't take much NSA/MI5 work to filter most of them out.
I noticed the other day that various chatbots (as in, a single service shared across multiple websites) call me "The University of Texas at Austin", presumably because I have a housemate who works there.
I tried various VPN servers and got called by other company names[0]. It was a good reminder about how we're tracked, and our information may be shared, even with other users.
I suspect privacy would be better served by taking the approach of the security domain, with responsible disclosure to vendors and a concerted effort to attack the problem holistically. Until then we’re just giving privacy attackers a heads-up, and by the time this issue is mitigated they’re onto the next avenue for bypassing privacy.
If your random pages are a, b and c but my pages are d, e and f or even a, b and d then it’s still easy to fingerprint us.
Extensions like this might work if they visited the same sites all other users visit. Otherwise you’re just adding even more unique information for the trackers.
But if both our random pages are a, b and c, and the only difference is when or how often I accessed each of those, then making it random for both of us will effectively turn us into the same person.
What about all the other pages you visit? How does adding random traffic to your history make you any harder to identify? It just creates more datapoints.
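A toy model of why the noise doesn't help: a tracker that has candidate profiles can simply ignore the random pages, because the real ones still overlap (users, pages, and noise pool are all made up here):

```python
import random

# The pages each user genuinely visits, vs a pool of injected noise.
REAL = {"alice": {"a", "b", "c"}, "bob": {"d", "e", "f"}}
NOISE_POOL = [f"noise{i}" for i in range(1000)]

def observed(user):
    # Each user's extension mixes their real history with random noise pages.
    return REAL[user] | set(random.sample(NOISE_POOL, 20))

def identify(history):
    # Tracker picks the candidate profile with the largest overlap;
    # the noise contributes nothing to either candidate's score.
    return max(REAL, key=lambda u: len(REAL[u] & history))
```

The match survives any amount of per-user random noise; only noise drawn from *other people's real histories* (i.e. everyone visiting the same sites) would actually blur the signal, which is the parent's point.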
If the study establishes that for all practical purposes, online anonymity is impossible to maintain for average users, what are the implications (a) for the average user; (b) for the economy; and (c) for society?