Hacker News
Millions of email addresses leaking to advertising and analytics companies (medium.com/thezedwards)
169 points by aspenmayer on May 1, 2020 | 62 comments


‘One important trend to notice is how often Google Analytics, Google’s DoubleClick, Facebook, and Twitter are ingesting the user emails — these are organizations that should be receiving deletion requests en-masse and they should all have processes to handle this type of effort already (Facebook likely has this tech already based on conversations on this research and additional research from a private report from several years ago).’

‘This type of email user data in a URL bar synced into Javascript pixels is most typically blocked by a regular person through “Ad blockers” or through browsers like Safari, Brave, and Firefox — those browsers use Javascript/cookie blocking as a default features to protect users (each browser handles it slightly differently). This breach and research included here would impact all Chrome users of these websites who went through these specific user flows and who didn’t proactively block all Javascript (a rarely used option) or use a Chrome “Ad blocker” extension that blocked this type of Javascript. Some people using the other “safe” browsers (Safari/Brave/Firefox) could have been protected from the leak due to their 3rd party Javascript requests being blocked.’

Original title too long. It was: The 2020 URL Querystring Data Leaks — Millions of User Emails Leaking from Popular Websites to Advertising & Analytics Companies


I think you overestimate how many "regular people" use an adblocker.


I would agree with you. Those aren’t my words, just two quotes from the article with some relevant info. As others have mentioned, it is email addresses that have been leaking and continue to leak.


Yup, like close to no regular people use an ad-blocker.


AdSense actually prohibits it. They warned me that they'd suspend me if I didn't remove an email variable from the URL.


This may be the first time I've seen an article try to sensationalize webhooks and third-party APIs. When I read the headline, I was expecting some kind of hack, not a story about how Dave in IT hooked the contact us form up to the CRM using webhooks and Zapier...

The meat of the story is a real problem: irresponsible mingling of PII in analytics data.


The number of times I've heard "Well, it's just analytics data, it's public anyway" from people just drives me up the wall. No, it's not public, it's still PII, you still have to guard it correctly!


It is very hard to make people understand that scale matters when deciding how sensitive data is. They only ever care about it in whatever narrowly defined use case they're worried about. The idea that someone can take an element of not-particularly-sensitive data from you, combine it with elements of not-particularly-sensitive data from elsewhere, and end up with a database full of extremely sensitive data simply does not click.


I worked for an analytics company, and this was a problem for us, because we didn't want to collect PII at all.

So as soon as a URL matched @\S+\. we anonymized the full URL. Customers were not happy, though.

But I can tell you that this was present on a lot of websites, including bank websites.
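
A rough sketch of that kind of filter, in Python. The regex is the one quoted above; the function name, the placeholder string, and the percent-decoding step are my own assumptions, not how that company actually did it.

```python
import re
from urllib.parse import unquote

# Heuristic from the comment: an "@" followed by non-whitespace and then a dot.
EMAIL_LIKE = re.compile(r"@\S+\.")

def scrub_url(url, placeholder="[redacted-url]"):
    """Return the placeholder if the URL looks like it carries an email address."""
    # Decode percent-encoding first, since "@" often shows up as "%40" in query strings.
    if EMAIL_LIKE.search(unquote(url)):
        return placeholder
    return url

print(scrub_url("https://example.com/welcome?email=jane%40example.com"))
print(scrub_url("https://example.com/pricing?plan=pro"))
```

The heuristic is deliberately blunt: it throws away the whole URL instead of trying to cut out just the address, which is probably why customers complained.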


This is explicitly against the Google Analytics ToS - what's the bet they do nothing about it?


They'll just invoke the standard SV bubble playbook:

1. Do nothing until it gets reported in a large dead tree medium.

2. Blame the reporter for not understanding technology.

3. Deny it happened.

4. Say it only affected a small subset of people.

5. Say it was only a single rogue "trusted partner" involved.

6. Put out the boilerplate "We can do better" press release.

7. Keep cashing the checks.

8. Lather. Rinse. Repeat.


Zuboff's Dispossession Cycle.


This covers query strings, but in theory, if you've got any third party JS on your registration/private page, that JS can get the contents of the form and exfiltrate it.

So, basically any 3rd party analytics has the ability to do this, query string or no?


How I am handling it (requires your own mail server):

Each email matching some rule (magicmarker[a-f0-9]+) goes to my account.

Each registration anywhere gets a unique email address generated as magicmarker<b64(hash(domain+salt))>@mydomain.com. The salt is there to keep it unguessable.

When I get any spam, I can redirect it to /dev/null and check where it came from, so I can send hate mail to the domain owner or whatever.

0 spam. 0 traceability. Ability to track who sold/leaked my mail address.
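
For anyone who wants to try this, here is a minimal sketch of the address generation. It is not the poster's code: I swapped the plain hash(domain+salt) for an HMAC keyed with the salt, and I emit hex instead of base64 so the result matches the magicmarker[a-f0-9]+ rule. The function name, the 16-character truncation, and mydomain.com are all placeholders.

```python
import hashlib
import hmac

def site_address(site_domain, salt, mail_domain="mydomain.com", marker="magicmarker"):
    """Derive a stable, hard-to-guess address for one site (hypothetical helper)."""
    # HMAC-SHA256 keyed with the secret salt; lowercase the site so the
    # same site always maps to the same address.
    digest = hmac.new(salt, site_domain.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate to keep the local part short; 16 hex chars is an arbitrary choice.
    return f"{marker}{digest[:16]}@{mail_domain}"

print(site_address("shop.example", b"keep-this-secret"))
```

Because the address is derived from the site name, you can recompute which site an address belongs to by running the same function over your list of registrations.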


If others reading this don't want to run their own mail server, other services offer the ability to generate suffixes (a la `+SignupPageFoo`) via other characters.

The one I use (purelymail.com) allows for underscores to serve this purpose, which will never be stripped. It's also super cheap (less than $1/month), though because of its low volume and AWS IP, my messages have a problem with getting marked as spam. The next comparable service I found that offered effectively infinite addresses was $50/year, so I fall back to gmail if I really need to send email.


This is not adequate. The + character is a very well-known part of the standard and it is very simple to remove; one search and replace over the whole email list would do it.


The + character is only "part of standard" for gmail and whoever else chooses to copy them. This is why I mentioned that the provider I'm using allows _ to be used the same as +, which will not be stripped or replaced by anyone ingesting email addresses.


Fastmail supports email domain wildcards / catch-all rules whereas providers like Outlook do not.


I thought the bad guys knew to strip the markers by now.


That + symbol is part of a pseudo-regex; it’s not part of the email address. You’re probably thinking of the email+whatever@whatever syntax, which isn’t what’s being described here. There’s nothing to strip in this case.


Whoops, you are correct. Pre-coffee me misread that. Thanks for the heads-up.


According to the RFC ‘+’ is a valid character and tagged addresses (which aren’t mentioned) represent unique users (so removing the marker could result in the email not being delivered at all.)

IME: I’ve never had anyone remove it (although some don’t accept it and others will have bugs that result in an unusable account if you register with it.) I used one with the company I most recently interviewed at (they used a third party service for the HR site and I’ve had trouble with sites like that selling my email address which is just so enraging honestly.) Every. Single. Time. I logged in to fill out a form after getting hired I had to call HR and have them reset my account because of some weird bug.


> According to the RFC ‘+’ is a valid character and tagged addresses (which aren’t mentioned) represent unique users (so removing the marker could result in the email not being delivered at all.)

Though similarly, email usernames ('local part') are case sensitive but I've never encountered a mail server where this was the case. I imagine if a nefarious party stripped the markers, they'd lose the tiniest percentage of their audience that way.

(Aside: Your own approach to using + is quite clever as it's the total opposite to most users, so you'll see who pulls this trick ;-))


The trick is to forward emails with no marker to spam. Sure they can strip markers but I assume they don't replace markers to match my "canonical" email address.


You will have to elaborate, I don't understand. The marker is there just to handle mail on the server side, and I could easily upgrade it to a hash of hash + salt and handle it programmatically on the server, but there was never any need for it. And anyway I couldn't care less: without it the mail is invalid, still 0 spam. And I have been doing it for the last 10 years, so it is battle proven.


I think the responder misread your message and assumed you were adding a suffix like \+\w+ to the end of your email addresses like name@gmail.com -> name+suffix@gmail.com

Several email providers treat a + symbol as the end of the first part of the address and ignore everything between it and the @.

I think the responder’s point was that analytics providers just ignore things after a + too.
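
To make the difference concrete, this is roughly what stripping a + tag looks like. It is a sketch of what someone cleaning an address list might run, not anything a specific provider is known to do, and the function name is made up.

```python
def strip_plus_tag(address):
    """Drop everything between the first "+" and the "@" in the local part."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" at all, so leave the string alone.
        return address
    base = local.split("+", 1)[0]
    return f"{base}@{domain}"

print(strip_plus_tag("name+suffix@gmail.com"))
print(strip_plus_tag("magicmarker3fa9c1@mydomain.com"))
```

The first address loses its tag and collapses to the plain mailbox. The second has no + in it, so there is nothing to strip, and the whole local part is what identifies the source.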


Yes, that's exactly what happened. Oops.


On a side note, on Firefox Android the page here freezes for about 10 or 15 seconds while loading. It's really annoying.


Product managers are seeing "longer time on site" in their analytics reports and keep adding more things thinking it is meeting the company's quarterly OKR

they don't know the "higher engagement" is because the mobile users' browsers are literally frozen

and the A/B test says "keep going with the B test!" "do it again!" in a tree that keeps evolving down one side of the graph towards more and more obnoxious experiences that the company doesn't even know is obnoxious

given the misaligned incentives I think this is also an area California can regulate or threaten to regulate, I don't like "tech regulation" but I can't think of any other party to curb the behavior. If you like "private sector solutions" more than "government solutions" then Apple and Google can pull the rug out from under all the other companies' feet by crashing sites on the user's phone using other users' crowdsourced data, or making certain analytics packages not run, etc.


... or maybe it's just a bug?


yes the bug of product managers following A/B tests blindly and not knowing it's a bug, resulting in a worse and worse internet browsing experience for all of us


Any company with the ability to measure time-on-site is also at least opening their site.


Yeah, but probably with a specific browser/os configuration, on high speed office internet.

Pretty common to find large websites that just happen to not work right in browser configurations that don't match their developer configuration even if they aren't obscure.


Underrated comment. You’re on the money.


Working fine for me on Android 68.7.0esr with uBlock Origin.

The website isn't exactly quick to load, but that appears to be because it is very large.


Interesting. I wonder what the hang up is.


Seems it's every medium site for me, so probably my adblocker or something.


I work as a data analyst. We have had some cases where we were called in to fix issues left by other agencies.

The clients had form data being sent as GET requests, and from there email addresses and even more personal data in the URL (street, date of birth, and more) were being transmitted into the analytics tools and also into marketing tools.

Under GDPR this is a breach and needs to be reported to the authorities as well as to the people affected.

Even a bank was affected by this type of implementation when customers wanted to open an account or make a loan application.


HTTP 101: do not transfer anything you don't want 'cached' as a GET request. Not only that, but some browsers will pre-emptively send GET requests or retry them so you'd have the double headache to worry about duplicate requests on the server-side.

It shouldn't require much experience to know when to use POST or some other HTTP verb - banks certainly have no excuse.


Email 101: Clickable links in emails are always GET, so extra parameters are set in the query string.

Marketing 101: User actions should take as few clicks as possible, so the action should be performed as soon as the user clicks the (GET) link.


> Marketing 101: User actions should take as few clicks as possible, so the action should be performed as soon as the user clicks the (GET) link.

Nope, some email clients might prefetch URLs in email for various reasons. You should absolutely NOT do this (unless you are deceitfully trying to game your engagement metrics). The only case where you might be able to get away with it is when the user has an active login session that you can verify prior to performing the action.


In the case of email, the sender already knows your email address and so should have no need to put it in the URL. The URL should only have some long random or pseudorandom identifier that has no meaning to anyone but them.
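
A minimal sketch of that idea, assuming an in-memory dict as a stand-in for whatever table the sender really uses. The function names, the `t` parameter, and the example.com URL are all made up for illustration.

```python
import secrets

# Stand-in for a server-side table mapping tokens to addresses;
# a real system would keep this in a database with an expiry.
_pending = {}

def make_link(email, base_url="https://example.com/unsubscribe"):
    """Build a link that identifies the recipient without exposing the address."""
    token = secrets.token_urlsafe(32)  # random, meaningless to third parties
    _pending[token] = email
    return f"{base_url}?t={token}"

def resolve(token):
    """Look up the address for a token, or None if the token is unknown."""
    return _pending.get(token)

link = make_link("jane@example.com")
print(link)
```

Any analytics script on the landing page still sees the token in the URL, but it cannot turn it back into an address without the sender's table.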


Everyone in sales is using the leaked emails. I have so many one-time emails that leaked (haveibeenpwned.com) and someone was smart enough to just use those leaked databases and sell the emails to sales departments.

I always ask the salesperson where the hell they found the email, because I just used it once somewhere a long time ago.


I have a bit of a contrarian view on data on the web. I think eventually any data on the web is going to be in the public domain in some capacity. Data will be everywhere and readily available, mostly for free.


You sound like a public television series circa 1985 talking about this new-fangled thing called "the internet."


For many of the examples described in the article, it is the site (not Google or Facebook) that left the email address in the URL. That is bad coding practice. Just so we are clear, it isn’t a case of “evil advertising companies did this”. Now, asking Google to search all the unwanted referral URLs it receives for a customer-specific pattern and delete what may look like an email address seems unfair. Google is in no position to use or recognize that data as email. If they tried that, it would be brittle and unmanageable.

One could argue Advertising shouldn’t exist or that Google should not store anything. But the GDPR argument is BS, although admittedly legal.

It is like throwing a small rock into the neighbor's yard and asking them to retrieve it for you.


Google will also ban your GA account if they find that kind of information showing up. They aren't dumb and know exactly what they can and cannot get away with.


Emails on query strings are leaks that should be patched. That said you can bet these companies are sending emails over formal integrations to tons of 3rd parties for analysis, targeting, advertising, etc.

CCPA is not nearly as strict as the GDPR and it is not illegal, unfortunately.


This is the most accurate assessment. Considering how "not secure" email is in general and how easy it is for this information to be passed around behind the scenes this is almost a non-story.

I feel this article really stunk of an attempt to over-sensationalize some sloppy coding that is probably happening on 50% of the websites in the world. To think otherwise is nothing but a utopian view of reality.


Email addresses not emails.


“Emails” is a valid real-world variant in use by non-technical people.

“Do you have each other’s emails?” is real and normal usage.

It’s fine to “get off my lawn” this, but it won’t help solve the data leak posted here.


Yeah, sure, in casual conversation, but in this case I read half the thing before it was clear what they meant, and it makes a big difference because leaking email addresses isn't the same thing as leaking emails, eh?

(BTW I think it sucks you're getting hammered by downvotes FWIW I upvoted you just to counter balance them.)


Sure it's valid but that doesn't mean it's not ambiguous. All it would take is an initial clarification (once near the top) that they're using "email" to mean "email address" and not "email content" and then we can all give this the correct level of attention.

I'm not for a moment suggesting this isn't bad but if you run about warning people of something important it helps to be clear and not risk exaggeration / ambiguity (or you play into downplayers' hands and potentially waste the time of honest folk).


This usage shouldn't be encouraged. We don't have the same double meaning with snail mail. Why should we have it with email?


I was quite confused at the beginning of the article until I realized they were talking about addresses.


sure; it's fine for colloquial use, but terrible for a headline where the term is ambiguous, and the more common meaning carries a more sensational implication (that is wrong).


Difference between written and spoken language. Also content matters.

The title is plain wrong.


^ this should not get downvoted.


You are quite right. The article seems not to know the difference or is very sloppy in applying the correct term.


They are leaking email _addresses_, not emails. I was puzzled how they could manage to leak emails in a context where no emails are involved (but email addresses are).

I wish people would be a bit clearer in the language they use, especially when it's about vulnerabilities.


You are trying to impose your own internal representation of email as a concept on others here.

To write "email" is correct. It would be more precise to write "email address" to make the distinction from "email message." But it is not the cultural norm to equate "email" with "email message" as you seem to do.

Juuust kidding. My point is: they probably did not use clearer language because their internal concept of what email is is so fuzzy. That would be my guess anyway.


It’s because emails would get more clicks than email addresses.



