Modern anti-spam and E2E crypto (2014)

Dolores12 · on Aug 17, 2016

It's kinda fun to see he didn't mention at all that Google won't be able to serve personalized advertisements if a message is encrypted. And since Google is an advertising company, by implementing this it would shoot itself in the foot.

mike_hearn · on Aug 17, 2016

Because I was asked to talk specifically about spam by the moderator of the list, not all issues that true E2E crypto would raise.

Regardless, this is one of the easier problems to solve. Lots of people pay for Gmail and they can disable adverts. It's businesses and universities, mostly. In a world where email was fully encrypted that would presumably be extended to everyone and you would have to pay for consumer email accounts too.

The problem with doing it that way is that lots of people couldn't/wouldn't pay for email, and the internet benefits greatly from being able to assume everyone has an email account. If that assumption were to be violated then everyone would have to rethink their authentication systems, for instance, because there'd be no way to send password reset emails to people. So there'd be massive externalised costs from ending free, ad supported webmail.

wtbob · on Aug 17, 2016

> If that assumption were to be violated then everyone would have to rethink their authentication systems, for instance, because there'd be no way to send password reset emails to people.

I don't think that's necessarily true. Running email is simple enough that ISPs (to include mobile ISPs) could return to doing it for their customers. Heck, if you're a node on the Internet then by default you can receive email anyway …

mike_hearn · on Aug 18, 2016

Many internet users are not customers of an ISP.

nerdponx · on Aug 20, 2016

How is that possible? Do they just do all their Internet browsing at Wi-Fi hotspots and libraries?

mike_hearn · on Aug 24, 2016

Consider developing countries.

yeukhon · on Aug 17, 2016

I always find this really interesting. I am a Gmail user and I think I have only clicked on an ad once or twice in the lifetime of my usage of Gmail, since launch.

Naively, I'd think Google would care about the conversion rate, i.e. how many users are clicking the ads. But in reality, maybe they don't, because advertisers don't want to miss an audience of millions. But how effective are ads on Gmail?

Instead of displaying ads, why not send me a digest? They know my location, they know my search history thus far, they can recommend things to me without actually reading my email. My inbox is like my physical mailbox. I get ads every week. Send that over in a digest.

Disclaimer: I am not a big anti-ad person. I hold privacy in high regard, but as long as privacy is respected in some form of opt-in and opt-out format, and data retention is low enough, I am okay.

Pyxl101 · on Aug 17, 2016

Any webmail provider can also read the contents of your email. Even if they don't display ads alongside your email content, be aware that they could data mine it to learn all kinds of things about you. If you receive email receipts of purchases that you make, for example, then your webmail provider knows what you're purchasing. They probably know which companies you have business relationships with, where you travel, who you communicate with -- and keep in mind all the things they probably know about the people you communicate with.

This doesn't necessarily bother me. It's convenient to be reminded of an upcoming flight (that was sent to me as an email from a carrier) when I go to run a search for something related like "flight status". That's the kind of automated intelligence that really can make your life better (in small ways). But be aware that there's a lot going on behind the scenes to achieve it.

Imagine the kind of intelligence capability that would exist if you could tap the fact database of a platform like this with AI-style human questions. "OK Google. So, my friend John. What magazine does he read most often?" "I see that New York Times sent him a receipt for his subscription last month. He seems to get a new copy every week by email, and he clicks on 50% of those emails." "Interesting. OK Google, what's John bought from Amazon and Ebay?" "Based on the receipts and shipment emails, John has recently purchased ..." "Where will John be next weekend?" "John has a flight next Friday at 3 PM". Imagine for a moment how much you could probe into other people's lives with the ability to query information like this, and then realize that your email provider has this information about you. Their internal controls prevent you from accessing this information about other people, but they can access it about everyone.

Email is more likely a treasure trove of demographic and other information about you than it is an ad-serving medium directly. This could be true with the browser as well, and is true in limited cases, but from my understanding browser vendors don't currently data mine everything that you browse in general, but their terms of service do allow this with email. (Various Safe Browsing type features do allow HTTP requests to pass through central servers, and browsers like Chrome also track various things about web traffic like what TLS certificates a site presents, to look for MITM attacks. In principle these browsers could track anything/everything. Whatever they track can be observed by the user, though.)

As far as data retention goes, that's hard to know for sure. The data that's displayed to a user is often quite different than how data is tracked in the system underneath. Clicking "Delete" on an email might just set the "deleted" flag in the system. Before relying on retention I'd recommend checking out the terms of service to see what guarantee the company makes about when they'll expunge your content. Without a guarantee in your terms of service, deleted/expired content may as well just be hidden, and there may be a lot more data in the system than you're being shown. Furthermore, even if the data itself was deleted, that does not mean the machine-learned insights from past data have been deleted along with the original data. For example, if you regularly search for programming content on Google, then Google might learn that you're a programmer, and perhaps this could persist even if your individual web searches are deleted.

Service providers are also restrained by their desire not to be creepy. Google might know something about you from reading your email, but might also realize that you'll freak out if they surface that fact obtrusively in the wrong way -- like if you buy a dildo (which they learn by reading your email), they won't start showing you search advertisements for porn. This is the advertising industry and it's obvious that even people who are very into porn and dildos wouldn't want that (all the time). So what kinds of ads that get run, and what kind of insights and features surface from the data they have, will always be subject to human judgment.

By comparison, if you buy a plane ticket (which they learn by reading your email), and search for "flight status", they'll feel comfortable showing you your upcoming flight. The insights available to them far exceed the insights they plug into user-facing features. There are also subtle features they can take advantage of like accurately estimating your age, gender, background, education, whether you own a house, etc., which can increase the precision of various marketing aimed at you. Facebook of course has a lot of these insights as well, through their social graph, though they miss the insights that email conveys regarding your relationship with other businesses. Facebook has parlayed their demographic information about their customers into a tremendous advertising platform, which has the ability to target ads to people in very specific ways based on information about them, which has been until recently unmatched in other platforms.

See things you can target on with Facebook: https://www.facebook.com/business/help/433385333434831

> Custom Audience: Use email addresses, phone numbers, Facebook user IDs or app user IDs to create and save audiences you'd like to show your ads to. Location, Age, Gender and Language: Choose the basic demographics of the audience you want to reach. Interests: Choose specific interests that are important to your audience. These are determined by what people are connected to on Facebook, such as Pages and apps. Behaviors: Select people based on purchase behaviors or intents, device usage and more. These behaviors are determined by what people are connected to on Facebook, such as Pages and apps.

Google's closest comparison AFAIK is "remarketing lists for search ads": https://support.google.com/adwords/answer/2701222?hl=en

Being able to target ads so precisely can considerably increase their value, compared to a broadcast to everyone. Facebook's ability to support such specific targeting is probably a reason behind Google's attempted pivot into Google+: owning a social graph gives you a lot of direct demographics you can't collect easily another way, which then facilitates precise, high-value ad targeting that largely isn't creepy. Data mining people's email might give you the information, but people might not be comfortable seeing a very obvious connection between what's incidentally in their email and an ad or other experience somewhere else. It's not clear how this information in email is being used today, but given that your receipt of an email can impact the behavior of experiences elsewhere on the platform, it's probable that this content is being data mined in various ways.

I haven't reviewed the privacy policy in full, but see the following section: https://www.google.com/intl/en/policies/privacy/

> There are many different ways you can use our services – to search for and share information, to communicate with other people or to create new content. When you share information with us, for example by creating a Google Account, we can make those services even better – to show you more relevant search results and ads, to help you connect with people or to make sharing with others quicker and easier. (...)

> We use the information we collect from all of our services to provide, maintain, protect and improve them, to develop new ones, and to protect Google and our users. We also use this information to offer you tailored content – like giving you more relevant search results and ads.

That's essentially saying that they can do whatever they want with the data (for the legitimate purpose of providing and improving their services). There is no expectation that your email data stays within Gmail - it might improve their search product or your search experience, and so on.

I am largely not bothered by this because I see the mutually beneficial nature of effective advertising: it will help me discover things I'm actually interested in. I enjoy browsing the Steam store to find new video games regularly, for example, and a big part of that is its recommendations (based on what you play, presumably, and have rated) and its promotions. Similarly, I enjoy reviewing my Netflix and Amazon movie recommendations. The more these systems learn about you, the better job that they do finding you cool stuff. Google is the same way: the search gets better and its ability to anticipate your needs gets better. Effective advertising largely does not want to be and does not intend to be manipulative. It serves a purpose of staking the relationship: the product provider puts skin in the game by placing a bet that you'll be interested in it, by spending money to buy your attention. When a consumer finds something they enjoy and purchases it or consumes it for free with ads, everybody wins. There are manipulative elements to classic advertising, but I think the Internet age of personalized, highly targeted advertising is helping business get beyond that.

wtbob · on Aug 17, 2016

> It's convenient to be reminded of an upcoming flight (that was sent to me as an email from a carrier) when I go to run a search for something related like "flight status". That's the kind of automated intelligence that really can make your life better (in small ways).

The thing is, that sort of useful service can survive with end-to-end encryption: when I decrypt my messages on my device, I can pass them on to any agent I wish, to include a check-for-flights agent, a check-for-packages agent &c. — and those agents can then post status messages on my local (or even a remote — encrypted, natch) queue. I don't need a cloud service to run a few regexps; my phone, tablet, desktop & laptop are plenty powerful enough to do that.
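As a rough illustration of the "few regexps on my own device" idea, here is a minimal Python sketch. It assumes the message has already been decrypted locally; the pattern, the `find_flights` and `run_agents` names, and the sample wording of the carrier email are all hypothetical, since real carrier emails vary widely.

```python
import re

# Hypothetical pattern: matches lines such as
# "Flight UA 123 departs 2016-08-26 15:00". Real carrier emails differ.
FLIGHT_RE = re.compile(
    r"Flight\s+(?P<carrier>[A-Z]{2})\s?(?P<number>\d{1,4})\s+departs\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2})"
)

def find_flights(plaintext):
    """Return a list of flight dicts found in an already-decrypted body."""
    return [m.groupdict() for m in FLIGHT_RE.finditer(plaintext)]

def run_agents(plaintext, agents):
    """Pass one decrypted message to each local agent; collect status items."""
    queue = []
    for name, agent in agents.items():
        for item in agent(plaintext):
            queue.append((name, item))
    return queue

body = "Your trip: Flight UA 123 departs 2016-08-26 15:00 from SFO."
status_queue = run_agents(body, {"flights": find_flights})
```

A check-for-packages agent would just be another entry in the `agents` dict; nothing here needs to leave the device.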

dredmorbius · on Aug 17, 2016

That's an excellent history and summary of the spam situation. It adds a few pieces I'm not familiar with (I'd gotten out of the fight by the late 2000s), but validates a few approaches I've also considered.

Reputation is a crucial element, and ultimately the crucial element in large networked systems.

I'd make one correction about new accounts: if a new entity (email address, domain, IP) starts peering, SMTP actually offers a really good option for rate limiting: a non-permanent rejection. If new traffic shows up from an unknown source, you can simply tell it "not right now". In theory, the sending server is supposed to follow a back-off protocol, with a few minutes' delay initially. This gives time to start building up reputations through other means. And failure to follow the retry schedule is a strong sign of misbehavior itself. Greylisting uses this mechanism; teergrubing (tarpitting) is a related delaying tactic.
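The "not right now" reply is what is usually called greylisting. A minimal Python sketch of the decision logic follows, assuming a hypothetical `Greylist` class keyed on the (IP, sender, recipient) triplet and a five-minute minimum delay; the 451 reply code stands in for any transient 4xx SMTP response, and a real server would also expire old entries.

```python
import time

class Greylist:
    def __init__(self, min_delay=300, now=time.time):
        self.min_delay = min_delay  # seconds an unknown sender must wait
        self.first_seen = {}        # (ip, sender, recipient) -> first attempt
        self.now = now              # injectable clock, handy for testing

    def check(self, ip, sender, recipient):
        """Return an (SMTP code, text) pair for this delivery attempt."""
        key = (ip, sender, recipient)
        t = self.now()
        first = self.first_seen.setdefault(key, t)
        if t - first < self.min_delay:
            # Transient failure: a well-behaved server will queue and retry.
            return 451, "4.7.1 Please try again later"
        return 250, "OK"
```

A sender that never retries, or retries far faster than the back-off schedule allows, never gets past the 451 and can be scored down accordingly.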

The idea of vouched reputations is another one. If all peers or senders need to have someone vouch for them, then even new agents can engage in email.

The one resource all clients, regardless of compute power, access uniformly, is time. A protocol mandating a specific time interval (say, by requesting a specific future NIST randomness server value), might be interesting.

Another element that's helpful to the spam-fighters is that most communications patterns follow a strong frequency curve -- most comms are well-established, whilst less-frequent comms are often spam. Whitelisting based on established comms (and overriding those with specific blacklists), and treating all novel comms as suspect until some token of merit is received (a vouch, time delay, earned reputation elsewhere) should help.

The idea of multiple levels of encryption, with an outer envelope which an email server can read, with an inner envelope for the actual recipient, is another option. Here trust information can be placed in the outer envelope, but contents are protected. Possibly a protocol might request contents must be readable by the server for certain classes of sender.

mike_hearn · on Aug 17, 2016

Spam filters do treat mail between friends differently and this is why spammers switched to hacking accounts (vs creating them fresh) and then spamming the contact lists about 6 years ago.

Reputation is effectively a way to do your token of merit proposal. Reading a mail and not reporting it as spam is taken to be the "vouch" but of course that requires the mail to be delivered. If you open an account at a high reputation mail provider and send, then that's earned reputation elsewhere and it's up to that provider to keep a lid on people trying to abuse that reputation.

> The idea of multiple levels of encryption, with an outer envelope which an email server can read, with an inner envelope for the actual recipient, is another option. Possibly a protocol might request contents must be readable by the server for certain classes of sender.

That's PGP. The assumption behind E2E crypto is that the infrastructure provider can't be trusted, so if you let the infrastructure provider automatically request unencrypted copies of an email then it seems the point is lost.

wtbob · on Aug 17, 2016

> The assumption behind E2E crypto is that the infrastructure provider can't be trusted, so if you let the infrastructure provider automatically request unencrypted copies of an email then it seems the point is lost.

The difference would be that I get to choose which emails I send unencrypted. I might have no problem sending the latest Sears or LinkedIn semi-spam back up to the provider, but might have a problem sending the results of a blood test back.

dredmorbius · on Aug 17, 2016

Hacking (or spoofing -- Joe-job spam) accounts is an issue.

A messaging system could require some local-only authentication token: a keyfob or bracelet-generated code, say, or possibly a token integrated into hardware (EUIDs on smartphones, etc.), or some mix of the above.

"Who are you?" is the most expensive question in information technology. No matter how you get it wrong, you're fucked.

An advantage of a many-server, or peer-to-peer system, is that the sending servers' reputations themselves are far more individually chunked out. Gmail, as a spam source (or Yahoo, or AOL, or Hotmail, or ...) is a problem because it's almost certainly nonviable to reject all messages from such a major, centralised, service provider.

But if the server operates for only an individual, or family, or small neighborhood, spam-originating issues become both more specific, and more tractable for that site's operators.

The two-level encryption system need not be PGP. In some senses, it's TLS + PGP, as session information is encrypted to all but the receiving host, though alternate systems might be presented.

A voucher system -- where other peers vouch for an unknown one -- might be interesting. Incoming mail shows up, receiving server sends a "who vouches for you?" request (either to the originating server or via a DNS-like system). The vouches are assessed. Vouchers' own reputations are on the line as well, and there should probably be some closed-loop feedback here.
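To make the vouching idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the function names, the numeric reputation scale, the acceptance threshold, and the penalty step are hypothetical, and the lookup of who vouches for a sender (originating server or DNS-like system) is left out.

```python
def accept_on_vouches(vouchers, reputation, threshold=1.0):
    """Accept an unknown sender if its known vouchers' reputations reach the threshold."""
    return sum(reputation.get(v, 0.0) for v in set(vouchers)) >= threshold

def feed_back(vouchers, reputation, was_spam, step=0.25):
    """Close the loop: vouchers lose reputation for spam, gain a little otherwise."""
    for v in set(vouchers):
        if v not in reputation:
            continue  # unknown vouchers carry no weight and earn none
        delta = -step if was_spam else step / 5
        reputation[v] = max(0.0, reputation[v] + delta)

reputation = {"alice.example": 0.75, "bob.example": 0.5}
first_verdict = accept_on_vouches(["alice.example", "bob.example"], reputation)
# The vouched-for sender turns out to be a spammer, so both vouchers pay for it.
feed_back(["alice.example", "bob.example"], reputation, was_spam=True)
second_verdict = accept_on_vouches(["alice.example", "bob.example"], reputation)
```

The asymmetry (large loss for spam, small gain otherwise) is one possible design choice to make careless vouching expensive.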

Which gets to another issue -- a huge part of the email spam problem is the failure to close loops. That's where centralised systems have helped, somewhat, simply in generating enough critical mass to get things moving. But to take as an example (and one I've had massive issues with), Yahoo has a specific format they want their spam reports received in. I'm totally down with that.

But there are no end-user tools (at least at my last check a few years ago) which will generate those reports.

There's no script I can bounce mail through to generate a Yahoo-specific spam report. And the attempts I've made to self-report spam to Yahoo (a practice I've long since abandoned) meet with multiple levels of cluelessness.

This is a problem at any level of messaging infrastructure -- the inability to hold to account either the sender directly, or the sender's infrastructure. Take phone and SMS spam. Spoofing of origin numbers is rampant, and carriers aren't care-iers -- they don't care. "I'm sorry, Mr. Beedle, you don't understand. We're the Phone Company. We don't care. We don't have to."

Impunity is the root of much evil.

wtbob · on Aug 17, 2016

I wonder to what extent he's actually correct that end-to-end & reputation really aren't compatible. After all, if one ignores forwarding for a moment then a server will always know the peer who is sending it email, and if it knows that peer then it can calculate reputation for it (it can know the total emails it has received from that peer, and the number reported by users as spam). The problem then becomes one of making it more expensive to create a new peer, which I think is doable.
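The bookkeeping described here needs no access to message contents. A minimal Python sketch, assuming a hypothetical `PeerReputation` class and a made-up smoothing prior that starts every new peer at a poor score (one way of making fresh peers "expensive"):

```python
from collections import defaultdict

class PeerReputation:
    def __init__(self, prior_total=10, prior_spam=5):
        # Hypothetical prior: a brand-new peer is treated as 50% suspected spam.
        self.prior_total = prior_total
        self.prior_spam = prior_spam
        self.received = defaultdict(int)  # peer -> messages received
        self.reported = defaultdict(int)  # peer -> messages users flagged

    def record_delivery(self, peer):
        self.received[peer] += 1

    def record_spam_report(self, peer):
        self.reported[peer] += 1

    def spam_rate(self, peer):
        """Smoothed fraction of this peer's mail that users reported as spam."""
        return (self.reported[peer] + self.prior_spam) / (
            self.received[peer] + self.prior_total
        )
```

A peer only works its way below the prior by delivering mail that users do not report, which takes time and volume that a throwaway peer does not have.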

Xeoncross · on Aug 17, 2016

I've had a hard time finding discussions like this from people who have actually tried to link privacy with global communication. I would easily pay a subscription fee to read papers and ask questions from people who have worked on P2P networks, email providers, PGP alternatives, etc...

Good Privacy & Free, Open Communication seem like two things we just can't seem to reconcile.

jcranmer · on Aug 17, 2016

The ultimate takeaway is that abuse is the price you pay for anonymity--the more you give leeway for people to be jerks with minimal repercussions, the more jerks will take advantage of it. Reputation analysis for email drove down the anonymity (in the sense that email service providers have a much greater stake in policing their users to prevent abuse), and it remains the single most effective anti-spam strategy.

nerdponx · on Aug 17, 2016

If you need people to trust your company enough to not encrypt their data, you shouldn't play fast and loose with their privacy.

loup-vaillant · on Aug 17, 2016

> Botnets appeared as a way to get around RBLs, and in response spam fighters mapped out the internet to create a "policy block list" - ranges of IPs that were assigned to residential connections and thus should not be sending any email at all.

This is the wrong way to do it.

It may make things easier for Google. It may save them time, money, and worries.

This is still the wrong way to do it, for one simple reason: it is now impossible to send email from home. Instead, you have to have a relay or a VPN that does it from a fixed IP that'd better not belong to a subnet in disrepute. It keeps the email network centralised, an antithesis to its origins, where an email went straight to the machine of the final recipient, from the machine of the original sender —with no intermediates besides the DNS servers.

I cannot help but notice this centralization is convenient for Google and other huge mail providers. Whatever their actual intentions when they did this I don't know, but they clearly have no incentive to change that back (their business is to read your email for profit after all —the fact this is all automated only makes it worse, not better).

tedunangst · on Aug 17, 2016

> an antithesis to its origins, where an email went straight to the machine of the final recipient, from the machine of the original sender —with no intermediates besides the DNS servers.

Wow, is that some historical revisionism. Emails used to look like thrash!cmu!ucb!vax!tedu which meant send it to thrash, who will send it to cmu, who will send it to ucb, who will send it to the vax, which will deliver it to tedu. If you had a direct line, you could skip a few hops, but very few people were talking directly to the machine of the final recipient.
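For anyone who never saw one, the hop-by-hop reading of a bang path can be sketched in a few lines of Python. The function names are hypothetical, and this only models the address handling, not UUCP itself.

```python
def parse_bang_path(address):
    """Split a UUCP-style bang path into (relay_hops, user)."""
    *hops, user = address.split("!")
    if not user or any(not h for h in hops):
        raise ValueError(f"malformed bang path: {address!r}")
    return hops, user

def next_hop(address):
    """Return (host to hand the message to, address that host should use)."""
    hops, user = parse_bang_path(address)
    if not hops:
        return None, user  # no relays left: deliver locally
    return hops[0], "!".join(hops[1:] + [user])

hops, user = parse_bang_path("thrash!cmu!ucb!vax!tedu")
```

Each relay strips its own name off the front and forwards the rest, so the sender, not the network, chooses the route.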

drfuchs · on Aug 17, 2016

No, you're the one with incorrect history. The uucp email scheme with all the exclamation points actually appeared the better part of a decade after ARPANET email with the simple, direct addressing and delivery was in wide use. Perhaps you are confused because UNIX came late to the nascent Internet, while the various 36-bit DEC machines and OSes were there from the very beginning. Many thousands of students and academics were sending email to foo@bar.edu for dozens of values of bar before anybody ever saw !ucbvax! in an address.

tedunangst · on Aug 17, 2016

Yes, that works when the network is small enough to put literally every computer in the hosts file. But once your local sysadmin tires of updating the file every time some university across the world plugs in a computer, it starts falling apart.

To the original point, if I somehow plugged my time warped laptop into arpanet, nobody would be able to send me email until I made a bunch of phone calls.

(But thanks for the correction, I'd overlooked even earlier history.)

dredmorbius · on Aug 17, 2016

This misses the point.

Bang paths were used (and I used them) because we didn't have DNS and MX records. Which meant that routing was done manually, by the mail sender.

There was an end-to-end wired connection (usually -- batched and gathered mail, via UUCP was also a thing) between hosts. And pretty much any host, through intermediaries, could talk to any other.

Traceroute will show you a fairly equivalent system today.

The big difference was that the Internet (or its various ARPA / DARPA precursors) was a lot smaller. A few hundred hosts in the early 1980s, a few thousand by the end of the decade. And everyone knew each other. There was a book, literally, hardcopy print, with every site operator's name and phone number. Twice, for multiple indexing methods.

In your example, tedu wouldn't have any cause to deny mail from thrash, or any other host, directly peered, or bang-pathed. At least not until Green Card (which was Usenet, and I remember that too).

jcranmer · on Aug 17, 2016

My suspicion is that DNS block lists have largely been replaced at the major providers by reputation analysis, although the end effect is likely to be the same (which is to say that residential connections have their reputation so shot to hell you effectively can't send email).

That said: do you have a better solution? Note that doing nothing is not acceptable--the cost of spam has been such that consumers have opted for anti-spam solutions over the purity of the original email architecture.

loup-vaillant · on Aug 17, 2016

> That said: do you have a better solution?

Local (each his own) Bayesian filtering. Works well for me, with both Evolution and Thunderbird mail clients, and I have my email out there in plaintext in the open web.
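For readers unfamiliar with the technique, a per-user Bayesian filter is small enough to sketch here. This is a plain naive Bayes classifier with add-one smoothing, not the actual algorithm used by Evolution or Thunderbird; the `BayesFilter` class and its tokenizer are hypothetical.

```python
import math
import re
from collections import Counter

class BayesFilter:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """label is 'spam' or 'ham'; the user's own verdicts are the training data."""
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.words[label].values())
            for tok in self.tokens(text):
                # Add-one smoothing so unseen words do not zero out a class.
                p = (self.words[label][tok] + 1) / (total + vocab + 1)
                log_odds += sign * math.log(p)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))
```

Because training happens on the recipient's machine after decryption, this style of filtering is unaffected by end-to-end encryption.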

honkhonkpants · on Aug 17, 2016

Nothing Google does prevents you from enjoying the end-to-end nature of email. Send it to whomever you want. However the end-to-end nature also allows the endpoint to reject your message.

loup-vaillant · on Aug 17, 2016

When the "endpoint" has millions of users, like Gmail, it is not an endpoint. It's a hub. I don't care that it is technically an endpoint; in practice it's highly centralised.

It's not like each user chose to reject such and such message. No, Google is rejecting incoming mail for them —in addition to individual spam boxes. That does prevent me from enjoying the end-to-end nature of email to an extent. Oh, and, constant man-in-the-middle spying, which is also not exactly end-to-end.

dredmorbius · on Aug 17, 2016

So, you're right and wrong (and tedunangst is as well).

Yes, the original Internet (and its precursors) had very little by way of limitations on what any one node could do. But the original Internet was tiny. I've been in modest-sized offices which had more networked compute nodes than the early 1980s DARPANet had.

The Internet is as much a social phenomenon as a technical one, and one problem is that after we start hitting above about 300 of anything, behaviors start to change. At some point I'd really like to find a good behavior-by-scale reference, but quite simply, an Internet with billions of nodes cannot be run the same way one with a few hundred, or even tens of thousands, can.

And yes, this frustrates me too.

A huge problem is reputation, reputation management, and reputation sharing. I've watched much of the sordid mess of spam and anti-spam measures develop and evolve, and what does and doesn't work.

I'd still like to see vastly better peer-based reputation systems built directly into communications software (not just email). We'll see about that.

One reason residential dialup, DSL, and cable are service no-go zones is that every small operator has their own system for carving off spam. And most of them are very, very crude.

Lauren Weinstein, an Official Old Fart, announced some time back that he'd simply blackholed all of China for his personal email domain. Too much abuse, no valid content. Sucks if you're in China and want to reach Lauren. Oh well.

Multiply Lauren out a few thousand or million times.

There are a few possible solutions:

1. Buy business-grade Internet. It's expensive, but it's not residential, and you may be able to run services on it. Part of what you're paying for is for the service provider to call you up to bitch about shit flowing over your pipe that shouldn't be.

2. Set up specific understandings with the services you peer with. If it's just you and a few buddies, great. If you're trying to get email through to Google or a self-hosting F-50 company, good luck. They've got policies and shit and take ages. (Google may be fairly responsive, F-50 companies, in my experience less so.)

3. Build (or help build) something better. SMTP email has been with us a long time, has some useful features, and has served us well. But it's also got some tremendous shortcomings.

An alternative with robust encryption, anti-spam features, poll-and-fetch (rather than unsolicited-send), better multiplatform support, and the ability to dispense with overly-formatted email (I still prefer console mailers), might be a good start. Being fully peer-to-peer even better. That gets complicated though.