WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.
The push notification argument is also overstated. Sharding and fan-out solve the burstiness. And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as from Twitter. Besides, these days the push notifications go through Google/Apple servers anyway, to reduce the number of open connections needed on the phone side.
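To make the fan-out point concrete, here is a minimal sketch of hash-based sharding for notification fan-out. The shard count, types and queue-per-shard layout are assumptions for illustration, not anyone's actual setup:

    // Hash-based sharding for push-notification fan-out (illustrative only).
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    constexpr std::size_t kShardCount = 64;  // hypothetical shard count

    struct Notification {
        uint64_t recipient_id;
        std::string payload;
    };

    // Each shard owns its own outbound queue, so a burst only loads the
    // shards its recipients hash to.
    std::size_t shard_for(uint64_t user_id) {
        return std::hash<uint64_t>{}(user_id) % kShardCount;
    }

    // Fan a batch out into per-shard buckets; each bucket is then drained
    // by that shard's worker and handed to APNs/FCM.
    std::vector<std::vector<Notification>> fan_out(const std::vector<Notification>& batch) {
        std::vector<std::vector<Notification>> buckets(kShardCount);
        for (const auto& n : batch) {
            buckets[shard_for(n.recipient_id)].push_back(n);
        }
        return buckets;
    }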
Then there are DMs. They are per person, so CDNs don't help much (beyond static assets), but they also shard basically perfectly. So, shard them.
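What "shard them" means in practice, as a hedged sketch: key each conversation by the unordered pair of participants so a DM thread always lands on one shard. The constants and the pair hash below are made up for illustration:

    // Shard key for DMs keyed by the unordered participant pair (illustrative).
    #include <algorithm>
    #include <cstdint>

    constexpr uint64_t kDmShards = 256;  // hypothetical shard count

    // The same two users always map to the same shard, regardless of who sends,
    // so a DM thread never needs cross-shard coordination.
    uint64_t dm_shard(uint64_t user_a, uint64_t user_b) {
        uint64_t lo = std::min(user_a, user_b);
        uint64_t hi = std::max(user_a, user_b);
        uint64_t key = lo * 1000003ULL ^ hi;  // cheap pair hash, good enough for a sketch
        return key % kDmShards;
    }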
Which in the end leaves the user feeds. Designed correctly, sharding would work extremely well, and whatever doesn't shard cleanly could be handled by caching closer to users for the ~1k most popular accounts.
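Roughly what I have in mind, as a sketch under stated assumptions (the hot-account set, names and callback are invented for illustration): fan-out-on-write for ordinary accounts, fan-out-on-read plus caching near users for the handful of very popular ones.

    // Hybrid fan-out: push for ordinary accounts, pull + cache for hot ones.
    #include <cstdint>
    #include <functional>
    #include <unordered_set>
    #include <vector>

    struct Tweet {
        uint64_t author_id;
        uint64_t tweet_id;
    };

    void on_new_tweet(const Tweet& t,
                      const std::vector<uint64_t>& follower_ids,
                      const std::unordered_set<uint64_t>& hot_accounts,
                      const std::function<void(uint64_t, const Tweet&)>& push_to_timeline) {
        if (hot_accounts.count(t.author_id)) {
            // Popular author: store the tweet once and let readers pull it;
            // caches close to users absorb the read load for the ~1k hottest accounts.
            return;
        }
        // Ordinary author: write the tweet into each follower's timeline shard.
        // Follower lists are small here, so this stays cheap and parallelises well.
        for (uint64_t follower : follower_ids) {
            push_to_timeline(follower, t);
        }
    }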
Honestly, with the correct architecture, languages and tooling, it could be handled by an experienced 50-person dev team plus another hundred in ops. Obviously Twitter doesn't have the perfect setup, so maybe an order of magnitude more? And if you throw a bunch of subpar engineers and tooling at the problem, nothing can dig you out of the inefficiencies at this scale anyway.
And no, I'm not wildly optimistic here. StackOverflow still runs off of 9 on-prem servers [0]. I've seen message queues that can push 200M notifications per second on a single machine (written in C++, for HFT). This stuff is hard, yes, but throwing more bodies at it doesn't help once your fundamentals are solved.
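For reference, the single-machine queues I mean are typically built on lock-free single-producer/single-consumer ring buffers. A minimal sketch (not the actual HFT code; a real one adds batching and cache-line padding between head and tail):

    // Minimal SPSC ring buffer; capacity must be a power of two.
    #include <atomic>
    #include <cstdint>
    #include <optional>
    #include <vector>

    template <typename T>
    class SpscQueue {
    public:
        explicit SpscQueue(std::size_t capacity_pow2)
            : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {}

        bool push(const T& v) {  // producer thread only
            const std::size_t head = head_.load(std::memory_order_relaxed);
            if (head - tail_.load(std::memory_order_acquire) == buf_.size())
                return false;  // full
            buf_[head & mask_] = v;
            head_.store(head + 1, std::memory_order_release);
            return true;
        }

        std::optional<T> pop() {  // consumer thread only
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            if (tail == head_.load(std::memory_order_acquire))
                return std::nullopt;  // empty
            T v = buf_[tail & mask_];
            tail_.store(tail + 1, std::memory_order_release);
            return v;
        }

    private:
        std::vector<T> buf_;
        const std::size_t mask_;
        std::atomic<std::size_t> head_{0};  // next slot to write
        std::atomic<std::size_t> tail_{0};  // next slot to read
    };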
> WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.
They switched to a new protocol and grew from 200 million to, I'd guess, about a billion users since 2013. If you believe a team of 50 developers could deal with that without causing extensive downtime and service disruption along the way, I pray you never, ever manage software engineers.
> Sharding and fan-out solves the burstiness.
Great, at least it's no longer "just add a CDN to solve 99%" here ;)
> And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as Twitter.
Yeah, again, WhatsApp has many users, but as an engineer you never have to worry about delivering a message instantly to more than 32 people (512 as of this year), and you never have to account for moderating any of it, because it's e2ee and there are no adverts next to the messages. It's basically dumb pipes terminated by one native client. Twitter has to maintain a mix of automated and human review of all UGC, and it's accessible via extensive APIs and a search-engine-indexed web app in addition to the native client.
> Then there are DMs
Let's ignore Twitter's DMs; even without them, it's far more complex and demanding than an IM app.
> StackOverflow still runs off of 9 on-prem servers [0].
Yeah, and SO's maintenance page or read-only mode goes up about once a month and lasts tens of minutes. What are you even talking about, bringing up a niche programmer-oriented help forum for comparison here?
You may be stuck in the era when Twitter was a RoR-based microblogging platform. It hasn't been that for years.
I do manage software engineers, focusing on HPC (image processing primarily), and one thing I consistently see from people who work with 'classic' web tech is underestimating what modern hardware can do.
This isn't 2005 anymore: we have multiple parallel 40 Gb LAN links, 64 cores per socket with 2 MB of L2 (!!!) cache per core, and a full terabyte of RAM (!!!) per server. If you program with anything that allows cache-aware data structures and avoids pointer chasing, your throughput will be astounding and latency will be sub-millisecond. How else do you think WhatsApp managed ~3M clients connected per server without just the in-flight messages overflowing memory, on top of all the TCP connection state?
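If you want to see the cache effect for yourself, here is a toy comparison (exact numbers will vary by machine): summing the same values from a contiguous vector vs. a pointer-chased linked list. The contiguous version is typically several times faster, because every cache line fetched is fully used and the prefetcher can stay ahead.

    // Contiguous memory vs. pointer chasing: same work, very different speed.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr std::size_t n = 10'000'000;
        std::vector<uint64_t> contiguous(n, 1);
        std::list<uint64_t> chased(contiguous.begin(), contiguous.end());

        auto time_sum = [](const auto& c) {
            auto t0 = std::chrono::steady_clock::now();
            uint64_t s = std::accumulate(c.begin(), c.end(), uint64_t{0});
            auto t1 = std::chrono::steady_clock::now();
            return std::pair{s, std::chrono::duration<double, std::milli>(t1 - t0).count()};
        };

        auto [sum_vec, ms_vec] = time_sum(contiguous);
        auto [sum_list, ms_list] = time_sum(chased);
        std::printf("vector: %llu in %.1f ms, list: %llu in %.1f ms\n",
                    (unsigned long long)sum_vec, ms_vec,
                    (unsigned long long)sum_list, ms_list);
    }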
Things only get slow when scripting languages, serialisation, network calls and neural networks get involved. (AKA "I don't care if you want Docker; a function call is 10,000x faster than getting a response over gRPC, and putting that in the hot loop will increase our hardware requirements by 20x.")
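Back-of-envelope, hedging on the exact numbers: a hot in-process function call costs on the order of nanoseconds, while a gRPC round trip inside a datacenter, once you count serialisation, the network stack and the wire, typically lands in the tens to hundreds of microseconds, and that ratio is where figures like 10,000x come from.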
The more distributed your architecture, the more network overhead you introduce and the more machines you need. Running the WhatsApp way, with fewer, higher-performance servers, simply scales better. Given the hardware improvements since 2013 alone, there was no reason for WhatsApp to change their architecture as they grew.
And if you think rolling out a new protocol while maintaining backwards compatibility is hard and somehow adding more people will help, I have a team of engineers from Accenture to sell you. I did this straight out of university, to thousands of remote devices, over 2G networks, with many of the devices offline for months between connections. You just need a solid architecture, competent people and (I can't stress this enough) excellent testing, both automated and manual. The team that did this was 6 engineers, and it wasn't their only responsibility.
0. https://www.datacenterdynamics.com/en/news/stack-overflow-st...