WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.
The push notification argument is also overstated. Sharding and fan-out solve the burstiness. And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as from Twitter. Besides, these days the push notifications go through Google/Apple servers anyway, to reduce the number of open connections needed on the phone side.
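To make the fan-out point concrete, here is a minimal sketch of hash-based sharding for notification fan-out. The shard count, types and queue-per-shard layout are assumptions for illustration, not anyone's actual setup:

    // Hash-based sharding for push-notification fan-out (illustrative only).
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    constexpr std::size_t kShardCount = 64;  // hypothetical shard count

    struct Notification {
        uint64_t recipient_id;
        std::string payload;
    };

    // Each shard owns its own outbound queue, so a burst only loads the
    // shards its recipients hash to.
    std::size_t shard_for(uint64_t user_id) {
        return std::hash<uint64_t>{}(user_id) % kShardCount;
    }

    // Fan a batch out into per-shard buckets; each bucket is then drained
    // by that shard's worker and handed to APNs/FCM.
    std::vector<std::vector<Notification>> fan_out(const std::vector<Notification>& batch) {
        std::vector<std::vector<Notification>> buckets(kShardCount);
        for (const auto& n : batch) {
            buckets[shard_for(n.recipient_id)].push_back(n);
        }
        return buckets;
    }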
Then there are DMs. They are per person, so CDNs don't help much (beyond static assets), but they also shard basically perfectly. So, shard them.
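What "shard them" means in practice, as a hedged sketch: key each conversation by the unordered pair of participants so a DM thread always lands on one shard. The constants and the pair hash below are made up for illustration:

    // Shard key for DMs keyed by the unordered participant pair (illustrative).
    #include <algorithm>
    #include <cstdint>

    constexpr uint64_t kDmShards = 256;  // hypothetical shard count

    // The same two users always map to the same shard, regardless of who sends,
    // so a DM thread never needs cross-shard coordination.
    uint64_t dm_shard(uint64_t user_a, uint64_t user_b) {
        uint64_t lo = std::min(user_a, user_b);
        uint64_t hi = std::max(user_a, user_b);
        uint64_t key = lo * 1000003ULL ^ hi;  // cheap pair hash, good enough for a sketch
        return key % kDmShards;
    }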
Which in the end leaves the user feeds. Designed correctly, sharding would work extremely well, and whatever doesn't shard cleanly could be handled by caching closer to users for the ~1k most popular accounts.
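Roughly what I have in mind, as a sketch under stated assumptions (the hot-account set, names and callback are invented for illustration): fan-out-on-write for ordinary accounts, fan-out-on-read plus caching near users for the handful of very popular ones.

    // Hybrid fan-out: push for ordinary accounts, pull + cache for hot ones.
    #include <cstdint>
    #include <functional>
    #include <unordered_set>
    #include <vector>

    struct Tweet {
        uint64_t author_id;
        uint64_t tweet_id;
    };

    void on_new_tweet(const Tweet& t,
                      const std::vector<uint64_t>& follower_ids,
                      const std::unordered_set<uint64_t>& hot_accounts,
                      const std::function<void(uint64_t, const Tweet&)>& push_to_timeline) {
        if (hot_accounts.count(t.author_id)) {
            // Popular author: store the tweet once and let readers pull it;
            // caches close to users absorb the read load for the ~1k hottest accounts.
            return;
        }
        // Ordinary author: write the tweet into each follower's timeline shard.
        // Follower lists are small here, so this stays cheap and parallelises well.
        for (uint64_t follower : follower_ids) {
            push_to_timeline(follower, t);
        }
    }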
Honestly, with the correct architecture, languages and tooling, it could be handled by an experienced 50-person dev team plus another hundred in ops. Obviously Twitter doesn't have the perfect setup, so maybe an order of magnitude more? And if you throw a bunch of subpar engineers and tooling at the problem, nothing can dig you out of the inefficiencies at this scale anyway.
And no, I'm not wildly optimistic here. StackOverflow still runs off of 9 on-prem servers [0]. I've seen message queues that can push 200M notifications per second on a single machine (written in C++, for HFT). This stuff is hard, yes, but throwing more bodies at it doesn't help once your fundamentals are solved.
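For reference, the single-machine queues I mean are typically built on lock-free single-producer/single-consumer ring buffers. A minimal sketch (not the actual HFT code; a real one adds batching and cache-line padding between head and tail):

    // Minimal SPSC ring buffer; capacity must be a power of two.
    #include <atomic>
    #include <cstdint>
    #include <optional>
    #include <vector>

    template <typename T>
    class SpscQueue {
    public:
        explicit SpscQueue(std::size_t capacity_pow2)
            : buf_(capacity_pow2), mask_(capacity_pow2 - 1) {}

        bool push(const T& v) {  // producer thread only
            const std::size_t head = head_.load(std::memory_order_relaxed);
            if (head - tail_.load(std::memory_order_acquire) == buf_.size())
                return false;  // full
            buf_[head & mask_] = v;
            head_.store(head + 1, std::memory_order_release);
            return true;
        }

        std::optional<T> pop() {  // consumer thread only
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            if (tail == head_.load(std::memory_order_acquire))
                return std::nullopt;  // empty
            T v = buf_[tail & mask_];
            tail_.store(tail + 1, std::memory_order_release);
            return v;
        }

    private:
        std::vector<T> buf_;
        const std::size_t mask_;
        std::atomic<std::size_t> head_{0};  // next slot to write
        std::atomic<std::size_t> tail_{0};  // next slot to read
    };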
> WhatsApp has scaled less than 10x since acquisition. They used to handle ~3M open TCP connections per server, and as a result could run their entire operation with under 300 servers.
They switched to a new protocol and grew from 200 million to, I'd guess, about a billion users since 2013. If you believe a team of 50 developers could deal with that without causing extensive downtime and service disruption along the way, I pray you never, ever manage software engineers.
> Sharding and fan-out solves the burstiness.
Great, at least it's no longer "just add a CDN to solve 99%" here ;)
> And people overall receive a similar number of messages (and thus push notifications) from WhatsApp as Twitter.
Yeah, again, WhatsApp has many users, but as an engineer you never have to worry about delivering a message instantly to more than 32 people (512 as of this year), and you never have to account for moderating any of it, because it's e2ee and there are no adverts next to the messages. It's basically dumb pipes terminated by one native client. Twitter has to maintain a mix of automated and human review of all UGC, and it's accessible via extensive APIs and a search-engine-indexed web app in addition to the native client.
> Then there are DMs
Let's ignore Twitter's DMs; even without them, it's far more complex and demanding than an IM app.
> StackOverflow still runs off of 9 on-prem servers [0].
Yeah, and SO's maintenance page or read-only mode goes up about once a month and lasts tens of minutes. What are you even talking about, bringing up a niche programmer-oriented help forum for comparison here?
You may be stuck in the era when Twitter was a RoR-based microblogging platform. It hasn't been that for years.
I do manage software engineers, focusing on HPC (image processing primarily), and one thing I consistently see from people who work with 'classic' web tech is underestimating what modern hardware can do.
This isn't 2005 anymore: we have multiple parallel 40 Gb LAN links, 64 cores per socket with 2 MB of L2 (!!!) cache per core, and a full terabyte of RAM (!!!) per server. If you program with anything that allows cache-aware data structures and avoids pointer chasing, your throughput will be astounding and latency will be sub-millisecond. How else do you think WhatsApp managed ~3M clients connected per server without just the in-flight messages overflowing memory, on top of all the TCP connection state?
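If you want to see the cache effect for yourself, here is a toy comparison (exact numbers will vary by machine): summing the same values from a contiguous vector vs. a pointer-chased linked list. The contiguous version is typically several times faster, because every cache line fetched is fully used and the prefetcher can stay ahead.

    // Contiguous memory vs. pointer chasing: same work, very different speed.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <list>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr std::size_t n = 10'000'000;
        std::vector<uint64_t> contiguous(n, 1);
        std::list<uint64_t> chased(contiguous.begin(), contiguous.end());

        auto time_sum = [](const auto& c) {
            auto t0 = std::chrono::steady_clock::now();
            uint64_t s = std::accumulate(c.begin(), c.end(), uint64_t{0});
            auto t1 = std::chrono::steady_clock::now();
            return std::pair{s, std::chrono::duration<double, std::milli>(t1 - t0).count()};
        };

        auto [sum_vec, ms_vec] = time_sum(contiguous);
        auto [sum_list, ms_list] = time_sum(chased);
        std::printf("vector: %llu in %.1f ms, list: %llu in %.1f ms\n",
                    (unsigned long long)sum_vec, ms_vec,
                    (unsigned long long)sum_list, ms_list);
    }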
Things only get slow when scripting languages, serialisation, network calls and neural networks get involved. (AKA "I don't care if you want Docker; a function call is 10,000x faster than getting a response over gRPC, and putting that in the hot loop will increase our hardware requirements by 20x.")
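Back-of-envelope, hedging on the exact numbers: a hot in-process function call costs on the order of nanoseconds, while a gRPC round trip inside a datacenter, once you count serialisation, the network stack and the wire, typically lands in the tens to hundreds of microseconds, and that ratio is where figures like 10,000x come from.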
The more distributed your architecture, the more network overhead you introduce and the more machines you need. Running the WhatsApp way, with fewer, higher-performance servers, simply scales better. Given the hardware improvements since 2013 alone, there was no reason for WhatsApp to change their architecture as they grew.
And if you think rolling out a new protocol while maintaining backwards compatibility is hard and somehow adding more people will help, I have a team of engineers from Accenture to sell you. I did this straight out of university, to thousands of remote devices, over 2G networks, with many of the devices offline for months between connections. You just need a solid architecture, competent people and (I can't stress this enough) excellent testing, both automated and manual. The team that did this was 6 engineers, and it wasn't their only responsibility.
0. https://www.datacenterdynamics.com/en/news/stack-overflow-st...