My experience is ~2.5 years as a senior engineer on the Observability team at Twitter (2013-2015). I worked on the migration from Cassandra to Manhattan (mentioned in the post), the new alerting infra, and query language design, among many other things. Your adtech experience should make you particularly sensitive to query latencies, so I find it interesting that you're glossing over that.

Yes, only 2% of metrics are ever read. That was the same back then too. The kicker is that you can't be sure which ones will be read, so the systems are built to read any metric with the same SLA. This is especially critical during an outage, when engineers need to quickly read metrics that in many cases were written only a few seconds ago, that they wouldn't otherwise read, and that are not part of any configured alert.

Additionally, the alerting infrastructure that runs on top of the TSDB is configured with 10k+ queries that run every minute. So even at 2% read, you're querying them over and over and over again because you always need the latest data plus whatever trailing data is needed to fulfill the alerting needs (trailing 10m, hour, day, month, etc). This also makes it a particularly hard caching problem.
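To put rough numbers on that, here's a back-of-the-envelope sketch. The 10k queries and the one-minute evaluation interval come from above; the one-minute data resolution, the one-series-per-query simplification, and the function names are my assumptions for illustration, not how the real system was laid out.

```python
MINUTES_PER_DAY = 24 * 60


def reads_per_point(window_minutes: int, eval_interval_minutes: int = 1) -> int:
    """Number of evaluations that touch one data point before it ages out of the trailing window."""
    return window_minutes // eval_interval_minutes


def points_scanned_per_day(
    num_queries: int,
    window_minutes: int,
    eval_interval_minutes: int = 1,
    resolution_minutes: int = 1,  # assumption: one data point per minute per series
) -> int:
    """Total data points read per day if every query rescans its full trailing window each run."""
    evals_per_day = MINUTES_PER_DAY // eval_interval_minutes
    points_per_eval = window_minutes // resolution_minutes
    return num_queries * evals_per_day * points_per_eval


if __name__ == "__main__":
    for label, window in [("10m", 10), ("1h", 60), ("1d", MINUTES_PER_DAY)]:
        total = points_scanned_per_day(10_000, window)
        print(f"trailing {label}: each point re-read {reads_per_point(window)}x, "
              f"{total:,} points scanned per day")
```

Under those assumptions, a cache can absorb the older part of each window, but every run still needs the newest minute, which by definition has never been read before.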

I can't speak to the state of GCP at Twitter. I know they had started to migrate some things but when I was there it was all colo.

Could they use BQ? Sure, it could probably be tooled with partitions/caching/etc. to work, I guess, but at the end of the day that's not what BQ is designed for, which is data warehousing and BI. Could they have used something that wasn't a custom-built in-house database? Yeah, almost certainly. (It's no secret that there has been some NIH going on at Twitter since well before this.)

This happens all the time on HN, so I'm not here to call you out specifically... but it's really easy to read a company's blog post about their infrastructure decisions and immediately scoff and say "psh, why didn't they just use X?"... as if reading a single blog post gives you the same context the engineering team has.