Yesterday's Google outage was a pool of routing servers crashing

Kaknut · on Sept 26, 2020

Dependencies of these big providers like Google, Microsoft, Cloudflare are increasing which results to failure on a wide scale even if one fails. Distribution is the key.

auganov · on Sept 26, 2020

Well for the vast majority of simple apps you're better off failing when everybody else is. People will blame it on you less. When your alternative solution fails and everything else seems to be up the blame will fall on you.

Kaknut · on Sept 26, 2020

I always prefer to have a backup solution which could at least crawl during these situation if not able to walk. I see many SaaS relying only on google/twitter/fb auth but they need to understand that having own system too won't harm them much.

tyingq · on Sept 26, 2020

Google could probably do a better job here and not put so many services on the same pool of L7 devices. Separate pools with smaller groupings would reduce the blast radius.

tyingq · on Sept 25, 2020

In a follow up tweet, he mentions a post mortem is coming. Where would that get posted?

kyrra · on Sept 25, 2020

It tends to be linked from the outage itself, probably here: https://status.cloud.google.com/incident/zall/20010

For example, PM was posted on this previous outage: https://status.cloud.google.com/incident/cloud-networking/19...

advisedwang · on Sept 25, 2020

As this impacted G Suite much more than GCP, it's possible it will be posted on https://www.google.com/appsstatus instead.

heartbeats · on Sept 25, 2020

Anyone know what routing servers? BGP?

kyrra · on Sept 25, 2020

(Googler, opinion is my own, I know nothing about this specific outage).

Google has LOTS of internal routing systems. BGP is about anouncing what IPs a given network can handle, which is not the case here.

Before hitting application level routing, I believe you hit the Maglev[0]. Seems unlikely this was the cause, as it would likely take down all services.

One of the first application layers balancers you hit that is well known is the GFE[1][2]. This is similar to an HTTP reverse proxy, but Google made. I could definitely see this as the cause.

[0] https://static.googleusercontent.com/media/research.google.c...

[1] https://cloud.google.com/security/infrastructure/design#goog...

[2] https://landing.google.com/sre/workbook/chapters/managing-lo...

tyingq · on Sept 25, 2020

Does that match the list of reported stuff that was down? It appeared to hit a wide range of services. Gmail, Analytics, GKE, Google Keep, Meet, YouTube, GCE buckets, Sheets, Docs, Calendar, Stadia, Firebase, Voice, Music, Nest. From the thread: https://news.ycombinator.com/item?id=24585478

the-rc · on Sept 25, 2020

Neither Maglev nor GFE are usually tied to a specific service nowadays, so it could still be either of them. Way back when, some teams or services such as Checkout had to run their own private pool of GFEs. Given Urs' mention of backends, I am slightly inclined toward GFE.

MyelinatedT · on Sept 25, 2020

Sounds more like an application load balancer issue ("routing requests" seems to imply L7) than network routing, but I might be misunderstanding.

mcpherrinm · on Sept 25, 2020

I don't know the details of Google's networks, but I assume something like their Maglev load balancers: https://research.google/pubs/pub44824/

enneff · on Sept 26, 2020

Traffic entering Google's network hits a bunch of front ends that route traffic to the relevenat back ends. I'd guess it's those application-level front ends that were having trouble, rather than anything network-level like BGP.

skim_milk · on Sept 25, 2020

There's a huge """secret""" Google data center in Council Bluffs, Iowa that appears to be in the finishing phases of completion. I talked yesterday to a union worker who is moving to Des Moines to work on a new Microsoft data center there tonight, it appears that work is drying up at this data center here and a lot of the travelling blue collar folk are leaving this area.

I wonder if this data center coming apparently partially online is a part of the problem?

Also, after this he is likely to work on an Amazon fulfillment center next year - impressed by all the (albeit temporary) blue collar jobs created by FAANG at the moment!

jeffbee · on Sept 25, 2020

LOL you mean this one that has a big sign out front that's been there for years and years?

https://www.google.com/maps/@41.2197694,-95.8658016,3a,89.3y...

tedd4u · on Sept 26, 2020

OMG just at the end of the fence there's A BACKHOE RIGHT NEXT TO BURIED FIBER INDICATOR POLES!!!

https://www.google.com/maps/@41.2196753,-95.8611598,3a,75.4y...

jeffbee · on Sept 26, 2020

The predator and the prey in their natural habitat.

ma2rten · on Sept 26, 2020

tidepod12 · on Sept 26, 2020

FWIW he's probably talking about this one, which is newer/still under construction and not as well known (although it's certainly not "secret").

https://goo.gl/maps/sdJYnU4dsSNedrLfA

skim_milk · on Sept 26, 2020

I should have been less esoteric, but yes this one. From what I can tell a gaggle of Google employees have taken over control of the building and I'd assume this absolute unit is in the process of coming online. Sure there's a Google sign visible from the private road leading up to it but you'd still have to be a nosy local (or apparently an all-knowing HN reader) to know the exact location.

jeffbee · on Sept 26, 2020

Even that one opened 7 years ago.

tidepod12 · on Sept 26, 2020

One building of it opened in 2013 (and another in 2016 AFAIK), but it is currently still under construction (in the Google Maps view, the entire construction site south of the completed buildings are also slated to be Google DCs).

joshuamorton · on Sept 25, 2020

Also like: https://www.google.com/about/datacenters/locations/

tidepod12 · on Sept 26, 2020

Aren't there GCP regions in Los Angeles and Salt Lake City? Interesting that those DCs don't seem to be on this public list.

skim_milk · on Sept 26, 2020

SteveNuts · on Sept 25, 2020

I think by now Google has lots of experience with bringing data centers online, so this seems incredibly far fetched to me.

qmarchi · on Sept 26, 2020

Not super well known, but Google has quite a few Datacenters that aren't used for Google Cloud, but are reserved for internal use.

throwawayinfo · on Sept 26, 2020

Probably the other way around, no?

cwxm · on Sept 26, 2020

no he was correct

throwawayinfo · on Sept 25, 2020

Spoiler alert: it's not.

rezonant · on Sept 26, 2020

Oof. This DC has been around since 2007. I can make a VM in it on GCP right now :-)