It seems that in modern large-scale systems, networking continues to be one of the few things where a seemingly small and inconsequential change can take down entire cloud providers and highly redundant systems. It makes sense, as networking is the fabric connecting all systems together, but each time an incident like this occurs I'm reminded of just how important networking is.
Network engineers and the people handling network ops always amaze me.
IME Network engineers put too much faith in vendors. They think "the vendor says this is a resilient virtual chassis so it can't break", rather than thinking "ok, if this breaks what happens"
A crash affecting both sides of a "resilient" virtual chassis I had to work with took a major broadcast off air last year (it was a last-minute favour I was doing, and I rerouted to a tertiary route within a couple of minutes).
Meanwhile I ran a rather large event going out to some hundred million listeners via two crappy £300 switches which were completely independent of each other, into two independent routers, running via two separate systems (one on a UPS, one on mains). If one of them broke the other one was completely independent and the broadcast would have continued just fine.
As far as I am concerned, that is far better than a virtual chassis.
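To put that comparison in rough numbers, here's a toy Python sketch (not from either incident; the failure probabilities are invented purely for illustration): two genuinely independent chains only drop the broadcast if both fail at the same time, whereas a virtual chassis adds a correlated failure mode that the "redundancy" inside the box can't remove.

    # Toy model: invented probabilities, purely illustrative.
    p_switch = 0.01        # assumed chance one cheap independent switch dies during the event
    p_chassis_bug = 0.005  # assumed chance a software bug takes out BOTH members of the chassis

    # Two independent chains: outage only if both fail at once.
    p_independent_outage = p_switch * p_switch

    # Virtual chassis: one correlated bug is enough, however resilient it looks on paper.
    p_chassis_outage = p_chassis_bug

    print(f"independent chains: {p_independent_outage:.4%}")
    print(f"virtual chassis:    {p_chassis_outage:.4%}")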
This may be true of enterprise network engineers but I’ve worked across a lot of very large networks (telco, not cloud) and we never ever trust the vendor.
The kinds of bugs I've read about in errata notes over the years are wild and truly unpredictable.
Enterprise is definitely different - network guys need multiple customers to develop the vendor skepticism. I used to get into brutal internal fights with network directors over whatever bullshit the Cisco salesman said offhand that was treated as though it was delivered by Moses off the mountain. One guy tried to get me fired because I offended an SE. lol.
I worked on systems and platforms at the time, and we were more cynical even about vendors we liked.
It wouldn't be the first time that your redundant vendors end up sharing a conduit for a bunch of fiber somewhere. Guess where that backhoe will start digging?
Redundant vendors in the GP’s context referred to using multiple router vendors, eg Cisco and Juniper.
Using multiple connectivity vendors doesn't guarantee path diversity. Demanding fibre maps, ensuring that your connectivity has separate points of entry into the building and doesn't cross paths outside the building, and validating with your DC provider that your cross connects aren't crossing either -- that is what guarantees path diversity / redundancy.
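As a rough illustration of that kind of validation (a hypothetical Python sketch; the segment names and routes are invented, and in reality the data would come from the vendor's fibre maps and the DC's cross-connect records): model each circuit as the list of physical segments it traverses and flag anything two supposedly diverse paths have in common.

    def shared_segments(route_a, route_b):
        """Return physical segments that appear on both routes."""
        return set(route_a) & set(route_b)

    # Invented example routes: entry points, ducts, manholes, exchanges.
    circuit_a = ["entry-north", "duct-12", "manhole-7", "exchange-A"]
    circuit_b = ["entry-south", "duct-40", "manhole-7", "exchange-B"]

    overlap = shared_segments(circuit_a, circuit_b)
    print("shared segments:", sorted(overlap) if overlap else "none, paths look diverse")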
It's a bit of both. Internationally I find I can't trust the network maps of the connectivity vendors, and I'm better off going with two separate companies (ones which are part of different subsea cables -- e.g. Wiocc on Eassy and Safaricom on TEAMS).
Of course I had one failure in Delhi which the provider blamed on 5 separate fibre cuts. Long-distance circuits can run through areas where they can sustain multiple cuts across a wide region (regional flooding is a good one), and fixing isn't instant. This can be mitigated a little, but you still end up with circuit issues -- I had two fibre runs into Shetland the other month. The first one was cut, c'est la vie. The second one was cut, and I had to use a very limited RF link. There's only so much you can do.
On the other hand, I've just been given a BT Openreach plan which lists any pinch points of a new RO2 EAD install. I can see that the closest the two routes get along the way is about 400m (aside from the end point, of course), and experience has taught me I can trust it.
The GP was clearly talking about whole networks, not just the hardware vendors; if I read that differently from what the GP intended, I'll wait for their correction.
One of the problems I've seen in practice is that, with the degree of virtualization at play, it has become much easier in principle to be guaranteed 100% independence, and at the same time much harder in practice to verify that this is actually the case, because of all the abstraction layers underneath the topology. One of my customers specializes in software that allows one to make such guarantees, and this is a non-trivial problem, to put it mildly, especially when the situation becomes more dynamic due to outages from various causes.
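A toy example of why this is hard (hypothetical Python; the element names and layer mappings are invented): two links that look independent at the top layer can resolve to the same physical element a few abstraction layers down, so any real check has to expand every layer of the mapping, and keep doing so as the topology changes.

    # Invented layer mapping: logical link -> virtual circuit -> physical fibre -> conduit.
    DEPENDS_ON = {
        "vpn-A": ["wave-1"],
        "vpn-B": ["wave-2"],
        "wave-1": ["fibre-pair-9"],
        "wave-2": ["fibre-pair-9"],   # the surprise: both waves ride the same fibre pair
        "fibre-pair-9": ["conduit-main-st"],
    }

    def footprint(element):
        """Expand an element into everything it transitively depends on."""
        result = set()
        for child in DEPENDS_ON.get(element, []):
            result.add(child)
            result |= footprint(child)
        return result

    shared = footprint("vpn-A") & footprint("vpn-B")
    print("shared underlying elements:", sorted(shared) or "none")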
In London I can literally follow the map from manhole to manhole, exchange to exchange. It's dark fibre so I can flash a light down it and a colleague can see it emerge at the other end. Now it's possible they don't follow the map and still make it to the other end, but it's pretty unlikely.
Sometimes of course you have to make judgement calls. From one location near Slough I have a BT EAD2 back to my building a few miles away. I know the route into my building; I can see the cables with my own eyes going in different directions. BT tell me which exchanges those cables go to, and provide me with a 1:1000 scale map out into the field showing the cables coming in down a shared path. Sure, it's possible BT are lying, but it's unlikely. I only use that location sporadically, and when I do it's a managed location, so I can accept the risk of a digger on the ground.
Another location in Norfolk has two BTNet lines, going to two different exchanges. They meet at the edge of the farm and go up the same trunk. That's fine, I can physically control the single point of failure there too. If peering between BT and my network fails then I'm screwed, but I have a separate pinnacom circuit in a crunch.
Now obviously some failures become far harder to mitigate. A failure of the Thames Barrier would cause a hell of a lot of problems in Docklands, and I'm not sure whether any circuits in/out of places like telehouse, sovhouse, etc. would remain. Cross that bridge etc. Whether my electricity provider would keep going with a loss of the internet is another matter, so then it comes down to how much oil there is in the generators, and in the generators of any repeaters on the routes of my network.
However, the much easier problem to avoid is some shitty stacked switch that the salesman says will always work.
I have to trust the dark fibre map provided, but I know exactly which way it ran, manhole to manhole. I had three cores; they shared the first 20 metres to the manhole, and it's unlikely there would be a backhoe digging underneath the police van and pile of scaffolding that were parked over the shared conduit.
After that it went on different paths to three different buildings, which from each of those was then routed independently.
We take physical resilience seriously, as it isn't network engineers that do that part of the infrastructure. Enterprise network engineers then throw it all away by stacking their switches into a single point of logical failure.
(Still had a non-IP backup, but sometimes that breaks too -- just in different ways than the IP)
Yes, exactly. Most really mission critical places do exactly that.
The first time I saw something like that put into practice was when an experiment in the oil and gas industry that was scheduled to run for years delivered its network design. Relative to the running cost of the experiment the extra network wasn't a big deal, but a service interruption would have been, and it would have forced them to restart the whole thing from scratch. It's more than a decade ago and I forget what the exact context was, but the whole thing was fascinating from a redundancy perspective, as was the degree of thinking that had gone into the risk assessment. Those guys really knew their business. Also, the amount of data the experiment was expected to generate was off the scale: multiple petabytes, which at the time (a decade ago or so) was a non-trivial amount of data.
yes, instead of one network, many independent networks which then can get connected together, forming a network of networks, some kind of inter-network!
This doesn't really make sense. The modern WAN operates on multiple independent networks - SD-WANs, multiple transit providers, fiber-ring MPLS, EVPN etc. If you propagate a bad network change throughout your autonomous system or backbone you can still have an outage on your hands.
That still doesn't make sense though. In the context of a WAN, a backbone is an external network; it routes between your POPs. At any rate, the margin for error and the complexity of having two separate backbone networks managed by two separate teams would likely result in more network issues, not fewer. The whole point of having an AS is having a coherent routing policy.
The parent was stating that two networks would be better but that it's not done because of costs. And that's complete nonsense.
The fact that it's more difficult and complex to have two separate teams manage two separate networks means it's more prone to error and misconfiguration. The reason it's not done has nothing to do with financial costs; it's that it makes no sense, for the very reason I just mentioned.
Actually I have seen a setup that was quite close to this. Two separate networks, one of them was completely isolated from another, didn’t have Internet access and used a separate set of network equipment. On top of that, the building itself had two entrances - one for the boss and another one for the personnel. You physically couldn’t get from one part of the building to another. It didn’t help the boss though - he was blown up in his car one day. Fun times.
You're talking like multihoming doesn't work. Sure there are cases where bugs or bad configs can propagate across ASes but most of the time you can survive if one provider goes down.
And that's exactly where the whole "have two backbones managed by separate teams instead of one" argument stops. If someone pushes out an incorrect network config to the end box, then all that "let's have two of everything" becomes completely worthless. And as for multihoming everything and having every single box on the network act as a router: unless you are running a CDN of some sort, that really makes zero sense. You seem to be arguing that adding more complexity will automatically result in better reliability.
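For what it's worth, here's a tiny sketch of both points (hypothetical Python; the provider names are invented): multihoming survives the loss of one upstream, but a bad config on your own edge box takes out both at once, which is the failure mode no amount of duplication fixes.

    # Invented upstreams; "up"/"down" stands in for a usable default route.
    routes = {"transit-A": "up", "transit-B": "up"}

    def reachable(local_config_ok=True):
        """True if our own edge config is sane and at least one upstream still works."""
        if not local_config_ok:
            return False  # a fat-fingered change on our own box defeats both uplinks
        return any(state == "up" for state in routes.values())

    routes["transit-A"] = "down"                 # provider A has an outage
    print(reachable())                           # True: transit-B keeps us online
    print(reachable(local_config_ok=False))      # False: redundancy can't save us from ourselves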