The network *is* a single point of failure, even if the network itself is redund...

wmf · on Jan 30, 2023

One possible way to fix that is to replace the network with multiple independent networks. It's really expensive though.

jacquesm · on Jan 30, 2023

Yes, exactly. Most really mission critical places do exactly that.

The first time I saw something like that put into practice was when an experiment in the oil and gas industry that was scheduled to run for years delivered their network design. Relative to the runtime cost of the experiment the extra network wasn't a big deal, but a service interruption would have been, and it would have forced them to restart the whole thing from scratch. It was more than a decade ago and I forget the exact context, but the whole thing was fascinating from a redundancy perspective, as was the degree of thinking that had gone into the risk assessment. Those guys really knew their business. Also, the amount of data that experiment was expected to generate was off the scale: multiple petabytes, which at the time (a decade ago or so) was a nontrivial amount of data.

noorkersz · on Jan 31, 2023

yes, instead of one network, many independent networks which then can get connected together, forming a network of networks, some kind of inter-network!

..oh wait. see what I did? ahhAHAHA

bogomipz · on Jan 30, 2023

This doesn't really make sense. The modern WAN operates on multiple independent networks - SD-WANs, multiple transit providers, fiber-ring MPLS, EVPN etc. If you propagate a bad network change throughout your autonomous system or backbone you can still have an outage on your hands.

wmf · on Jan 31, 2023

My point is that you could apply the same principle internally; have two backbones managed by separate teams instead of one.

bogomipz · on Jan 31, 2023

That still doesn't make sense though. In the context of a WAN, a backbone is an external network. It routes between your POPs. At any rate, the margin of error and complexity in having two separate backbone networks managed by two separate teams would likely result in more network issues, not fewer. The whole point of having an AS is having a coherent routing policy.

MichaelZuo · on Jan 31, 2023

The parent never said multiple networks were easier to implement.

In fact it could easily 2x the cost for the same level of quality, which is why it's almost unheard of for cloud.

bogomipz · on Jan 31, 2023

The parent was stating that two networks would be better but it's not done because of costs. And that's complete nonsense.

The fact that it's more difficult and complex to have two separate teams manage two separate networks means it's more prone to error and misconfiguration. The reason it's not done has nothing to do with financial costs but rather because it makes no sense, for the very reason I just mentioned.

admax88qqq · on Jan 31, 2023

Two end to end networks would be more reliable.

Like two independent internets spanning from your server to my laptop.

Two completely isolated end to end transports.

That's what OP meant when they said you could make it more reliable by having a redundant network. It's just prohibitively expensive.

Then if one internet goes down in any way I talk to you over the other. That's a fairly straightforward fallback algorithm to implement.
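A minimal sketch of that fallback, assuming each independent network is exposed as a hypothetical `(name, send)` pair whose `send` raises `OSError` when that path is down. The names and the interface are illustrative only, not any real library's API:

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a transport is a (name, send) pair, and
# send raises OSError when its network cannot deliver the payload.
Transport = Tuple[str, Callable[[bytes], None]]


def send_with_fallback(transports: Sequence[Transport], payload: bytes) -> str:
    """Try each independent transport in order; return the name of the one that worked."""
    errors = []
    for name, send in transports:
        try:
            send(payload)
            return name  # first healthy network wins
        except OSError as exc:
            # This network is down; remember why and try the next one.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all transports failed: " + "; ".join(errors))
```

This only covers the case where a failure is visible to the sender as an error. It does nothing about a bad config pushed to the end box itself, which sits on both paths.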

allarm · on Jan 31, 2023

Actually I have seen a setup that was quite close to this. Two separate networks, one of them completely isolated from the other; it didn’t have Internet access and used a separate set of network equipment. On top of that, the building itself had two entrances - one for the boss and another one for the personnel. You physically couldn’t get from one part of the building to the other. It didn’t help the boss though - he was blown up in his car one day. Fun times.

wmf · on Jan 31, 2023

You're talking like multihoming doesn't work. Sure there are cases where bugs or bad configs can propagate across ASes but most of the time you can survive if one provider goes down.

bogomipz · on Jan 31, 2023

And that's exactly where the whole "have two backbones managed by separate teams instead of one" stops. If someone pushes out an incorrect network config to the end box then all that "let's have two of everything" becomes completely worthless. And as for multihoming everything and having every single box on the network act as a router: unless you are running a CDN of some sort, that really makes zero sense. You seem to be arguing that adding more complexity will automatically result in better reliability.