It seems that in modern large-scale systems, networking continues to be one of the few things where a seemingly small and inconsequential change can take down entire cloud providers and highly redundant systems. It makes sense, as networking is the fabric connecting all systems together, but each time an incident like this occurs I'm reminded of just how important networking is.
Network engineers and the people handling network ops always amaze me.
IME Network engineers put too much faith in vendors. They think "the vendor says this is a resilient virtual chassis so it can't break", rather than thinking "ok, if this breaks what happens"
A crash affecting both sides of a "resilient" virtual chassis I had to work with took a major broadcast off air last year (it was a last-minute favour I was doing, and I rerouted to a tertiary route within a couple of minutes).
Meanwhile I ran a rather large event going out to some hundred million listeners via two crappy £300 switches which were completely independent of each other, into two independent routers, running via two separate systems (one on a UPS, one on mains). If one of them broke the other one was completely independent and the broadcast would have continued just fine.
As far as I am concerned, that is far better than a virtual chassis.
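To put that comparison in rough numbers, here's a toy Python sketch (not from either incident; the failure probabilities are invented purely for illustration): two genuinely independent chains only drop the broadcast if both fail at the same time, whereas a virtual chassis adds a correlated failure mode that the "redundancy" inside the box can't remove.

    # Toy model: invented probabilities, purely illustrative.
    p_switch = 0.01        # assumed chance one cheap independent switch dies during the event
    p_chassis_bug = 0.005  # assumed chance a software bug takes out BOTH members of the chassis

    # Two independent chains: outage only if both fail at once.
    p_independent_outage = p_switch * p_switch

    # Virtual chassis: one correlated bug is enough, however resilient it looks on paper.
    p_chassis_outage = p_chassis_bug

    print(f"independent chains: {p_independent_outage:.4%}")
    print(f"virtual chassis:    {p_chassis_outage:.4%}")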
This may be true of enterprise network engineers but I’ve worked across a lot of very large networks (telco, not cloud) and we never ever trust the vendor.
The kinds of bugs I've read about in errata notes over the years are wild and truly unpredictable.
Enterprise is definitely different - network guys need multiple customers to develop the vendor skepticism. I used to get into brutal internal fights with network directors over whatever bullshit the Cisco salesman said offhand that was treated as though it was delivered by Moses off the mountain. One guy tried to get me fired because I offended an SE. lol.
I worked on systems and platforms at the time, and we were more cynical even about vendors we liked.
It wouldn't be the first time that your redundant vendors end up sharing a conduit for a bunch of fiber somewhere. Guess where that backhoe will start digging?
Redundant vendors in the GP’s context referred to using multiple router vendors, eg Cisco and Juniper.
Using multiple connectivity vendors doesn't guarantee path diversity. Demanding fibre maps, ensuring that your connectivity has separate points of entry into the building and doesn't cross paths outside the building, and validating with your DC provider that your cross connects aren't crossing either -- that is what guarantees path diversity / redundancy.
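As a rough illustration of that kind of validation (a hypothetical Python sketch; the segment names and routes are invented, and in reality the data would come from the vendor's fibre maps and the DC's cross-connect records): model each circuit as the list of physical segments it traverses and flag anything two supposedly diverse paths have in common.

    def shared_segments(route_a, route_b):
        """Return physical segments that appear on both routes."""
        return set(route_a) & set(route_b)

    # Invented example routes: entry points, ducts, manholes, exchanges.
    circuit_a = ["entry-north", "duct-12", "manhole-7", "exchange-A"]
    circuit_b = ["entry-south", "duct-40", "manhole-7", "exchange-B"]

    overlap = shared_segments(circuit_a, circuit_b)
    print("shared segments:", sorted(overlap) if overlap else "none, paths look diverse")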
It's a bit of both. Internationally I find I can't trust the network maps of the connectivity vendors, and I'm better off going with two separate companies (ones which are part of different subsea cables -- e.g. Wiocc on Eassy and Safaricom on TEAMS).
Of course I had one failure in Delhi which the provider blamed on 5 separate fibre cuts. Long-distance circuits can run through areas where they can sustain multiple cuts across a wide region (regional flooding is a good one), and fixing isn't instant. This can be mitigated a little, but you still end up with circuit issues -- I had two fibre runs into Shetland the other month. The first one was cut, c'est la vie. The second one was cut, and I had to use a very limited RF link. There's only so much you can do.
On the other hand, I've just been given a BT Openreach plan which lists any pinch points of a new RO2 EAD install. I can see that the closest the two routes get along the way is about 400m (aside from the end point, of course), and experience has taught me I can trust it.
The GP was clearly talking about whole networks, not just the hardware vendors; if I read that differently from what the GP intended, I'll wait for their correction.
One of the problems I've seen in practice is that, with the degree of virtualization at play, it has become much easier in principle to be guaranteed 100% independence, and at the same time much harder in practice to verify that this is actually the case, because of all the abstraction layers underneath the topology. One of my customers specializes in software that allows one to make such guarantees, and this is a non-trivial problem, to put it mildly, especially when the situation becomes more dynamic due to outages from various causes.
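A toy example of why this is hard (hypothetical Python; the element names and layer mappings are invented): two links that look independent at the top layer can resolve to the same physical element a few abstraction layers down, so any real check has to expand every layer of the mapping, and keep doing so as the topology changes.

    # Invented layer mapping: logical link -> virtual circuit -> physical fibre -> conduit.
    DEPENDS_ON = {
        "vpn-A": ["wave-1"],
        "vpn-B": ["wave-2"],
        "wave-1": ["fibre-pair-9"],
        "wave-2": ["fibre-pair-9"],   # the surprise: both waves ride the same fibre pair
        "fibre-pair-9": ["conduit-main-st"],
    }

    def footprint(element):
        """Expand an element into everything it transitively depends on."""
        result = set()
        for child in DEPENDS_ON.get(element, []):
            result.add(child)
            result |= footprint(child)
        return result

    shared = footprint("vpn-A") & footprint("vpn-B")
    print("shared underlying elements:", sorted(shared) or "none")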
In London I can literally follow the map from manhole to manhole, exchange to exchange. It's dark fibre so I can flash a light down it and a colleague can see it emerge at the other end. Now it's possible they don't follow the map and still make it to the other end, but it's pretty unlikely.
Sometimes of course you have to make judgement calls. From one location near Slough I have a BT EAD2 back to my building a few miles away. I know the route into my building; I can see the cables with my own eyes going in different directions. BT tell me which exchanges those cables go to, and provide me with a 1:1000 scale map out into the field showing the cables coming in down a shared path. Sure, it's possible BT are lying, but it's unlikely. I only use that location sporadically, and when I do it's a managed location, so I can accept the risk of a digger on the ground.
Another location in Norfolk has two BTNet lines, going to two different exchanges. They meet at the edge of the farm and go up the same trunk. That's fine, I can physically control the single point of failure there too. If peering between BT and my network fails then I'm screwed, but I have a separate pinnacom circuit in a crunch.
Now obviously some failures become far harder to mitigate. A failure of the Thames Barrier would cause a hell of a lot of problems in Docklands, and I'm not sure whether any circuits in/out of places like telehouse, sovhouse, etc. would remain. Cross that bridge etc. Whether my electricity provider would keep going with a loss of the internet is another matter, so then it comes down to how much oil there is in the generators, and in the generators of any repeaters on the routes of my network.
However, the much easier problem to avoid is some shitty stacked switch that the salesman says will always work.
I have to trust the dark fibre map provided, but I know exactly which way it ran, manhole to manhole. I had three cores; they shared the first 20 metres to the manhole, and it's unlikely there would be a backhoe digging underneath the police van and pile of scaffolding that were parked over the shared conduit.
After that it went on different paths to three different buildings, which from each of those was then routed independently.
We take physical resilience seriously, as it isn't network engineers that do that part of the infrastructure. Enterprise network engineers then throw it all away by stacking their switches into a single point of logical failure.
(Still had a non-IP backup, but sometimes that breaks too -- just in different ways than the IP)
Yes, exactly. Most really mission critical places do exactly that.
The first time I saw something like that put into practice was when an experiment in the oil and gas industry that was scheduled to run for years delivered its network design. Relative to the running cost of the experiment the extra network wasn't a big deal, but a service interruption would have been, and it would have forced them to restart the whole thing from scratch. It's more than a decade ago and I forget what the exact context was, but the whole thing was fascinating from a redundancy perspective, as was the degree of thinking that had gone into the risk assessment. Those guys really knew their business. Also, the amount of data the experiment was expected to generate was off the scale: multiple petabytes, which at the time (a decade ago or so) was a non-trivial amount of data.
yes, instead of one network, many independent networks which then can get connected together, forming a network of networks, some kind of inter-network!
This doesn't really make sense. The modern WAN operates on multiple independent networks - SD-WANs, multiple transit providers, fiber-ring MPLS, EVPN etc. If you propagate a bad network change throughout your autonomous system or backbone you can still have an outage on your hands.
That still doesn't make sense though. In the context of a WAN, a backbone is an external network; it routes between your POPs. At any rate, the margin for error and the complexity of having two separate backbone networks managed by two separate teams would likely result in more network issues, not fewer. The whole point of having an AS is having a coherent routing policy.
The parent was stating that two networks would be better but that it's not done because of costs. And that's complete nonsense.
The fact that it's more difficult and complex to have two separate teams manage two separate networks means it's more prone to error and misconfiguration. The reason it's not done has nothing to do with financial costs; it's that it makes no sense, for the very reason I just mentioned.
Actually I have seen a setup that was quite close to this. Two separate networks, one of them was completely isolated from another, didn’t have Internet access and used a separate set of network equipment. On top of that, the building itself had two entrances - one for the boss and another one for the personnel. You physically couldn’t get from one part of the building to another. It didn’t help the boss though - he was blown up in his car one day. Fun times.
You're talking like multihoming doesn't work. Sure there are cases where bugs or bad configs can propagate across ASes but most of the time you can survive if one provider goes down.
And that's exactly where the whole "have two backbones managed by separate teams instead of one" argument stops. If someone pushes out an incorrect network config to the end box, then all that "let's have two of everything" becomes completely worthless. And as for multihoming everything and having every single box on the network act as a router: unless you are running a CDN of some sort, that really makes zero sense. You seem to be arguing that adding more complexity will automatically result in better reliability.
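For what it's worth, here's a tiny sketch of both points (hypothetical Python; the provider names are invented): multihoming survives the loss of one upstream, but a bad config on your own edge box takes out both at once, which is the failure mode no amount of duplication fixes.

    # Invented upstreams; "up"/"down" stands in for a usable default route.
    routes = {"transit-A": "up", "transit-B": "up"}

    def reachable(local_config_ok=True):
        """True if our own edge config is sane and at least one upstream still works."""
        if not local_config_ok:
            return False  # a fat-fingered change on our own box defeats both uplinks
        return any(state == "up" for state in routes.values())

    routes["transit-A"] = "down"                 # provider A has an outage
    print(reachable())                           # True: transit-B keeps us online
    print(reachable(local_config_ok=False))      # False: redundancy can't save us from ourselves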