
> Everything above that just makes switching and buffering needlessly expensive, SRAM was hella expensive back then. Still is today (in terms of silicon area).

Why does a larger MTU make switching more expensive?

And why does it affect buffering? Won't the internal buses and data buffers of networking chips be decoupled from the MTU? Surely they'll be buffering in much smaller chunks, maybe dictated by their SRAM/DRAM technology. Otherwise, when you consider the vast number of 64B packets, buffering at 1500B granularity would be extremely expensive.




I suggest you read the paper linked by the blog author (“Ethernet: Distributed Packet Switching for Local Computer Networks”), specifically Section 6 (performance and efficiency 6.3). It will answer all your questions.

> Why does a larger MTU make switching more expensive?

Switching requires storage of the entire packet in SRAM.

Larger MTU = More SRAM chips

If the existing MTU is already 95% network efficient (see the paper), then a larger MTU is simply wasted money.
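
As a rough back-of-envelope (the per-frame overhead below is modern Ethernet framing rather than the 1976 paper's numbers, and the switch sizing is purely hypothetical), the wire efficiency gained past 1500 bytes is small, while the per-port buffer SRAM grows linearly with the MTU:

    # Rough sketch: wire efficiency vs. MTU, and per-port buffer cost.
    # Assumes modern Ethernet overhead: 14 B header + 4 B FCS + 8 B preamble
    # + 12 B inter-frame gap = 38 B per frame on the wire.
    OVERHEAD = 14 + 4 + 8 + 12

    def efficiency(mtu):
        return mtu / (mtu + OVERHEAD)

    for mtu in (576, 1500, 9000):
        print(f"MTU {mtu:5d}: {efficiency(mtu):.1%} of wire time carries payload")
    # MTU   576: 93.8%
    # MTU  1500: 97.5%
    # MTU  9000: 99.6%   <- ~2% gain for 6x the per-frame buffer

    # Store-and-forward needs at least one full frame of buffer per port;
    # real switches buffer many. Port and frame counts here are hypothetical.
    ports, frames_per_port = 48, 64
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: ~{ports * frames_per_port * mtu / 1e6:.1f} MB of buffer SRAM")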


Traditionally it's been true that you need SRAM for the entire packet, which also increases latency since you have to wait for the entire packet to arrive off the wire before retransmitting it. But modern switches are often cut-through to reduce latency: they start transmitting as soon as they have seen enough of the headers to decide where to send the frame. This also means they can't checksum the entire packet, which was another nice feature of having it all in memory. So if corruption shows up towards the end of the incoming packet, it's too late, since the start has already been sent -- most switches will then deliberately stomp on the remaining contents and send garbage so the frame fails the CRC check at the receiver.
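
As a rough sketch of the latency difference (the link speed and the number of bytes a switch needs before it can make a forwarding decision are illustrative assumptions), a store-and-forward hop pays one full frame serialization delay per hop, while a cut-through hop only waits for the headers:

    # Per-hop latency: store-and-forward vs. cut-through (illustrative numbers).
    def serialization_delay_us(nbytes, gbps):
        return nbytes * 8 / (gbps * 1e3)  # bytes -> bits, Gb/s -> bits per us

    LINK_GBPS = 10
    FRAME = 1500     # full-sized frame
    HEADER = 64      # assume the forwarding decision is made after ~64 bytes

    store_and_forward = serialization_delay_us(FRAME, LINK_GBPS)   # whole frame
    cut_through = serialization_delay_us(HEADER, LINK_GBPS)        # headers only

    print(f"store-and-forward: +{store_and_forward:.2f} us per hop")  # ~1.20 us
    print(f"cut-through:       +{cut_through:.2f} us per hop")        # ~0.05 us
    # The gap grows with frame size and hop count, which is why cut-through
    # switches accept forwarding frames whose FCS they haven't verified yet.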

Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number. Even the checksum in the TCP header stops being effective with larger frames, so you end up having to do checksums at the application level if you care about end to end data integrity.

https://tools.ietf.org/html/draft-ietf-tcpm-anumita-tcp-stro...
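
For a feel of why the TCP checksum is weak: it's the 16-bit ones'-complement Internet checksum (RFC 1071), so corruption that reorders 16-bit words, or makes offsetting changes, is invisible to it at any frame size. A minimal sketch:

    # The Internet checksum (RFC 1071): a 16-bit ones'-complement sum of
    # 16-bit words. It cannot catch reordered words (addition is commutative)
    # or offsetting changes that leave the sum unchanged.
    def internet_checksum(data: bytes) -> int:
        if len(data) % 2:
            data += b"\x00"
        total = sum(int.from_bytes(data[i:i+2], "big") for i in range(0, len(data), 2))
        while total >> 16:                       # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    good = bytes.fromhex("deadbeefcafe0001")
    swapped = bytes.fromhex("beefdeadcafe0001")   # first two words reordered

    print(hex(internet_checksum(good)))      # same checksum...
    print(hex(internet_checksum(swapped)))   # ...for corrupted data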


You're describing cut-through switching [1]. Because of its disadvantage, it is usually limited to uses that require pure performance, such as HFT (High Frequency Trading). Traditional store-and-forward switching is still commonly used (or some hybrid approach).

"The advantage of this technique is speed; the disadvantage is that even frames with integrity problems are forwarded. Because of this disadvantage, cut-through switches were limited to specific positions within the network that required pure performance, and typically they were not tasked with performing extended functionality (core)." [2]

[1] https://en.wikipedia.org/wiki/Cut-through_switching

[2] http://www.pearsonitcertification.com/articles/article.aspx?...


Most 10GE+ datacenter switches use cut-through switching; it's not uncommon at all, and not just something HFT uses.

Bitflips are very rare in a datacenter environment, and are typically caused by a bad cable that you can just replace or clean. And the CRC check is done at the receiving system or router anyway.


Cut-through switching and single-pass processing are extremely common in data center architectures. This is not only for specific use cases - it is necessary to provide capabilities beyond forwarding while still allowing maximum throughput.


> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.

Hmm. Why is this? It seems that with a CRC-32 in Ethernet (and most other layer 2 protocols), we have a guarantee of rejecting certain types of defects entirely... but mostly we're relying on the fact that each bad frame has only a 1-in-4-billion chance of being accepted. Having a bigger MTU means fewer frames to pass the same data, so it would seem to me we have a lower chance of accepting a bad frame per amount of end-user data passed.

TCP itself has a weak checksum at any length. The real risk is hosts corrupting the data between the actual CRCs in the link layer protocols. E.g. you receive a frame, the NIC sees it is good in its memory, then when it's DMA'd into bad host memory it gets corrupted. TCP's checksum is not great protection against this at any frame length.


The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect. If the bit error rate of the medium is constant, then the larger the frame, the more likely that is to occur. Also as Ethernet speeds increase, the underlying BER stays the same (or gets worse) so the chances of encountering errors in a specific time period go up. 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.
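
A quick illustration of that scaling (the BER value is an illustrative assumption, held equal for both links):

    # If the BER stays the same while the link gets faster, bit errors per
    # unit time scale with the bit rate. BER here is illustrative.
    BER = 1e-10

    for name, bps in (("10Base-T", 10e6), ("100G Ethernet", 100e9)):
        errors_per_hour = bps * 3600 * BER
        print(f"{name:14s}: ~{errors_per_hour:,.0f} bit errors/hour at full load")
    # 10Base-T      : ~4 bit errors/hour
    # 100G Ethernet : ~36,000 bit errors/hour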


Your claim was that it related to the MTU, which you're now moving away from:

> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.

Now we have a new claim:

> The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect

Yes, that's always the risk. It's not that the CRC can't detect it -- it almost certainly detects it. It's just not guaranteed to detect it.

It has nothing to do with the MTU -- even a 1500-byte MTU is much larger than the 4-octet error burst a CRC-32 is guaranteed to detect. On the other hand, an errored packet only has a 1 in 4 billion chance of getting through.
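
To put rough numbers on why the MTU largely cancels out (the BER and data volume below are illustrative assumptions, not measurements): a bigger frame is individually more likely to be corrupted, but you send fewer of them for the same amount of data:

    # Expected corrupted and undetected frames per terabyte of payload,
    # under a constant (assumed) bit error rate.
    BER = 1e-12                 # assumed bit error rate of the link
    DATA = 1e12                 # one terabyte of payload to move
    CRC_MISS = 2.0 ** -32       # fraction of corrupted frames CRC-32 accepts

    for mtu in (1500, 9000):
        frames = DATA / mtu
        p_corrupt = 1 - (1 - BER) ** (mtu * 8)   # ~ BER * bits per frame
        bad = frames * p_corrupt                 # frames with >= 1 bit error
        print(f"MTU {mtu}: ~{bad:.1f} corrupted frames/TB, "
              f"~{bad * CRC_MISS:.1e} expected to slip past the CRC")
    # Both MTUs come out to ~8 corrupted frames and ~2e-09 undetected frames
    # per TB: per byte of user data, the frame size washes out.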

> 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.

The question is: what's the errored frame rate? 100G ethernet links have far lower error rates (in CRC-errored packets per second) than the 10baseT networks I administered. I used to see a few errors per day; now I see a dozen errors on a circuit that's been up for a year (and maybe some of those were from when I was plugging it in). Roughly 1 in 4 billion of those you're going to let through undetected.
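
To put a number on that last sentence, using the dozen-errors-per-year figure above and CRC-32's roughly 1-in-4-billion residual acceptance rate:

    # Observed CRC-errored frames on the link (from the figure above) times
    # the fraction of bad frames a CRC-32 fails to catch.
    crc_errors_per_year = 12
    crc_miss = 2.0 ** -32

    undetected_per_year = crc_errors_per_year * crc_miss
    print(f"~{undetected_per_year:.1e} undetected frames/year")        # ~2.8e-09
    print(f"~one undetected frame every {1 / undetected_per_year:,.0f} years")
    # roughly one undetected frame every ~360 million years on that link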

Keep in mind that faster Ethernet standards have set tougher bit error rate requirements, and the mean time to an undetected packet error works out to something like the age of the universe if links are delivering the BER in the standard.

(Of course, there's a good chance that even those frames that do get through cause no actual problem -- and even though the TCP checksum is weak, it's still going to catch a big fraction of the remaining bad frames.)

The bigger issue is that if there's any bad memory, etc., there's no L2 CRC protecting the data most of the time. And a frame that is garbled by some kind of DMA, bus, or RAM problem while not protected by the L2 CRC has a decent risk of getting past the weak TCP checksum.



