How 1500 bytes became the MTU of the internet (benjojo.co.uk)
433 points by petercooper on Feb 19, 2020 | 162 comments



For.. reasons, I found myself having to make a 'driver' for a PoE+ sensing device this month. The manufacturer had an SDK, but compiling it requires an old version of Visual Studio - a bouquet of dependencies, and it had no OSX support. None of the bundled applications would do what I needed (namely, let me forward the raw sensing data to another application.. SOMEHOW).

The data isn't encoded in the usual ways, so even 4 hours of begging FFMPEG were to no avail.

A few glances at wireshark payloads, the roughly translated documentation, and weighing my options, I embarked on a harrowing journey to synthesize the correct incantation of bytes to get the device to give me what I needed.

I've never worked with RTP/RTSP prior to this -- and I was disheartened to see nodejs didn't have any nice libraries for them. Oh well, it's just udp when it comes down to it, right?

SO MY NAIVETE BEGOT A JOURNEY INTO THE DARKNESS. Being a bit of an unknown-unknown, this project did _not_ budget time for the effort this relatively impromptu initiative required. Between an element of sentimentality for the customer and perhaps delusions of grandeur, I convinced myself I could just crunch it out in a few days.

A blur of coffee and 7 days straight crunch later, I built a daisy chain of crazy that achieved the goal I set out for. I read rfc3550 so many times I nearly have it committed to memory. The final task was to figure out how to forward the stream I had ensorcelled to another application. UDP seemed like the "right" choice, if I could preserve the heavy lifting I had accomplished to reassemble the frames of data.. MTU sizes are not big enough to accommodate this (hence probably why the device uses RTP, LOL.). OSX supports some hilariously massive MTUs (It's been a few days, but I want to say something like 13,000 bytes?) Still, I'd have to chunk and reassemble each frame into quarters. Having to write _additional_ client logic to handle drops and OOO and relying on OSX's embiggened MTUs when I wanted this to be relatively OS independent... and the SHIP OR DIE pressure from above made me do bad. At this point, I was so crunched out that the idea of writing reconnect logic and doing it with TCP was painful so I'm here to confess... I did bad...

The client application spawns a webserver, and the clients poll via HTTP at about 30 Hz. Ahhh it's gross...

I'm basically adrift on a misery raft of my own manufacture. Maybe protobufs would be better? I've slept enough nights to take a melon baller to the bad parts..
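
A rough sketch of the chunk-and-reassemble sending side described above (with a made-up 12-byte frame-id/chunk-index/chunk-count header; this is illustrative only, not the device's actual RTP format):

    import socket
    import struct

    MTU_PAYLOAD = 1400               # conservative: stays under 1500 with IP/UDP headers
    HEADER = struct.Struct("!III")   # frame id, chunk index, chunk count (12 bytes)

    def send_frame(sock, addr, frame_id, frame):
        """Split one reassembled frame into MTU-sized UDP datagrams."""
        chunks = [frame[i:i + MTU_PAYLOAD] for i in range(0, len(frame), MTU_PAYLOAD)]
        for idx, chunk in enumerate(chunks):
            sock.sendto(HEADER.pack(frame_id, idx, len(chunks)) + chunk, addr)

    # usage:
    # sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # send_frame(sock, ("127.0.0.1", 9999), 0, reassembled_frame_bytes)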


https://en.m.wikipedia.org/wiki/Jumbo_frame

The wiki page talks about getting 5% more data through at full saturation but it doesn’t mention an important detail that I recall from when it was proposed.

It turned out with gigabit Ethernet or higher that a single TCP connection cannot saturate the channel with an MTU of 1500 bytes. The bandwidth went up but the latency did not go down, and ACKs don’t arrive fast enough to keep the sender from getting throttled by the TCP windowing algorithm.
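
To put rough numbers on the windowing point, here's a back-of-envelope sketch with assumed values (1 Gbit/s link, 1 ms RTT, and the classic 64 KiB window you get without window scaling):

    LINK_BPS = 1e9        # assumed 1 Gbit/s link
    RTT_S = 1e-3          # assumed 1 ms round-trip time

    bdp = LINK_BPS / 8 * RTT_S                       # bytes that must be in flight to fill the pipe
    print(f"bandwidth-delay product: {bdp / 1024:.0f} KiB")                 # ~122 KiB

    window = 65535                                   # 16-bit window field, no RFC 1323 scaling
    print(f"throughput capped at: {window / RTT_S * 8 / 1e6:.0f} Mbit/s")   # ~524 Mbit/s

    for mss in (1460, 8960):                         # standard vs. jumbo-frame MSS
        print(f"segments in flight at MSS {mss}: {bdp / mss:.0f}")          # ~86 vs. ~14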

If I have a typical network with a bunch of machines on it nattering at each other, that might not sound so bad. But when I really just need to get one big file or stream from one machine to another, it becomes a problem.

So they settled on a multiple of 1500 bytes to avoid uneven packet fragmentation (if you get half packets every nth packet you lose that much throughput). Somehow that multiple became 6.

And then other people wanted bigger or smaller and I’m not quite sure how OS X ended up with 13000. You’re gonna get 8x1500 + 1000 there. Or worse, 9000 + 4000.


In college I only had one group project, which scandalized me but apparently lots of others found this normal. We had to fire UDP packets over the network and feed them to an MJPeG card. You got more points based on the quality of the video stream.

My very industrious teammate did 75% of the work (4 man team, I did 20%, if you are generous with the value of debugging). One of the things we/he tried was to just drop packets that arrived out of order rather than reorder them. Turned out the reordering logic was reducing framerates. So he ran some trials and looked at OOO traffic, and across the three or so routers between source and sink he never observed a single packet arriving out of order. So we just dropped them instead and got ourselves a few more frames per second.


Tbh that's what most real time video/audio applications will do. Reordering adds latency and that is worse than the occasional dropped frame.


I can drop a frame, I can't casually drop misordered packets. It takes many packets to build a frame. I have to reorder packets within a frame (actually I just insert-in-order). If I drop packets, I get data scrolling like a busted CRT raster.

I'm using a KoalaBarrel. Koalas receive envelopes full of eucalyptus leaves. Koalas have to eat their envelopes in order. First koala to get his full subscription becomes fat enough to crush all the koalas beneath him. Keep adding koalas. Disregard letters addressed to dead koalas.
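
A rough sketch of what the barrel does, assuming packets arrive tagged with (frame id, sequence number, packet count); the names and structure are illustrative, not the actual implementation:

    class KoalaBarrel:
        """Per-frame reassembly buffers; the first frame to complete crushes older ones."""

        def __init__(self):
            self.frames = {}   # frame_id -> {seq: payload}  (each koala's envelopes)
            self.dead = set()  # frame ids already emitted or crushed (dead koalas)

        def feed(self, frame_id, seq, total, payload):
            if frame_id in self.dead:
                return None                  # disregard letters addressed to dead koalas
            self.frames.setdefault(frame_id, {})[seq] = payload
            if len(self.frames[frame_id]) < total:
                return None                  # still waiting on envelopes
            # Full subscription: emit this frame in order, crush every older koala.
            frame = b"".join(p for _, p in sorted(self.frames[frame_id].items()))
            for fid in [f for f in self.frames if f <= frame_id]:
                self.dead.add(fid)
                del self.frames[fid]
            return frame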


Not saying this is the ideal solution, but you could just drop any frame that contains any out of order packets. If an out of order packet arrives, just drop the current frame and start ignoring packets until you find the start of another frame.


This is very dependent on the frame in question; I-frames are much more valuable than P/B-frames. If you get unlucky with dropped frames you can end up showing a lot of distorted nonsense to the end user.


> embiggened

For non-native speakers: embiggened means huge, enlarged, overgrown.

I am not a native speaker of English either


*For non-Simpsons watchers

The word was created as a joke in a Simpsons episode, a word used in Springfield only. It is described as "perfectly cromulent" by a Springfielder, which is evidently meant to mean "acceptable" or "ordinary" but is another Springfieldism.

The joke may be lost on future generations who don't realise they're not normal words.


Actually, "embiggen" is a real word, though archaic; it's been around for over 130 years. The coinage of "cromulent" to describe it was the joke there, not "embiggen" itself.

Source: https://en.wiktionary.org/wiki/embiggen


The show writers thought they came up with the word on their own; they didn't know about the previous usage of the word in 1884 (the episode was written in 1996, and the internet wasn't quite as full of facts back then), so "embiggen" was still supposed to be a joke.


It was used once in 1884 and the writer there specifically said he invented it. There are no other recorded uses of the word before The Simpsons.


To be fair to everyone, I've had native English speakers tell me what I speak is barely English.


To be fair, I have had Americans compliment me on how well I have learned English when I tell them I am from the United Kingdom...


You can't expect too much from a college education https://www.youtube.com/watch?v=kRh1zXFKC_o


This needs to be its own post. Lol.


What does it sense that changes >=30 times a second?


I'm guessing video frames given ffmpeg was part of the story.


You only use the Real-time Transport Protocol (RTP) when you need time-sensitive data streaming (typically audio or video).


I was curious about that too. Lots of references to video related standards that imply it's a PoE camera, but then why isn't the data encoded in the usual ways? What does that mean?


What codec would you use for a camera that captures not RGB, but poetry of the soul?

CONTEXTLESS, HEADERLESS, ENDLESS BYTE STREAMS OF COURSE, where the literal, idealized (remember udp) position of each byte is part of a vector in a non euclidean coordinate system.


> What codec would you use for a camera that captures not RGB, but poetry of the soul?

I would love to read a collaborative work between you and James Mickens -- this genre of writing seems sadly under-present in the computing world...


I appreciate the interest in listening to a simulcast of Harvard-Professor-Collaborates-With-A-Nobody-Hobo. I'll forward this to my agent.

My agent is a tin can. I think she used to hold beans. Sometimes I put a few smashed nickels in her and rattle. While I do this, I pretend she's reading me my messages, and I'm like "oh no, I would never consent to a biopic directed by THAT charlatan." and then we laugh and laugh.

Oh how we laugh.


Sounds like Crestron, and if so I feel for you.


For 802.11, the biggest overhead is not packet headers but the randomized medium acquisition time used to minimize collisions. 1500 bytes is way too small here with modern 802.11, so if you only send one packet for each medium acquisition, you end up with something upwards of 90% overhead. The solution 802.11n and later use here is Aggregate MPDUs (AMPDUs). For each medium acquisition, the sender can send multiple packets in a contiguous burst, up to 64 KBytes. This ends up adding a lot of mechanism, including a sliding window block ack, and it impacts queuing disciplines, rate adaptation and pretty much everything else. Life would be so much simpler if the MTU had simply grown over time in proportion to link speeds.
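
A back-of-envelope sketch of that overhead, using assumed numbers (600 Mbit/s PHY rate, roughly 100 microseconds of fixed cost per medium acquisition):

    PHY_RATE_BPS = 600e6        # assumed 802.11n data rate
    ACQ_OVERHEAD_S = 100e-6     # assumed fixed cost per medium acquisition
                                # (DIFS + average backoff + preamble + block ack)

    def airtime_efficiency(payload_bytes):
        """Fraction of airtime spent on payload for one medium acquisition."""
        payload_time = payload_bytes * 8 / PHY_RATE_BPS
        return payload_time / (payload_time + ACQ_OVERHEAD_S)

    print(f"one 1500 B frame per acquisition: {airtime_efficiency(1500):.0%}")       # ~17%
    print(f"one 64 KB A-MPDU per acquisition: {airtime_efficiency(64 * 1024):.0%}")  # ~90%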


> Life would be so much simpler if the MTU had simply grown over time in proportion to link speeds.

The problem is that the world went wireless, so maximum link speeds grew a lot but minimum link speeds are still relatively low. A single 64kB packet tying up a link for multiple milliseconds—unconditionally delaying everything else in the queue by at least that much—is not what we want.


> The problem is that the world went wireless, so maximum link speeds grew a lot but minimum link speeds are still relatively low.

I would argue: the problem is that the MTU isn't negotiated at all, but especially not based on link availability.


IPv6 tries to solve this with path MTU discovery.


Yes, but IPv6 is still at a higher level than Ethernet, Wifi, et al and is therefore subject to the limitations of the lower level framing


Sure, I mean that's what pMTUd is all about. One big difference with IPv6: Routers can't fragment packets. They either send or they don't.


I thought so too, but apparently there is an IPv6 fragmentation extension and it's implemented by several operating systems.


Only the endpoints can fragment.


Sure?

At this point 1500 is the standard, we can’t ever hope to increase it without a way to negotiate the acceptable value across the entire transmission path - that’s what IPv6 gives us.


I'm not sure that negotiating the acceptable value across the entire transmission path is a reasonable thing to do. I'm not sure that IPv6 _should_ be aware of a minimum/maximum MTU of underlying transmission path particularly since that path can often change transparently and each segment is subject to different requirements.


Especially since there are a lot of low latency applications (games, etc.) that take advantage of being able to fit data in a single packet that will not be held up due to other applications sharing the link that might try to stuff larger packets down the link.


802.11 AMPDUs already tie up the link for ~4ms in normal operation. Without this, the medium acquisition overheads kill throughput. But you're correct that a single 64KB packet sent at MCS-0 would take a lot longer than that.

802.11 already includes a fragmentation and reassembly mechanism at the 802.11 level, distinct from any end-to-end IP fragmentation. Unlike IP fragmentation, fragments are retransmitted if lost. So you could use 802.11 fragmentation for large packets sent at slow link speeds to avoid tying up the link for a long time.


Is there a way to set a bigger MTU with wireless, like it is with wired ethernet?


> Life would be so much simpler if the MTU had simply grown over time in proportion to link speeds.

For everything besides real-time, maybe.


The M in MTU stands for maximum, not mandatory.


It ends up being mandatory if you're sharing a non-MIMO link with other systems that are using large packets.


Tell that to NetworkManager


I understand. I have architected networks for over a decade now. The real issue is serialization delay. If I have a tiny voice packet that has to wait to be physically transmitted behind a huge dump truck packet (big), it can still be a problem even on high speed links, particularly with microbursts.


A single gigabit link can send something in the order of 80,000 packets per second. If packets had a 9000 byte MTU, that would still be nearly 14,000 packets per second. Having your smaller packets wait at most an extra ~0.06 milliseconds to be serialised onto a 1 gigabit physical link seems... rather unlikely to be a problem in the real world?
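
A quick sketch of that serialization arithmetic (assuming a plain 1 Gbit/s link and ignoring preamble/IFG overhead):

    LINK_BPS = 1e9    # assumed 1 Gbit/s link

    def serialization_delay_ms(frame_bytes):
        """Time to clock one frame onto the wire, in milliseconds."""
        return frame_bytes * 8 / LINK_BPS * 1e3

    print(serialization_delay_ms(1500))                                  # ~0.012 ms
    print(serialization_delay_ms(9000))                                  # ~0.072 ms
    print(serialization_delay_ms(9000) - serialization_delay_ms(1500))   # ~0.06 ms extra wait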


Software developer talking confidently about electrical engineering issues he knows nothing about. How cute. /s

All Ethernet adapters since the first Alto card had self-clocking data recovery [1].

Clock accuracy was never a problem, as long as it was within the acceptable range required for PLL lock/track loop.

The reason for the 1500 MTU is that for packet-based systems, you don't want infinitely large packets. You want small packets, but large enough that packet overhead is insignificant, which in engineering terms means less than 2%-5% overhead. Thus the 1500-byte max packet size. Everything above that just makes switching and buffering needlessly expensive; SRAM was hella expensive back then. Still is today (in terms of silicon area).

Look at all the memory chips on Xerox Alto's Ethernet board (below) - memory chips were already taking ~50% of the board area!

[1] Schematic of the original Alto Ethernet card clock recovery circuit: https://www.righto.com/2017/11/fixing-ethernet-board-from-vi...

EDIT: Lol! Author has completely replaced erroneous explanation with correct explanation, including link to seminal paper about packet switching. Good.


> How cute. /s

> EDIT: Lol! Author has completely replaced erroneous explanation with correct explanation, including link to seminal paper about packet switching. Good.

Don't be a jerk. Being right doesn't give you the right to make fun of people.


Noted. I find the over-confidence over a completely imagined issue funny and interesting. I make that mistake too. It's always interesting to do a post-mortem: why was I so confident? how did I miss the correct answer? I respect the author for doing such a fast turn around :)


Good to know that your intent wasn't malicious, but fwiw, I also didn't find the tone particularly appropriate for HN, either.


The hardware world is full of this.


I think because in the hardware world the cost for being wrong is so much higher since you can’t push out an over the air update or update your saas or whatever. So if you don’t know you stay quiet instead of being wrong.


> Everything above that just makes switching and buffering needlessly expensive, SRAM was hella expensive back then. Still is today (in terms of silicon area).

Why does a larger MTU make switching more expensive?

And why does it affect buffering? Won't the internal buses and data buffers of networking chips be disconnected from the MTU? Surely they'll be buffering in much smaller chunks, maybe dictated by their SRAM/DRAM technology. Otherwise, when you consider the vast amount of 64B packets, buffering with 1500B granularity would be extremely expensive.


I suggest you read the paper linked by the blog author (“Ethernet: Distributed Packet Switching for Local Computer Networks”), specifically Section 6 (performance and efficiency 6.3). It will answer all your questions.

> Why does a larger MTU make switching more expensive?

Switching requires storage of the entire packet in SRAM.

Larger MTU = More SRAM chips

If existing MTU is already 95% network efficient (see paper), then larger MTU is simply wasted money.


Traditionally it's been true that you need SRAM for the entire packet, which also increases latency since you have to wait for the entire packet to arrive down the wire before retransmitting it. But modern switches are often cut-through to reduce latency and start transmitting as soon as they see enough of the headers to make a decision about where to send it. This also means that they can't checksum the entire packet, which was another nice feature with having it all in memory. So if it detects corruption towards the end of the incoming packet it's too late since the start has already been sent - most switches will typically stamp over the remaining contents and send garbage so it fails a CRC check on the receiver.

Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number. Even the checksum in the TCP header stops being effective with larger frames, so you end up having to do checksums at the application level if you care about end to end data integrity.

https://tools.ietf.org/html/draft-ietf-tcpm-anumita-tcp-stro...


You're describing cut-through switching [1]. Because of its disadvantage, it is usually limited to uses that require pure performance, such as HFT (High Frequency Trading). Traditional store-and-forward switching is still commonly used (or some hybrid approach).

"The advantage of this technique is speed; the disadvantage is that even frames with integrity problems are forwarded. Because of this disadvantage, cut-through switches were limited to specific positions within the network that required pure performance, and typically they were not tasked with performing extended functionality (core)." [2]

[1] https://en.wikipedia.org/wiki/Cut-through_switching

[2] http://www.pearsonitcertification.com/articles/article.aspx?...


Most 10GE+ datacenter switches use cut through switching, not uncommon at all, not just something HFT uses.

Bitflips are very rare in a datacenter environment, typically caused by a bad cable that you can just replace or clean. And the CRC check is done at the receiving system or router anyway.


Cut through switching and single pass processing is extremely common in data center architectures. This is not only for specific use cases - it is necessary to provide the capabilities beyond forwarding while still allowing maximum throughput.


> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.

Hmm. Why is this? It seems if we have a CRC-32 in Ethernet (and most other layer 2 protocols), we'll have a guarantee to reject certain types of defects entirely... But mostly we're relying on the fact that we'll have a 1 in 4B chance of accepting each bad frame. Having a bigger MTU means fewer frames to pass the same data, so it would seem to me we have a lower chance of accepting a bad frame per amount of end-user data passed.

TCP itself has a weak checksum at any length. The real risk is of hosts corrupting the frame between the actual CRCs in the link layer protocols. E.g. you receive frame, NIC sees it is good in its memory, then when DMA'd to bad host memory it is corrupted. TCP's sum is not great protection against this at any frame length.


The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect. If the bit error rate of the medium is constant, then the larger the frame, the more likely that is to occur. Also as Ethernet speeds increase, the underlying BER stays the same (or gets worse) so the chances of encountering errors in a specific time period go up. 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.


Your claim was it related to MTU, which you're now moving away from:

> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.

Now we have a new claim:

> The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect

Yes, that's always the risk. It's not can't detect-- it almost certainly detects it. It's just that it's not guaranteed to detect it.

It has nothing to do with MTU-- Even a 1500 MTU is much larger than the 4 octet error burst a CRC-32 is guaranteed to detect. On the other hand, the errored packet only has a 1 in 4 billion chance of getting through.

> 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.

The question is, what's the errored frame rate? 100G ethernet links have far lower error rates (in CRC-errored packets per second) than the 10baseT networks I administered. I used to see a few errors per day. Now I see a dozen errors on a circuit that's been up for a year (and maybe some of those were when I was plugging it in). 1 in 4 billion of those you're going to let through incorrectly.

Keep in mind faster ethernet has set tougher bit error rate requirements, and the mean time to an undetected packet error is something like the age of the universe if links are delivering the BER in the standard.

(Of course, there's plenty of chance that even those frames that get through cause no actual problem -- even though the TCP checksum is weak, it's still going to catch a big fraction of the remaining frames.)

The bigger issue is that if there's any bad memory, etc, ... there's no L2 CRC protecting it most of the time. And a frame that is garbled by some kind of DMA, bus, RAM, problem while not protected by the L2 CRC has a decent risk of getting past the weak TCP checksum.


It would be nice to corroborate this reason with another source, because my understanding is that clock synchronization was not a factor in determining the MTU, which seems really more like a OSI layer 2/3 consideration.

I am surprised the PLLs could not maintain the correct clocking signal, since the signal encodings for early ethernet were "self-clocking" [1,2,3] (so even if you transmitted all 0s or all 1s, you'd still see plenty of transitions on the wire).

Note that this is different from, for example, the color burst at the beginning of each line in color analog TV transmission [4]. It is also used to "train" a PLL, which is used to demodulate the color signal transmission. After the color burst is over, the PLL has nothing to synchronize to. But the 10base2/5/etc have a carrier throughout the entire transmission.

[1] https://en.wikipedia.org/wiki/Ethernet_physical_layer#Early_...

[2] https://en.wikipedia.org/wiki/10BASE2#Signal_encoding

[3] http://www.aholme.co.uk/Ethernet/EthernetRx.htm

[4] https://en.wikipedia.org/wiki/Colorburst


I also don't believe this is the reason. Early Ethernet physical standards used Manchester encoding to recover the data clock.


I would agree; having worked on an Ethernet chipset back in 1988/9, keeping the PLL synched was not a problem. I can't remember what the maximum packet size we supported was (my guess is 2048), but that was more a matter of buffering to SRAM and needing more space for counters.

The datasheet for the NS8391 has no such requirement for PLL sync.

https://archive.org/details/bitsavers_nationaldaDataCommunic...


As others have said, with Manchester encoding 10BASE2 is self-clocking, you can use the data to keep your PLL locked, just as you would on modern ethernet standards. However I imagine with these standards you may not even have needed an expensive/power-hungry PLL, probably you could just multi-sample at a higher clock rate like a UART did (I don't actually know how this silicon was designed in practice).

Further, PLLs have not gotten a lot better, but rather a lot worse. Maybe back when 10BASE2 was introduced you could train a PLL on 16 transitions and then have acquired lock, but there's no way you can do that anymore (at modern data rates). PCI Express takes thousands of transitions to exit L0s->L0, which is all to allow for PLL lock.

My best guess for the 1500 number is that with a 200 ppm clock difference between the sender and receiver (the maximum allowed by the spec, which says your clock must be +-100 ppm), after 1500 bytes you have slipped 0.3 bytes: (200 × 1e-6) × 1500 = 0.3. You don't want to slip more than half a byte during a packet, as it may result in a duplicated or skipped byte in your system clock domain.


I thought most Ethernet PHYs don't actually lock to the clock, but instead use a FIFO that starts draining once it's halfway full. The size of this FIFO is such that it doesn't under or overflow given the largest frame size and worst case 200 PPM difference.

I figured this is what the interframe gap is for - to allow the FIFO to completely drain.


The IFG is really more to let the receiver know where one stream of bits stops and the next stream of bits starts. How they handle the incoming spray of data is up to them at a queue/implementation level.


The original MTU was 576 bytes, enough for 512 bytes of payload plus 64 bytes for the IP and TCP header with a few options. 1500 bytes is a Berkeleyism, because their TCP was originally Ethernet-only.


Yeah, didn't T1 and ISDN use 576 to limit serialization delay and jitter? The backbone probably switched to 1500 when OC-3 was adopted.


The default MTU for a T1/E1 was usually 1500. The default for HSSI was 4470, which meant the default for DS3 circuits was 4470. This was also the usual default MTU for IP over ATM, which is what most OC-3 circuits would have been using when they were initially rolled out for backbone use. This remained the usual default MTU all the way through OC-192 circuits running Packet over SONET.

I left the ISP backbone and large enterprise WAN field around that time and can't speak to more recent technologies.


IEEE 802 history is disappearing without a trace? Afaik it’s pretty well documented, you just need to be a member for some of the stuff.

http://www.ieee802.org/

I feel like the last piece we’re missing in this story is the performance impact of fragmentation. Like why not just set all new hardware to an MTU of 9000 and wait ten years?


>I feel like the last piece we’re missing in this story is the performance impact of fragmentation. Like why not just set all new hardware to an MTU of 9000 and wait ten years?

Because a node with an MTU of 9000 will very likely be unable to determine the MTU of every link in its path. At best, you'll see fragmentation. At worst, the node's packets will be registered as interface errors when they encounter an interface with an MTU lower than 9k. Neither of those is desirable.


> Like why not just set all new hardware to an MTU of 9000 and wait ten years?

The hardware in question is Ethernet NICs. However, for you to set the MTU on an Ethernet NIC to 9000, every device on the same Ethernet network (at least the same Ethernet VLAN), including all other NICs and switches, including ones which aren't connected yet, must also support and be configured for that MTU. And this also means you cannot use WiFi on that Ethernet network (since, at least last time I looked, WiFi cannot use a MTU that large).


Sending a jumbo frame down a line that has hardware that doesn’t support jumbo frames somewhere along the way does not mean the packet gets dropped. The NIC that would send the jumbo frame fragments the packet down to the lower MTU. So what’s the performance impact of that fragmentation? If it isn’t higher than the difference in bandwidth overhead from headers of 9000 MTU traffic vs. 1500 MTU traffic then why not transition to 9000 MTU?


But how does the NIC know that, 11 hops away, there is a layer 2 device, which cannot communicate with the NIC(switches do not typically have the ability to communicate directly with the devices generating the packets), that only supports a 1500 byte frame?

Now you need Path MTU discovery, which as the article indicates, has its own set of issues. (Overhead from trial and error, ICMP being blocked due to security concerns, etc...)


If you block ICMP you deserve what you get. Don't do this. (Edit: don't block ICMP)


So now you're trying to communicate from your home machine to some random host on the internet (website, VPS, streaming service), and you're configured for MTU 9000, the remote service is also configured for MTU 9000, but some transit provider in the middle is not, and they've disabled ICMP for $reasons.

They blocked ICMP, do you deserve what you get?


Transit providers should push packets and generally do. With PMTU failures it's usually clueless network admins on firewalls nearer endpoints. And no, you don't and I wish the admin responsible could feel your pain.


> Transit providers should

Agreed

> and generally do

Agreed.

Now if you can make it 'will always just push packets', we'll be golden.

Unfortunately, there are enough ATM/MPLS/SONET/etc networks being run by people who no longer understand what they're doing, that we're never going to get there.

To make matters more entertaining, IPv6 depends on icmp6 even more.


Why should it need to? Ethernet is designed to have non-deterministic paths (except in cases of automotive, industrial, and time sensitive networks). If you get to a hop that doesn’t support jumbo frames then break it into smaller frames and send them individually. The higher layers don’t care if the data comes in one frame or ten.


> Sending a jumbo frame down a line that has hardware that doesn’t support jumbo frames somewhere along the way does not mean the packet gets dropped

Almost all IP packets on the internet at large have the 'do not fragment' flag set. IP defragmentation performance ranges from pretty bad to an easy DDoS vector, so a lot of high traffic hosts drop fragments without processing them.

If we had truncation (with a flag) instead of fragmentation, that might have been usable, because the endpoints could determine in-band the max size datagram and communicate it and use that; but that's not what we have.


AFAIK, Ethernet has no support for fragmentation; I've never seen, in the Ethernet standards I've read (though I might have missed it), a field saying "this is a fragment of a larger frame". There's fragmentation in the IP layer, but it needs: (a) that the frame contains an IP packet; (b) that the IP packet can be fragmented (no "don't fragment" on IPv4, or a special header on IPv6); (c) that the sending host knows the receiving host's MTU; (d) that it's not a broadcast or multicast packet (which have no singular "receiving host").

You can have working fragmentation if you have two separate Ethernet segments, one for 1500 and the other for 9000, connected by an IP router; the cost (assuming no broken firewalls blocking the necessary ICMP packets, which sadly is still too common) is that the initial transmission will be resent since most modern IP stacks set the "don't fragment" bit (or don't include the extra header for IPv6 fragmentation).


It does mean packets sent to another local, non-routed, non-jumbo-frame interface would get lost. So you could, for example, maybe talk to the internet, but you couldn't print anything to the printer down the hall.


Fragmentation/reassembly is an L3 concept, and it's not guaranteed to work for large MTUs even where it is supported.


PMTU-D will save their ass in some cases. But it's not safe to assume all routers in the path will respond to ICMP.


It doesn't matter that the routers respond to ICMP, it matters that they generate them, and that they're addressed properly, and that intermediate routers don't drop them.

Some routers will generate the ICMPs, but are rate limited, and the underlying poor configuration means that the rate limits are hit continuously and most connections are effectively in a path MTU blackhole.


>It doesn't matter that the routers respond to ICMP, it matters that they generate them, and that they're addressed properly, and that intermediate routers don't drop them.

>Some routers will generate the ICMPs, but are rate limited, and the underlying poor configuration means that the rate limits are hit continously and most connections are effectively in a path mtu blackhole.

Sure. But I'm not about to sit here and name all the different reasons for folks. And since most here do not have a strong networking background and are running consumer grade routers at home, it seemed the most applicable example.

I could have used a more encompassing term like PMTU-D blackhole, but I didn't.


The worst case I just recently encountered with Jumbo Frames was with NetworkManager trying to follow local DNS server's advertised MTU but when the local interface doesn't support Jumbo Frames it just dies and keeps looping.

Even if you really want devices to use JF, some fail miserably because it's just not well thought out.


"why not just set all new hardware to an MTU of 9000"

Routers can fragment the packets, switches can't. So that would be pretty chaotic for non-techie installed equipment.


Last time I saw a hardware Ethernet switch was 20 years ago. 8-[ ]


There's one in your house probably. It won't frag packets between your wired PC and your wired printer.

There are also certainly a shit load of them in closets and top-of-rack all over where I work.


Plenty of routers today that can't fragment packets. And they have rate limiters where they can only generate a small amount of ICMP 3/4s (maybe 50 a second).


My favorite Ethernet resource is Charles Spurgeon’s Ethernet (IEEE 802.3) Web Site: http://www.ethermanage.com/resources/

It used to have even more stuff, but I think he removed a lot when he got his book published.


The problem seems to be that both the IEEE and the IETF don't want to do anything.

IEEE could define a way to support larger frames. 'Just wait 10 years' doesn't strike me as the best solution, but at least it is a solution. In my opinion a better way would be if all devices reported the max frame length they support. Bridges would just report the minimum over all ports on the same vlan. When there are legacy devices that don't report anything, just stay at 1500.

IETF can also do something today by having hosts probe the effective max frame length. There are drafts but they don't go anywhere because too few people care.


The article talks about how the 1500 byte MTU came about but doesn't mention that the problem of clock recovery was solved by using 4b/5b or 8b/10b encoding when sending Ethernet through twisted-pair wiring. This encoding technique also provides a neutral voltage bias.

EDIT: As pointed out below, I failed to account for the clock-rate being 25% faster than the bit-rate in my original assertion that Ethernet over twisted-pair was only 80% efficient due to the encoding (see below)


> Ethernet through twisted-pair wiring only provides 80% of the listed bit-rate

Actually, they already accommodated for this in the advertised speed.

In other words, a 1 GbE SerDes runs at 1.250 Gbit/s, so you end up with an actual 1 Gbit/s bandwidth.

The reason you don't actually hit 1 Gbit/s in practice is due to other overheads such as the interframe gaps, preambles, FCS, etc.


You're absolutely correct ... it's been a long time since I was designing fiber transceivers but I should have remembered this. Ultimately efficiency is also affected by other layers of the protocol stack too (UDP versus TCP headers) which also explains why larger frames can be more efficient. In the early days of RTP and RTSP, there were many discussions about frame size, how it affected contention and prioritization and whether it actually helped to have super-frames if the intermediate networks were splitting and combining the frames anyway.


> The reason you don't actually hit 1 Gbit/s in practice is due to other overheads such as the interframe gaps, preambles, FCS, etc.

Actually Gigabit Ethernet is highly efficient; it can actually give you 98-99 % of line rate as the payload rate.


There has to be something else going on here because I routinely achieve > 800mbps on my gigabit network over copper.


I'm not a hardware engineer, but from some quick research it appears that 100 megabit ethernet ("fast ethernet") transmits at effectively 125 MHz. So the 100 megabit number describes the usable bit rate, not the electrical pulses on the wire.

Gigabit Ethernet is more complicated, and it uses multiple voltages and all four pairs of wires bidirectionally. So it is not just a single serial stream of on/off.


Yes, the wire symbol rates are higher. For instance, 100mbit Ethernet has a 125 million symbols per second wire rate.


I used to glue stuff together to FDDI rings and Token Ring networks back in the day (I used Xylan switches, which had ATM-25 UTP line cards among other long-forgotten oddities), and MTU sizes always struck me as being particularly arbitrary.

But I'm not really sure about the clock sync limitations being a factor here. It was way back in the deepest past.

What I do remember vividly is the mess that physical layer networking evolved into over the years thanks to dial-up and DSL (ever had to set your MTU to 1492 to accommodate an extra PPP header?).

And something is obviously wrong today, since we're still using the same baseline value for our gigabit fiber to the home connections, our 3/4/5G (scratch to taste) mobile phones, etc.


PPPoE was such an ugly mess. In its early days there were various server-side OSes (IRIX and one other I can't remember) that had piss-poor TCP/IP MTU implementations. The result was that you ran into random websites that just took forever to load, as the data only eventually got through in packets that happened not to fill the full 1492-byte limit. By around 2003 I rarely encountered it anymore, but then I moved to a place with cable and never had to deal with it again.


Oh, and throw IPsec and a bunch of other protocols into the mix- networking is often a fragile beast:

https://www.networkworld.com/article/2224654/mtu-size-issues...


> ever had to set your MTU to 1492 to accommodate an extra PPP header?

I had to replace my Apple AirPort Extreme when I got gigabit fiber since it didn't have a manual MTU setting and it didn't autodetect the MTU properly over PPPoE... In 2020 I still need to manually set the MTU on my Ubiquiti USG...


> ever had to set your MTU to 1492 to accommodate an extra PPP header?

Ah, I was always wondering why my ISP configured my fiber modem's mtu to 1492. So it's due to using PPPoE? Is there no way to use bigger mtu when using PPPoE?


Nowadays, there's PPPoA (over ATM) which wraps at a lower level, and allows 1500 byte ethernet payloads through. But running the ethernet over ATM at 1508 MTU so that PPPoE would be 1500 was probably out of reach --- when PPPoE was introduced, the customer endpoint was often the customer PC, and some of those were using fairly old nics that might not have supported larger packets.

Sadly, smaller than 1500 byte MTUs still cause issues for some people to this day. It's all fine if everything is properly configured, or if at least everything sends and receives ICMP, but if something is silently dropping packets, you're in for a bad day. These days, I think it's usually problems with customers sending large packets, as opposed to early days where receiving large packets would routinely fail, but a lot of that is because large sites gave up on sending large packets.


Yes, PPPoA was also a thing I dealt with, and another source of irritating MTU issues.


RFC 4638 provides a mechanism for this. It relies on the Ethernet devices supporting a ‘baby jumbo’ MTU of 1508 bytes and support for it is still a bit scarce.

https://tools.ietf.org/html/rfc4638


When networks were new, computers connected to each other using a shared trunk that you physically drilled into. It's a non-trivial problem to send data over a shared channel; it's very easy for two systems to clobber each other. A primitive, but somewhat effective mechanism is ALOHA (https://en.wikipedia.org/wiki/ALOHAnet), where multiple senders randomly try to send their message to a single receiver. The single receiver then repeats back any messages it successfully receives. In that way the sender is able to confirm its message got through -- an ack. After a certain amount of time with no ack, senders repeat their messages. As you can imagine, shorter packets are less likely to cause collisions.

Ethernet uses something similar, but is able to detect if someone else is using the wire, called carrier sense. A relatively short maximum packet size of 1500 bytes reduced the likelihood of collisions.


Does multiplexing over Ethernet exist?


Not anymore for all practical purposes, but it once did for the very old 10Base-2 standard for Ethernet over coaxial cable. This is practically why the old MII Ethernet PHY interface protocol had the collision-sense lines to indicate to the MAC to stop sending data if it detects incoming data, in attempts to minimize collisions.

https://en.wikipedia.org/wiki/10BASE2


This is very cool history, and something I never would have stumbled upon myself. Thank you for sharing! :-)


Unreliable IP fragmentation, and the brokenness of Path MTU Discovery (PMTUD), are causing the DNS folks to put a clamp on (E)DNS message sizes:

* https://dnsflagday.net/2020/


I remembered something different related to the shared medium and CSMA/CD, where 1500 ensured fairness, and the minimum of 46 related to propagation time over the longest allowable cable.

More at:

https://networkengineering.stackexchange.com/questions/2962/...


Minor factoid the article does not mention. ATM is an alternative to Ethernet that's used in many optical fiber environments. The "transfer unit" size of the ATM "cell" is 53 bytes (5 for the header and 48 for the payload). This is much smaller than 1500.

Another quirky story from the past: Sometime around 20 years ago I was having a bizarre networking problem. I could telnet into a host with no trouble, and the interactive session would be going just fine until I did something that produced a large volume of output (such as 'cat' on a large file). At that point the session would freeze and I would eventually get disconnected. After troubleshooting for a while I identified the problem as one of the Ethernet NICs on the client host. It was a premium NIC (3Com 3C509). Nonetheless, the NIC crystal oscillator frequency had drifted sufficiently that it would lose clock synchronization to the incoming frame if the MTU was larger than about 1000.


Speaking about ATM: the 48 byte payload was a standardization compromise between Europe and the US.

US companies had prototypes using 64 bytes, while European companies used 32 bytes. To avoid giving anyone a competitive advantage, they decided on a middle ground of 48.

There were trade-offs between 32 and 64 bytes as well: a 32 byte payload had a higher overhead than a 64 byte payload, but it had a shorter transmission time which made it easier to do voice echo cancellation.

Or so I was told many decades ago when I got introduced to ATM systems...


The picture in the article looks like a 3c509 - three media types and 100Mbs-1. Cor! I have loads of them somewhere. Plus a fair few 905 and 595s.


ATM is mostly dead. The only places it exists now is legacy deployments. Everyone has been deploying MPLS/IP instead of ATM for the past 15-20 years.


I think the author may have made a mistake in some of the math. The frame size distribution plots are likely based on the number of frames, not the amount of data contained in said frames. The 1500 byte and other large frames should therefore account for the lion's share of the actual data transferred. Correcting this error will totally change the final two graphs.


Yes. But only the "AMS-IX traffic by packet size range" graph is wildly inaccurate. Ethernet frame overhead is per-packet and presumably right.


Ah yeah, that's probably true. According to some back of the envelope math, it seems like the distribution should be more like 5%, 1%, 1%, 3%, 50%, 39%, ignoring the first and last size bins.


I find that technology cements in strata (the archaeology term) just as the layers that accumulate as the result of natural processes and human activity. The dynamics are not exactly the same but the tendency is similar. I wonder whether we'll always be capable of digging down deeper to the beginnings as things get more and more complicated.


I don't think the Ethernet Frame Overhead graph is correct. Surely the overhead is proportionally higher, per amount of data, for smaller packets. That graph shows that the overhead is just proportional to the amount of data sent, irrespective of the packet size, which can't be right.


The graph above that one is totally wrong. The frame overhead graph may be correct, though.


Kind of expected an article titled 'How 1500 bytes became the MTU of the internet' to tell us how 1500 bytes became the MTU of the internet.

Even I could have told you, 'the engineers at the time picked 1500 bytes'.


IIRC it was called “thinnet” (10B2). I loved the vampire taps on thick net.


Not my field, so I might be making an obvious error here, but:

If there are efficiency gains to be had from using jumbo frames, wouldn't setting my MTU to a multiple of 1500 still be of some benefit? If my PC, my switch and my router all support it, that would still be a tiiiny improvement. If the server's network does as well and let's say both of our direct providers, even if none of the exchanges or backbones in between do, that would still be an efficiency gain for ~10% of the link, right?


Locally you can set your MTU to larger than 1500, but if you (generally) try and send a packet towards the internet larger than 1500 it will be dropped without a trace, or it will be dropped and an ICMP message will be generated to tell your system to lower the MTU. Assuming you have not firewalled off ICMP ;)

As a handy feature on Linux at least, you can set your MTU to 9000 locally, and then set the default (internet generally) route to have a MTU of 1500 to prevent issues:

ip route add 0.0.0.0/0 via 10.11.11.1 mtu 1500


Huh, I wasn't aware that dropping large frames was so commonplace. I guess they don't want to use the CPU cycles to fragment them?


Over-sized packets can (and generally will) be fragmented by your router. It shouldn't be dropped unless you've intentionally set DNF.


Fragments are very hit or miss on the internet, https://blog.cloudflare.com/ip-fragmentation-is-broken/


AFAIK, most OSes today set DNF by default.


They will fragment them but many times you will see performance or other misc issues ... eventually.


Oh I never knew this. I wonder if I could enable Jumbo Frames to stream 4k content more efficiently on my local LAN.


You could...if the software on both your server and your media pc support large frames. And you're willing to deal with every once in a while some piece of software doing the wrong thing and sending out every packet with large MTU without doing detection on max packet size.


Potentially but troubleshooting performance issues from mismatched MTU can be brutal so most providers drop anything over 1500.

Many devices can do over 1500 but anyone who has done so without careful consideration knows the outcome isn't predictable unless everyone on the network is prepared to do so.

A dedicated / controlled SAN type environment can do it just fine, beyond that it can be difficult.


Off-topic but looking at that old network card picture reminded me of a very vague memory of more than one card with a component that looked like a capacitor, except it looked cracked.

Is my mind playing tricks? Were they faulty units or was there meant to be a crack?

This picture could be the same thing:

https://www.vogonswiki.com/images/3/37/Viglen_Ethergen_PnP_2...


That looks like a capacitor with a built-in spark gap. I've seen them in CRT circuitry. They're probably using it in the network card so if there's a huge over-voltage (e.g. lightning somewhere), it will jump across the spark gap, limiting damage.


Perfect! Thank you for solving something that has been lurking unresolved at the back of my mind for 20 years!

More info: https://electronics.stackexchange.com/questions/381682/what-...


Old network card eh? It's a 3Com 3C509:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

I still have a load of them gathering dust somewhere. However a system with ISA on it is a bit rare now and I'm not sure I can be bothered to compile a modern kernel small enough to boot on one. Besides, it will probably need cross compiling on something with some grunt that has heard of the G unit prefix.


For Old network card try 3C500, or something made by Excelan ;-)

http://www.os2museum.com/wp/emulating-etherlink/


So they just picked an arbitrary number that felt right? I expected the story to be more interesting than that, given the title. Still, there was some interesting trivia surrounding the core question.

Reminds me of the IPv6 adoption problem: https://news.ycombinator.com/item?id=14986324


> If we look at data from a major internet traffic exchange point (AMS-IX), we see that at least 20% of packets transiting the exchange are the maximum size.

He’s so optimistic. My brain heard this as “only 20% of packets […] are the maximum size”

What are all of those 64 byte packets? Interactive shells, or some other low bitrate protocol?


Note that maximum size is defined as >=1514, with 50% of packets being >= 1024. It very well may be that ~45% of packets are >= 1400 bytes.

The transfer graph is wrong - it shows packet count distribution, not size. Quick math says roughly 90% of the transferred bytes are in >= 1024 byte packets.


Probably mostly ACKs.


Well now I feel dumb.


I've always wondered how 9000 became "jumbo". Technically anything over 1500 is considered jumbo and there is no standard. The largest I've seen is 16k. I think there are some CRC accuracy concerns at larger sizes but 9k still seems quite arbitrary for computer land.


The explanation according to https://web.archive.org/web/20010221204734/http://sd.wareone... is: "First because ethernet uses a 32 bit CRC that loses its effectiveness above about 12000 bytes. And secondly, 9000 was large enough to carry an 8 KB application datagram (e.g. NFS) plus packet header overhead."

That is, 9000 is the first multiple of 1500 which can carry an 8192-byte NFS packet (plus headers), while still being small enough that the Ethernet CRC has a good probability to detect errors.
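
A quick check of that rationale; the header budget below (RPC/NFS framing plus UDP and IPv4 headers) is an assumed round figure, not an exact one:

    nfs_payload = 8 * 1024                  # an 8192-byte NFS read/write
    headers = 120 + 8 + 20                  # assumed RPC/NFS framing + UDP + IPv4 headers
    needed = nfs_payload + headers          # ~8340 bytes

    print(next(m for m in (1500, 3000, 4500, 6000, 7500, 9000) if m >= needed))   # 9000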


Ethernet frame size was never strictly limited. The way the packet length works with Ethernet II frames (802.3 is more explicit, but never really caught on) is that the hardware needs to read all the way to the end of the packet and detect a valid CRC and a gap at the end before it knows the thing is done. So there's no reason beyond buffer size to put a fixed limit on it, and different hardware had different SRAM configuration.

Wikipedia has this link showing that 9000 bytes was picked by one site c. 2003 simply because it was generally well-supported by their existing hardware: https://noc.net.internet2.edu/i2network/jumbo-frames/rrsum-a...


This reminds me of various Windows applications back in the day (Windows 3.1 and 95) that claimed to fine tune your connection and one of the tricks they used was changing the MTU setting, as far as I can recall. Could anyone share how that worked?


If your computer sends packets larger than the MTU the next device upstream can handle, the packets will be fragmented, leading to increased CPU usage, increased work by the driver, higher I/O on the network interface, higher CPU load on your router or modem, etc., depending on where the bottleneck is. For example, if you connect over Ethernet to a DSL modem, or to a router that has a DSL uplink, all your full-size packets will be fragmented. This is because PPPoE encapsulation on the DSL link uses 8 bytes per packet, leaving only 1492 bytes for IP. So if you send a 1500 byte packet to the modem, it will get broken up into 2 packets: one carrying 1492 bytes of the original packet, and a tiny 28-byte fragment (a fresh 20-byte IP header plus the 8 leftover bytes of payload), each with its own 8 bytes of PPPoE overhead.

But your PC is still sending more packets.. the modem is struggling to fragment them all and send them upstream.. its memory buffer is filling up.. your computer is retrying packets that it never got a response on..

By lowering your computer's MTU to 1492 to start with, you avoid the extra work by the modem, which can give a considerable speed increase.
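
A quick sketch of that arithmetic (assuming plain IPv4 with 20-byte headers and no options; fragment payloads must be multiples of 8 bytes):

    PPPOE_OVERHEAD = 8                                 # PPPoE + PPP headers
    IP_HEADER = 20                                     # plain IPv4 header

    inner_mtu = 1500 - PPPOE_OVERHEAD                  # 1492 bytes of IP per PPPoE frame
    payload = 1500 - IP_HEADER                         # 1480 data bytes in the oversized packet

    first_payload = (inner_mtu - IP_HEADER) // 8 * 8   # 1472 (fragment offsets are 8-byte units)
    second_payload = payload - first_payload           # the 8 leftover bytes
    print(IP_HEADER + first_payload, IP_HEADER + second_payload)   # 1492 28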


Probably a dumb question, but why is the maximum size in the AMS-IX graph (and the bucket with the most packets) 1514 bytes instead of the 1500 bytes discussed in the article?


1500 bytes is the MTU of IP, in most cases. It often excludes the Ethernet header, which is 14 bytes excluding the FCS, preamble, IFG, and any VLANs.

If we have a 1500 byte MTU for IP, then we need at least a 1514 byte MTU for IP + Ethernet. We often call the > 1514B MTU the "interface MTU". It's unnecessarily confusing.


Actually, MTUs below 1500 bytes are pretty common, e.g. with PPP over Ethernet or other forms of encapsulation/tunneling.


I think you're saying that the smallest bucket of packets are all packets that would have been combined with a larger packet if that had been an option... but that doesn't make sense. That class of packets includes TCP SYN, ACK, RST, and 128 bytes could fit an entire instant message on many protocols.


Looks like a ripe low-hanging fruit for SpaceX Starlink to pick..


Why the downvote? Possibly facilitating the end-to-end transport would allow them to offer jumbo packets.


This would only be possible if you were talking from a jumbo-configured client (let's say you've set up your laptop correctly), across a jumbo-configured network (Starlink, in your scenario), to a jumbo-configured server (here's the problem).

The problem is that Starlink only controls the steps from your router to "the internet". If you're trying to talk to spacex.com it'd be possible, but if you're trying to talk to google.com then now you need Starlink to be peering with ISPs that have jumbo frames, and they need to peer with ISPs with jumbo frames, etc etc and then also google's servers need to support jumbo frames.

Basically, the problem is that Starlink is not actually end to end, if you're trying to reach arbitrary servers on the internet. It just connects you to the rest of the internet, and you're back to where you started.

This is also true for any other ISP, Starlink is not special in this regard.


True, you'd expect endpoints to support Jumbo Frames as well, but why not at least start making it possible? It's a chicken-and-egg problem otherwise. IPv6 was the same at the start.


Well, depending on the quality of the connections, a re-transmit of a jumbo frame could mean having to re-transmit a lot more data.

But since the local network and the end network where the servers are located will almost certainly be 1500, the point is almost all but moot.


Most connections will not be peer-to-peer over Starlink, so you need to deal with the least common denominator.


Because you don't know what you're talking about and are engaging in "what if"-isms? There is no business case to solve with jumbos frames over the Internet. I've been in this business for 20 years. Seen this argument a dozen times. It never changes.


Nah, it’s 1492 forever!


Found the ADSL user.



