For.. reasons, I found myself having to make a 'driver' for a PoE+ sensing device this month. The manufacturer had an SDK, but compiling it required an old version of Visual Studio plus a bouquet of dependencies, and it had no OSX support. None of the bundled applications would do what I needed (namely, let me forward the raw sensing data to another application.. SOMEHOW).
The data isn't encoded in the usual ways, so even 4 hours of begging FFMPEG were to no avail.
A few glances at wireshark payloads, the roughly translated documentation, and weighing my options, I embarked on a harrowing journey to synthesize the correct incantation of bytes to get the device to give me what I needed.
I've never worked with RTP/RTSP prior to this -- and I was disheartened to see nodejs didn't have any nice libraries for them. Oh well, it's just udp when it comes down to it, right?
SO MY NAIVETE BEGOT A JOURNEY INTO THE DARKNESS. Being a bit of an unknown-unknown, this project did _not_ budget time for the effort this relatively impromptu initiative required. Out of sentimentality for the customer, and perhaps delusions of grandeur, I convinced myself I could just crunch it out in a few days.
A blur of coffee and 7 days straight crunch later, I built a daisy chain of crazy that achieved the goal I set out for. I read rfc3550 so many times I nearly have it committed to memory. The final task was to figure out how to forward the stream I had ensorcelled to another application. UDP seemed like the "right" choice, if I could preserve the heavy lifting I had accomplished to reassemble the frames of data.. MTU sizes are not big enough to accommodate this (hence probably why the device uses RTP, LOL.). OSX supports some hilariously massive MTUs (It's been a few days, but I want to say something like 13,000 bytes?) Still, I'd have to chunk each frame into quarters and reassemble it on the other end. Having to write _additional_ client logic to handle drops and OOO, and relying on OSX's embiggened MTUs when I wanted this to be relatively OS independent... and the SHIP OR DIE pressure from above made me do bad. At this point, I was so crunched out that the idea of writing reconnect logic and doing it with TCP was painful, so I'm here to confess... I did bad...
The client application spawns a webserver, and the clients poll via HTTP at about 30 Hz. Ahhh it's gross...
I'm basically adrift on a misery raft of my own manufacture. Maybe protobufs would be better? I've slept enough nights to take a melon baller to the bad parts..
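For the curious, the shape of the sin is roughly this. A minimal sketch, not the real code; the endpoint, port, and names are invented, and it assumes Node/TypeScript with @types/node:

    // The RTP side calls publishFrame() whenever a full frame has been reassembled;
    // clients poll GET /frame at ~30 Hz and grab whatever is newest.
    import * as http from "http";

    let latestFrame: Buffer | null = null;
    let frameSeq = 0; // lets a poller notice when two polls returned the same frame

    export function publishFrame(frame: Buffer): void {
      latestFrame = frame;
      frameSeq += 1;
    }

    http.createServer((req, res) => {
      if (req.url === "/frame" && latestFrame) {
        res.writeHead(200, {
          "Content-Type": "application/octet-stream",
          "X-Frame-Seq": String(frameSeq),
        });
        res.end(latestFrame);
      } else {
        res.writeHead(204); // nothing to serve yet
        res.end();
      }
    }).listen(8080);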
The wiki page talks about getting 5% more data through at full saturation but it doesn’t mention an important detail that I recall from when it was proposed.
It turned out with gigabit Ethernet or higher that a single TCP connection cannot saturate the channel with an MTU of 1500 bytes. The bandwidth went up but the latency did not go down, and ACKs don’t arrive fast enough to keep the sender from getting throttled by the TCP windowing algorithm.
If I have a typical network with a bunch of machines on it nattering at each other, that might not sound so bad. But when I really just need to get one big file or stream from one machine to another, it becomes a problem.
So they settled on a multiple of 1500 bytes to avoid uneven packet fragmentation (if you get half packets every nth packet you lose that much throughput). Somehow that multiple became 6.
And then other people wanted bigger or smaller and I’m not quite sure how OS X ended up with 13000. You’re gonna get 8x1500 + 1000 there. Or worse, 9000 + 4000.
In college I only had one group project, which scandalized me but apparently lots of others found this normal. We had to fire UDP packets over the network and feed them to an MJPeG card. You got more points based on the quality of the video stream.
My very industrious teammate did 75% of the work (4 man team, I did 20%, if you are generous with the value of debugging). One of the things we/he tried was to just drop packets that arrived out of order rather than reorder them. Turned out the reordering logic was reducing framerates. So he ran some trials and looked at OOO traffic, and across the three or so routers between source and sink he never observed a single packet arriving out of order. So we just dropped them instead and got ourselves a few more frames per second.
I can drop a frame, I can't casually drop misordered packets. It takes many packets to build a frame. I have to reorder the packets within a frame (actually I just insert-in-order). If I drop packets, I get data scrolling like a busted CRT raster.
I'm using a KoalaBarrel. Koalas receive envelopes full of eucalyptus leaves. Koalas have to eat their envelopes in order. First koala to get his full subscription becomes fat enough to crush all the koalas beneath him. Keep adding koalas. Disregard letters addressed to dead koalas.
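In less marsupial terms, here's a rough sketch of the insert-in-order idea. This is not the actual KoalaBarrel code; the fields are the usual RTP ones but the class and method names are invented, and sequence-number wraparound is ignored for brevity:

    interface RtpPacket {
      seq: number;      // RTP sequence number (the envelope's address)
      marker: boolean;  // set on the last packet of a frame (the full subscription)
      payload: Buffer;
    }

    class FrameAssembler {
      private packets: RtpPacket[] = [];

      // Call once per received packet; returns a complete frame when one is ready.
      push(pkt: RtpPacket): Buffer | null {
        // Insert by sequence number so out-of-order arrivals land in the right slot.
        let i = this.packets.length;
        while (i > 0 && this.packets[i - 1].seq > pkt.seq) i--;
        this.packets.splice(i, 0, pkt);

        // Emit only when the frame-ending packet is present and there are no gaps.
        const last = this.packets[this.packets.length - 1];
        if (last.marker && this.isContiguous()) {
          const frame = Buffer.concat(this.packets.map(p => p.payload));
          this.packets = [];
          return frame;
        }
        return null;
      }

      private isContiguous(): boolean {
        for (let i = 1; i < this.packets.length; i++) {
          if (this.packets[i].seq !== this.packets[i - 1].seq + 1) return false;
        }
        return true;
      }
    }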
Not saying this is the ideal solution, but you could just drop any frame that contains any out of order packets. If an out of order packet arrives, just drop the current frame and start ignoring packets until you find the start of another frame.
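Something like this, as a rough sketch with invented field names (RTP-style `seq` incrementing per packet, `marker` flagging the last packet of a frame; wraparound ignored):

    class DropOnReorder {
      private chunks: Buffer[] = [];
      private expected: number | null = null;
      private discarding = false;

      push(pkt: { seq: number; marker: boolean; payload: Buffer }): Buffer | null {
        if (this.expected !== null && pkt.seq !== this.expected) {
          // Gap or reordering: junk the frame in progress and ignore packets
          // until the next frame boundary.
          this.chunks = [];
          this.discarding = true;
        }
        this.expected = pkt.seq + 1;

        if (!this.discarding) this.chunks.push(pkt.payload);

        if (pkt.marker) {
          const frame = this.discarding ? null : Buffer.concat(this.chunks);
          this.chunks = [];
          this.discarding = false; // next packet starts a fresh frame
          return frame;
        }
        return null;
        // (A lost marker packet would silently merge two frames; a real
        // implementation would also check the RTP timestamp, which changes per frame.)
      }
    }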
This is very dependent on the frame in question; I-frames are much more valuable than P/B frames. If you get unlucky with dropped frames you can end up showing a lot of distorted nonsense to the end user.
The word was created as a joke in a Simpsons episode, a word used in Springfield only. It is described as "perfectly cromulent" by a Springfielder, which is evidently meant to mean "acceptable" or "ordinary" but is another Springfieldism.
The joke may be lost on future generations who don't realise they're not normal words.
Actually, "embiggened" is an actual word, though archaic, it's been around for over 130 years. The coinage of "cromulant" to describe it as such was the joke there, not "embiggen" itself.
The show writers thought they came up with the word on their own; they didn't know about the previous usage of the word in 1884 (the episode was written in 1996, and the internet wasn't quite as full of facts back then), so "embiggen" was still supposed to be a joke.
I was curious about that too. Lots of references to video-related standards imply it's a PoE camera, but then why isn't the data encoded in the usual ways? What does that mean?
What codec would you use for a camera that captures not RGB, but poetry of the soul?
CONTEXTLESS, HEADERLESS, ENDLESS BYTE STREAMS OF COURSE, where the literal, idealized (remember udp) position of each byte is part of a vector in a non euclidean coordinate system.
I appreciate the interest in listening to a simulcast of Harvard-Professor-Collaborates-With-A-Nobody-Hobo. I'll forward this to my agent.
My agent is a tin can. I think she used to hold beans. Sometimes I put a few smashed nickels in her and rattle. While I do this, I pretend she's reading me my messages, and I'm like "oh no, I would never consent to a biopic directed by THAT charlatan." and then we laugh and laugh.
For 802.11, the biggest overhead is not packet headers but the randomized medium acquisition time used to minimize collisions. 1500 bytes is way too small here with modern 802.11, so if you only send one packet for each medium acquisition, you end up with something upwards of 90% overhead. The solution 802.11n and later use here is Aggregate MPDUs (AMPDUs). For each medium acquisition, the sender can send multiple packets in a contiguous burst, up to 64 KBytes. This ends up adding a lot of mechanism, including a sliding-window block ack, and it impacts queuing disciplines, rate adaptation and pretty much everything else. Life would be so much simpler if the MTU had simply grown over time in proportion to link speeds.
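Very rough numbers to illustrate: at an 802.11ac PHY rate of ~866 Mbit/s, 1500 bytes is only about 14 µs of airtime, while contention backoff, preamble, SIFS and the block ack add up to something on the order of 150-200 µs per medium acquisition. One MPDU per acquisition is therefore well under 10% efficiency; packing ~40 of them into a 64 KB AMPDU amortizes that fixed cost.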
> Life would be so much simpler if the MTU had simply grown over time in proportion to link speeds.
The problem is that the world went wireless, so maximum link speeds grew a lot but minimum link speeds are still relatively low. A single 64kB packet tying up a link for multiple milliseconds—unconditionally delaying everything else in the queue by at least that much—is not what we want.
At this point 1500 is the standard, we can’t ever hope to increase it without a way to negotiate the acceptable value across the entire transmission path - that’s what IPv6 gives us.
I'm not sure that negotiating the acceptable value across the entire transmission path is a reasonable thing to do. I'm not sure that IPv6 _should_ be aware of a minimum/maximum MTU of underlying transmission path particularly since that path can often change transparently and each segment is subject to different requirements.
Especially since there are a lot of low latency applications (games, etc.) that take advantage of being able to fit data in a single packet that will not be held up due to other applications sharing the link that might try to stuff larger packets down the link.
802.11 AMPDUs already tie up the link for ~4ms in normal operation. Without this, the medium acquisition overheads kill throughput. But you're correct that a single 64KB packet sent at MCS-0 would take a lot longer than that.
802.11 already includes a fragmentation and reassembly mechanism at the 802.11 level, distinct from any end-to-end IP fragmentation. Unlike IP fragmentation, fragments are retransmitted if lost. So you could use 802.11 fragmentation for large packets sent at slow link speeds to avoid tying up the link for a long time.
I understand. I have architected networks for over a decade now. The real issue is serialization delay. If I have a tiny voice packet that has to wait to be physically transmitted behind a huge dump truck packet (big), it can still be a problem even with high speed links with regards to microbursts.
A single gigabit link can send something on the order of 80,000 full-size packets per second. With a 9000 byte MTU that would still be roughly 13,800 packets per second. Having your smaller packets wait at most an extra ~0.06 milliseconds (a 9000-byte frame takes about 72 µs to serialise at 1 Gbit/s, versus about 12 µs for 1500 bytes) seems... rather unlikely to be a problem in the real world?
Software developer talking confidently about electrical engineering issues he knows nothing about. How cute. /s
All Ethernet adapters since the first Alto card had self-clocking data recovery [1].
Clock accuracy was never a problem, as long as it was within the acceptable range required for the PLL lock/track loop.
The reason for the 1500 MTU is that for packet-based systems, you don't want infinitely large packets. You want small packets, but large enough that packet overhead is insignificant, which in engineering terms means less than 2%-5% overhead. Thus the 1500-byte max packet size. Everything above that just makes switching and buffering needlessly expensive; SRAM was hella expensive back then. Still is today (in terms of silicon area).
Look at all the memory chips on Xerox Alto's Ethernet board (below) - memory chips were already taking ~50% of the board area!
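For concreteness (counting standard Ethernet II framing plus preamble and interframe gap): a full-size frame is 1500 bytes of payload plus roughly 38 bytes of overhead (14 header + 4 FCS + 8 preamble + 12 IFG), i.e. about 2.5% overhead per 1538 bytes on the wire. Cap the payload at 150 bytes instead and the same 38 bytes gets charged against every 188, roughly 20% overhead, so the max size can't be too small either.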
EDIT: Lol! Author has completely replaced erroneous explanation with correct explanation, including link to seminal paper about packet switching. Good.
> EDIT: Lol! Author has completely replaced erroneous explanation with correct explanation, including link to seminal paper about packet switching. Good.
Don't be a jerk. Being right doesn't give you the right to make fun of people.
Noted. I find the over-confidence over a completely imagined issue funny and interesting. I make that mistake too. It's always interesting to do a post-mortem: why was I so confident? how did I miss the correct answer? I respect the author for doing such a fast turn around :)
I think because in the hardware world the cost for being wrong is so much higher since you can’t push out an over the air update or update your saas or whatever. So if you don’t know you stay quiet instead of being wrong.
> Everything above that just makes switching and buffering needlessly expensive, SRAM was hella expensive back then. Still is today (in terms of silicon area).
Why does a larger MTU make switching more expensive?
And why does it affect buffering? Won't the internal buses and data buffers of networking chips be decoupled from the MTU? Surely they'll be buffering in much smaller chunks, maybe dictated by their SRAM/DRAM technology. Otherwise, when you consider the vast number of 64B packets, buffering with 1500B granularity would be extremely expensive.
I suggest you read the paper linked by the blog author (“Ethernet: Distributed Packet Switching for Local Computer Networks”), specifically Section 6 (performance and efficiency 6.3). It will answer all your questions.
> Why does a larger MTU make switching more expensive?
Switching requires storage of the entire packet in SRAM.
Larger MTU = More SRAM chips
If existing MTU is already 95% network efficient (see paper), then larger MTU is simply wasted money.
Traditionally it's been true that you need SRAM for the entire packet, which also increases latency since you have to wait for the entire packet to arrive down the wire before retransmitting it. But modern switches are often cut-through to reduce latency and start transmitting as soon as they see enough of the headers to make a decision about where to send it. This also means that they can't checksum the entire packet, which was another nice feature with having it all in memory. So if it detects corruption towards the end of the incoming packet it's too late since the start has already been sent - most switches will typically stamp over the remaining contents and send garbage so it fails a CRC check on the receiver.
Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number. Even the checksum in the TCP header stops being effective with larger frames, so you end up having to do checksums at the application level if you care about end to end data integrity.
You're describing cut-through switching [1]. Because of its disadvantage, it is usually limited to uses that require pure performance, such as HFT (High Frequency Trading). Traditional store-and-forward switching is still commonly used (or some hybrid approach).
"The advantage of this technique is speed; the disadvantage is that even frames with integrity problems are forwarded. Because of this disadvantage, cut-through switches were limited to specific positions within the network that required pure performance, and typically they were not tasked with performing extended functionality (core)." [2]
Most 10GE+ datacenter switches use cut through switching, not uncommon at all, not just something HFT uses.
Bitflips are very rare in a datacenter environment, typically caused by a bad cable that you can just replace or clean. And the CRC check is done at the receiving system or router anyway.
Cut through switching and single pass processing is extremely common in data center architectures. This is not only for specific use cases - it is necessary to provide the capabilities beyond forwarding while still allowing maximum throughput.
> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.
Hmm. Why is this? It seems if we have a CRC-32 in Ethernet (and most other layer 2 protocols), we'll have a guarantee to reject certain types of defects entirely... But mostly we're relying on the fact that we'll have a 1 in 4B chance of accepting each bad frame. Having a bigger MTU means fewer frames to pass the same data, so it would seem to me we have a lower chance of accepting a bad frame per amount of end-user data passed.
TCP itself has a weak checksum at any length. The real risk is of hosts corrupting the frame between the actual CRCs in the link layer protocols. E.g. you receive frame, NIC sees it is good in its memory, then when DMA'd to bad host memory it is corrupted. TCP's sum is not great protection against this at any frame length.
The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect. If the bit error rate of the medium is constant, then the larger the frame, the more likely that is to occur. Also as Ethernet speeds increase, the underlying BER stays the same (or gets worse) so the chances of encountering errors in a specific time period go up. 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.
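To put rough numbers on it (assuming the usual 10^-12 BER objective in the Ethernet specs): a saturated 100G link pushes 10^11 bits every second, so even a link that just meets spec sees a bit error every ten seconds or so, whereas a saturated 10 Mbit/s link at the same BER would see one roughly once a day.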
Your claim was it related to MTU, which you're now moving away from:
> Which raises another point in relation to the 1500 MTU - all of the CRC checks in various protocols were designed around that number.
Now we have a new claim:
> The risk is that multiple bits in the same packet are flipped, which the CRC can’t detect
Yes, that's always the risk. It's not can't detect-- it almost certainly detects it. It's just that it's not guaranteed to detect it.
It has nothing to do with MTU-- Even a 1500 MTU is much larger than the 4 octet error burst a CRC-32 is guaranteed to detect. On the other hand, the errored packet only has a 1 in 4 billion chance of getting through.
> 100G Ethernet transmits a scary amount of bits so something that would have been rare in 10Base-T might happen every few minutes.
The question is, what's the errored frame rate? 100G ethernet links have low error rates (in CRC-errored packets per second) compared to the 10baseT networks I administered. I used to see a few errors per day. Now I see a dozen errors on a circuit that's been up for a year (and maybe some of those were when I was plugging it in). 1 in 4 billion of those you're going to let through incorrectly.
Keep in mind that faster Ethernet has set tougher bit error rate requirements, and if links are delivering the BER in the standard, the mean time to an undetected packet error is something like the age of the universe.
(Of course, there's plenty of chance for even those frames that get through cause no actual problem-- even though the TCP checksum is weak, it's still going to catch a big fraction of the remaining frames).
The bigger issue is that if there's any bad memory, etc, ... there's no L2 CRC protecting it most of the time. And a frame that is garbled by some kind of DMA, bus, RAM, problem while not protected by the L2 CRC has a decent risk of getting past the weak TCP checksum.
It would be nice to corroborate this reason with another source, because my understanding is that clock synchronization was not a factor in determining the MTU, which seems really more like an OSI layer 2/3 consideration.
I am surprised the PLLs could not maintain the correct clocking signal, since the signal encodings for early ethernet were "self-clocking" [1,2,3] (so even if you transmitted all 0s or all 1s, you'd still see plenty of transitions on the wire).
Note that this is different from, for example, the color burst at the beginning of each line in color analog TV transmission [4]. It is also used to "train" a PLL, which is used to demodulate the color signal transmission. After the color burst is over, the PLL has nothing to synchronize to. But the 10base2/5/etc have a carrier throughout the entire transmission.
I would agree. Having worked on an Ethernet chipset back in 1988/89, keeping the PLL synced was not a problem. I can't remember what the maximum packet size we supported was (my guess is 2048), but that was driven more by buffering to SRAM and needing more space for counters.
The datasheet for the NS8391 has no such requirement for PLL sync.
As others have said, with Manchester encoding 10BASE2 is self-clocking, you can use the data to keep your PLL locked, just as you would on modern ethernet standards. However I imagine with these standards you may not even have needed an expensive/power-hungry PLL, probably you could just multi-sample at a higher clock rate like a UART did (I don't actually know how this silicon was designed in practice).
Further, PLLs have not gotten a lot better; if anything, a lot worse. Maybe back when 10BASE2 was introduced you could train a PLL on 16 transitions and then have acquired lock, but there's no way you can do that anymore (at modern data rates). PCI Express takes thousands of transitions to exit L0s->L0, which is all to allow for PLL lock.
My best guess for the 1500 number is that with a 200 ppm clock difference between the sender and receiver (the maximum allowed by the spec, which says your clock must be +-100 ppm), after 1500 bytes you have slipped 0.3 bytes: (2 × 100e-6) × 1500 = 0.3. You don't want to slip more than half a byte during a packet, as it may result in a duplicated or skipped byte in your system clock domain.
I thought most Ethernet PHYs don't lock actually to the clock, but instead use a FIFO that starts draining once it's half way full. The size of this FIFO is such that it doesn't under or overflow given the largest frame size and worst case 200 PPM difference.
I figured this is what the interframe gap is for - to allow the FIFO to completely drain.
The IFG is really more to let the receiver know where one stream of bits stops and the next stream of bits starts. How they handle the incoming spray of data is up to them on a queue/implementation level.
The original MTU was 576 bytes, enough for 512 bytes of payload plus 64 bytes for the IP and TCP header with a few options. 1500 bytes is a Berkeleyism, because their TCP was originally Ethernet-only.
The default MTU for a T1/E1 was usually 1500. The default for HSSI was 4470, which meant the default for DS3 circuits was 4470. This was also the usual default MTU for IP over ATM, which is what most OC-3 circuits would have been using when they were initially rolled out for backbone use. This remained the usual default MTU all the way through OC-192 circuits running Packet over SONET.
I left the ISP backbone and large enterprise WAN field around that time and can't speak to more recent technologies.
I feel like the last piece we’re missing in this story is the performance impact of fragmentation. Like why not just set all new hardware to an MTU of 9000 and wait ten years?
>I feel like the last piece we’re missing in this story is the performance impact of fragmentation. Like why not just set all new hardware to an MTU of 9000 and wait ten years?
Because a node with an MTU of 9000 will very likely be unable to determine the MTU of every link in its path. At best, you'll see fragmentation. At worst, the node's packets will be registered as interface errors when they encounter an interface lower than 9k. Neither of those is desirable.
> Like why not just set all new hardware to an MTU of 9000 and wait ten years?
The hardware in question is Ethernet NICs. However, for you to set the MTU on an Ethernet NIC to 9000, every device on the same Ethernet network (at least the same Ethernet VLAN), including all other NICs and switches, including ones which aren't connected yet, must also support and be configured for that MTU. And this also means you cannot use WiFi on that Ethernet network (since, at least last time I looked, WiFi cannot use a MTU that large).
Sending a jumbo frame down a line that has hardware that doesn’t support jumbo frames somewhere along the way does not mean the packet gets dropped. The NIC that would send the jumbo frame fragments the packet down to the lower MTU. So what’s the performance impact of that fragmentation? If it isn’t higher than the difference in bandwidth overhead from headers of 9000 MTU traffic vs. 1500 MTU traffic then why not transition to 9000 MTU?
But how does the NIC know that, 11 hops away, there is a layer 2 device that only supports a 1500 byte frame and cannot communicate with the NIC (switches do not typically have the ability to communicate directly with the devices generating the packets)?
Now you need Path MTU discovery, which as the article indicates, has its own set of issues. (Overhead from trial and error, ICMP being blocked due to security concerns, etc...)
So now you're trying to communicate from your home machine to some random host on the internet (website, VPS, streaming service), and you're configured for MTU 9000, the remote service is also configured for MTU 9000, but some transit provider in the middle is not, and they've disabled ICMP for $reasons.
Transit providers should push packets and generally do. With PMTU failures it's usually clueless network admins on firewalls nearer endpoints. And no, you don't and I wish the admin responsible could feel your pain.
Now if you can make it 'will always just push packets', we'll be golden.
Unfortunately, there are enough ATM/MPLS/SONET/etc networks being run by people who no longer understand what they're doing, that we're never going to get there.
To make matters more entertaining, IPv6 depends on icmp6 even more.
Why should it need to? Ethernet is designed to have non-deterministic paths (except in cases of automotive, industrial, and time sensitive networks). If you get to a hop that doesn’t support jumbo frames then break it into smaller frames and send them individually. The higher layers don’t care if the data comes in one frame or ten.
> Sending a jumbo frame down a line that has hardware that doesn’t support jumbo frames somewhere along the way does not mean the packet gets dropped
Almost all IP packets on the internet at large have the 'do not fragment' flag set. IP defragmentation performance ranges from pretty bad to an easy DDoS vector, so a lot of high traffic hosts drop fragments without processing them.
If we had truncation (with a flag) instead of fragmentation, that might have been usable, because the endpoints could determine in-band the max size datagram and communicate it and use that; but that's not what we have.
AFAIK, Ethernet has no support for fragmentation; I've never seen, in the Ethernet standards I've read (though I might have missed it), a field saying "this is a fragment of a larger frame". There's fragmentation in the IP layer, but it needs: (a) that the frame contains an IP packet; (b) that the IP packet can be fragmented (no "don't fragment" on IPv4, or a special header on IPv6); (c) that the sending host knows the receiving host's MTU; (d) that it's not a broadcast or multicast packet (which have no singular "receiving host").
You can have working fragmentation if you have two separate Ethernet segments, one for 1500 and the other for 9000, connected by an IP router; the cost (assuming no broken firewalls blocking the necessary ICMP packets, which sadly is still too common) is that the initial transmission will be resent since most modern IP stacks set the "don't fragment" bit (or don't include the extra header for IPv6 fragmentation).
It does mean packets sent to another local, non-routed, non-jumbo-frame interface would get lost. So you could, for example, maybe talk to the internet, but you couldn't print anything to the printer down the hall.
It doesn't matter that the routers respond to ICMP, it matters that they generate them, and that they're addressed properly, and that intermediate routers don't drop them.
Some routers will generate the ICMPs, but are rate limited, and the underlying poor configuration means that the rate limits are hit continously and most connections are effectively in a path mtu blackhole.
>It doesn't matter that the routers respond to ICMP, it matters that they generate them, and that they're addressed properly, and that intermediate routers don't drop them.
>Some routers will generate the ICMPs, but are rate limited, and the underlying poor configuration means that the rate limits are hit continously and most connections are effectively in a path mtu blackhole.
Sure. But I'm not about to sit here and name all the different reasons for folks. And since most people here don't have a strong networking background and are running consumer grade routers at home, it seemed the most applicable example.
I could have used a more encompassing term like PMTU-D blackhole, but I didn't.
The worst case I just recently encountered with Jumbo Frames was with NetworkManager trying to follow local DNS server's advertised MTU but when the local interface doesn't support Jumbo Frames it just dies and keeps looping.
Even if you really want devices to use JF, some fail miserably because it's just not well thought out.
Plenty of routers today that can't fragment packets. And they have rate limiters where they can only generate a small amount of ICMP 3/4s (maybe 50 a second).
The problem seems to be that both the IEEE and the IETF don't want to do anything.
IEEE could define a way to support larger frames. 'Just wait 10 years' doesn't strike me as the best solution, but at least it is a solution. In my opinion a better way would be for all devices to report the max frame length they support. Bridges would just report the minimum over all ports on the same VLAN. When there are legacy devices that don't report anything, just stay at 1500.
IETF can also do something today by having hosts probe the effective max frame length. There are drafts, but they don't go anywhere because too few people care.
The article talks about how the 1500 byte MTU came about but doesn't mention that the problem of clock recovery was solved by using 4b/5b or 8b/10b encoding when sending Ethernet through twisted-pair wiring. This encoding technique also provides a neutral voltage bias.
EDIT: As pointed out below, I failed to account for the clock-rate being 25% faster than the bit-rate in my original assertion that Ethernet over twisted-pair was only 80% efficient due to the encoding (see below)
You're absolutely correct ... it's been a long time since I was designing fiber transceivers but I should have remembered this. Ultimately efficiency is also affected by other layers of the protocol stack too (UDP versus TCP headers) which also explains why larger frames can be more efficient. In the early days of RTP and RTSP, there were many discussions about frame size, how it affected contention and prioritization and whether it actually helped to have super-frames if the intermediate networks were splitting and combining the frames anyway.
I'm not a hardware engineer, but from some quick research it appears that 100 megabit ethernet ("fast ethernet") transmits at effectively 125 MHz. So the 100 megabit number describes the usable bit rate, not the electrical pulses on the wire.
Gigabit Ethernet is more complicated, and it uses multiple voltages and all four pairs of wires bidirectionally. So it is not just a single serial stream of on/off.
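(The 125 MHz figure is the 4b/5b expansion at work: every 4 data bits are sent as a 5-bit code group, so 100 Mbit/s of data becomes 125 Mbaud of symbols on the wire, and the extra code groups guarantee enough transitions for clock recovery.)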
I used to glue stuff together to FDDI rings and Token Ring networks back in the day (I used Xylan switches, which had ATM-25 UTP line cards among other long-forgotten oddities), and MTU sizes always struck me as being particularly arbitrary.
But I'm not really sure about the clock sync limitations being a factor here. It was way back in the deepest past.
What I do remember vividly is the mess that physical layer networking evolved into over the years thanks to dial-up and DSL (ever had to set your MTU to 1492 to accommodate an extra PPP header?).
And something is obviously wrong today, since we're still using the same baseline value for our gigabit fiber to the home connections, our 3/4/5G (scratch to taste) mobile phones, etc.
PPPoE was such an ugly mess. In its early days there were various server-side OSes (IRIX and one other I can't remember) that had piss-poor TCP/IP MTU implementations. The result was that random websites just took forever to load, as only the packets that happened not to fill the full 1492 limit eventually got the data through. By around 2003 I rarely encountered it anymore, but then I moved to a place with cable and never had to deal with it again.
> ever had to set your MTU to 1492 to accommodate an extra PPP header?
I had to replace my Apple AirPort Extreme when I got gigabit fiber since it didn't have a manual MTU setting and it didn't autodetect the MTU properly over PPPoE... In 2020 I still need to manually set the MTU on my Ubiquiti USG...
> ever had to set your MTU to 1492 to accommodate an extra PPP header?
Ah, I was always wondering why my ISP configured my fiber modem's mtu to 1492. So it's due to using PPPoE? Is there no way to use bigger mtu when using PPPoE?
Nowadays, there's PPPoA (over ATM) which wraps at a lower level, and allows 1500 byte ethernet payloads through. But running the ethernet over ATM at 1508 MTU so that PPPoE would be 1500 was probably out of reach --- when PPPoE was introduced, the customer endpoint was often the customer PC, and some of those were using fairly old nics that might not have supported larger packets.
Sadly, smaller than 1500 byte MTUs still cause issues for some people to this day. It's all fine if everything is properly configured, or if at least everything sends and receives ICMP, but if something is silently dropping packets, you're in for a bad day. These days, I think it's usually problems with customers sending large packets, as opposed to early days where receiving large packets would routinely fail, but a lot of that is because large sites gave up on sending large packets.
RFC 4638 provides a mechanism for this. It relies on the Ethernet devices supporting a ‘baby jumbo’ MTU of 1508 bytes and support for it is still a bit scarce.
When networks were new, computers connected to each other using a shared trunk that you physically drilled into. It's a non-trivial problem to send data over a shared channel; it's very easy for two systems to clobber each other. A primitive, but somewhat effective mechanism is ALOHA (https://en.wikipedia.org/wiki/ALOHAnet), where multiple senders randomly try to send their message to a single receiver. The single receiver then repeats back any messages it successfully receives. In that way the sender is able to confirm its message got through -- an ack. After a certain amount of time with no ack, senders repeat their messages. As you can imagine, shorter packets are less likely to cause collisions.
Ethernet uses something similar, but is able to detect if someone else is using the wire, called carrier sense. A relatively short maximum packet size of 1500 bytes reduced the likelihood of collisions.
Not anymore for all practical purposes, but it once did for the very old 10Base-2 standard for Ethernet over coaxial cable. This is practically why the old MII Ethernet PHY interface protocol had the collision-sense lines to indicate to the MAC to stop sending data if it detects incoming data, in attempts to minimize collisions.
Unreliable IP fragmentation, and the brokenness of Path MTU Discovery (PMTUD), is causing the DNS folks to put a clamp on (E)DNS message sizes:
I remembered something different, related to the shared medium and CSMA/CD, where 1500 ensured fairness and the minimum of 46 related to propagation time over the longest allowable cable.
Minor factoid the article does not mention. ATM is an alternative to Ethernet that's used in many optical fiber environments. The "transfer unit" size of the ATM "cell" is 53 bytes (5 for the header and 48 for the payload). This is much smaller than 1500.
Another quirky story from the past:
Sometime around 20 years ago I was having a bizarre networking problem. I could telnet into a host with no trouble, and the interactive session would be going just fine until I did something that produced a large volume of output (such as 'cat' on a large file). At that point the session would freeze and I would eventually get disconnected. After troubleshooting for a while I identified the problem as one of the Ethernet NICs on the client host. It was a premium NIC (3Com 3C509). Nonetheless, the NIC crystal oscillator frequency had drifted sufficiently that it would lose clock synchronization to the incoming frame if the MTU was larger than about 1000.
Speaking about ATM: the 48 byte payload was a standardization compromise between Europe and the US.
US companies had prototypes using 64 bytes, while European companies used 32 bytes. To avoid anyone giving a competitive advantage, they decided on a middle ground of 48.
There were trade-offs between 32 and 64 bytes as well: a 32 byte payload had a higher overhead than a 64 byte payload, but it had a shorter transmission time which made it easier to do voice echo cancellation.
Or so I was told many decades ago when I got introduced to ATM systems...
I think the author may have made a mistake in some of the math. The frame size distribution plots are likely based on the number of frames, not the amount of data contained in said frames. The 1500 byte and other large frames should therefore account for the lion's share of the actual data transferred. Correcting this error will totally change the final two graphs.
Ah yeah, that's probably true. According to some back of the envelope math, it seems like the distribution should be more like 5%, 1%, 1%, 3%, 50%, 39%, ignoring the first and last size bins.
I find that technology cements in strata (the archaeology term) just as the layers that accumulate as the result of natural processes and human activity. The dynamics are not exactly the same but the tendency is similar. I wonder whether we'll always be capable of digging down deeper to the beginnings as things get more and more complicated.
I don't think the Ethernet Frame Overhead graph is correct. Surely the overhead is proportionally higher, per amount of data, for smaller packets. That graph shows that the overhead is just proportional to the amount of data sent, irrespective of the packet size, which can't be right.
Not my field, so I might be making an obvious error here, but:
If there are efficiency gains to be had from using jumbo frames, wouldn't setting my MTU to a multiple of 1500 still be of some benefit? If my PC, my switch and my router all support it, that would still be a tiiiny improvement. If the server's network does as well, and let's say both of our direct providers, even if none of the exchanges or backbones in between do, that would still be an efficiency gain for ~10% of the link, right?
Locally you can set your MTU to larger than 1500, but if you (generally) try and send a packet towards the internet larger than 1500 it will be dropped without a trace, or it will be dropped and an ICMP message will be generated to tell your system to lower the MTU. Assuming you have not firewalled off ICMP ;)
As a handy feature, on Linux at least, you can set your interface MTU to 9000 locally and then give the default (internet-bound) route an MTU of 1500 to prevent issues; iproute2 lets you attach an mtu option to individual routes.
You could...if the software on both your server and your media pc support large frames. And you're willing to deal with every once in a while some piece of software doing the wrong thing and sending out every packet with large MTU without doing detection on max packet size.
Potentially but troubleshooting performance issues from mismatched MTU can be brutal so most providers drop anything over 1500.
Many devices can do over 1500 but anyone who has done so without careful consideration knows the outcome isn't predictable unless everyone on the network is prepared to do so.
A dedicated / controlled SAN type environment can do it just fine, beyond that it can be difficult.
Off-topic but looking at that old network card picture reminded me of a very vague memory of more than one card with a component that looked like a capacitor, except it looked cracked.
Is my mind playing tricks? Were they faulty units or was there meant to be a crack?
That looks like a capacitor with a built-in spark gap. I've seen them in CRT circuitry. They're probably using it in the network card so if there's a huge over-voltage (e.g. lightning somewhere), it will jump across the spark gap, limiting damage.
I still have a load of them gathering dust somewhere. However a system with ISA on it is a bit rare now and I'm not sure I can be bothered to compile a modern kernel small enough to boot on one. Besides, it will probably need cross compiling on something with some grunt that has heard of the G unit prefix.
So they just picked an arbitrary number that felt right? I expected the story to be more interesting than that, given the title. Still, there was some interesting trivia surrounding the core question.
> If we look at data from a major internet traffic exchange point (AMS-IX), we see that at least 20% of packets transiting the exchange are the maximum size.
He’s so optimistic. My brain heard this as “only 20% of packets […] are the maximum size”
What are all of those 64 byte packets? Interactive shells, or some other low bitrate protocol?
I've always wondered how 9000 became "jumbo". Technically anything over 1500 is considered jumbo and there is no standard. The largest I've seen is 16k. I think there are some CRC accuracy concerns at larger sizes, but 9k still seems quite arbitrary for computer land.
The explanation according to https://web.archive.org/web/20010221204734/http://sd.wareone... is: "First because ethernet uses a 32 bit CRC that loses its effectiveness above about 12000 bytes. And secondly, 9000 was large enough to carry an 8 KB application datagram (e.g. NFS) plus packet header overhead."
That is, 9000 is the first multiple of 1500 which can carry an 8192-byte NFS packet (plus headers), while still being small enough that the Ethernet CRC has a good probability to detect errors.
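As a quick sanity check on that multiple: 5 x 1500 = 7500, which can't hold an 8192-byte payload at all, while 6 x 1500 = 9000 leaves roughly 800 bytes for the RPC, UDP and IP headers on top of the 8192.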
Ethernet frame size was never strictly limited. The way the packet length works with Ethernet II frames (802.3 is more explicit, but never really caught on) is that the hardware needs to read all the way to the end of the packet and detect a valid CRC and a gap at the end before it knows the thing is done. So there's no reason beyond buffer size to put a fixed limit on it, and different hardware had different SRAM configuration.
This reminds me of various Windows applications back in the day (Windows 3.1 and 95) that claimed to fine tune your connection and one of the tricks they used was changing the MTU setting, as far as I can recall. Could anyone share how that worked?
If your computer uses a larger MTU than the next device upstream can handle, its packets will be fragmented, leading to increased CPU usage, increased work by the driver, higher I/O on the network interface, higher CPU load on your router or modem, etc., depending on where the bottleneck is. For example, if you connect over Ethernet to a DSL modem, or to a router that has a DSL uplink, all your packets will be fragmented. This is because PPPoE on DSL uses 8 bytes per packet for its PPPoE/PPP headers. So if you send a 1500 byte packet to the modem, it will get broken up by the modem into 2 packets: one is 1492+8 bytes, and the other is the leftover 8 bytes of payload plus a fresh 20-byte IP header, plus another 8 bytes of PPPoE.
But your PC is still sending more packets.. the modem is struggling to fragment them all and send them upstream.. its memory buffer is filling up.. your computer is retrying packets that it never got a response on..
By lowering your computer MTU to 1492 to start with, you avoid the extra work by the modem, which can have considerable speed increase.
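The arithmetic behind that magic number: the PPPoE header is 6 bytes and the PPP protocol field is 2 bytes, so 1500 - 8 = 1492 is the largest IP packet that still fits in a standard 1500-byte Ethernet payload once it's wrapped in PPPoE.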
Probably a dumb question: why is the maximum size (and the one with most of the packets) in the AMS-IX graph 1514 bytes instead of the 1500 bytes discussed in the article?
1500 bytes is the MTU of IP, in most cases. It often excludes the Ethernet header, which is 14 bytes excluding the FCS, preamble, IFG, and any VLANs.
If we have a 1500 byte MTU for IP, then we need at least a 1514 byte MTU for IP + Ethernet. We often call that 1514B-or-larger MTU the "interface MTU". It's unnecessarily confusing.
I think you're saying that the smallest bucket of packets is all packets that would have been combined with a larger packet if that had been an option... but that doesn't make sense. That class of packets includes TCP SYN, ACK, RST, and 128 bytes could fit an entire instant message on many protocols.
This would only be possible if you were talking from a jumbo-configured client (let's say you've set up your laptop correctly), across a jumbo-configured network (Starlink, in your scenario), to a jumbo-configured server (here's the problem).
The problem is that Starlink only controls the steps from your router to "the internet". If you're trying to talk to spacex.com it'd be possible, but if you're trying to talk to google.com then now you need Starlink to be peering with ISPs that have jumbo frames, and they need to peer with ISPs with jumbo frames, etc etc and then also google's servers need to support jumbo frames.
Basically, the problem is that Starlink is not actually end to end, if you're trying to reach arbitrary servers on the internet. It just connects you to the rest of the internet, and you're back to where you started.
This is also true for any other ISP, Starlink is not special in this regard.
True, you'd expect endpoints to support jumbo frames as well, but why not at least start making it possible? It's a chicken-and-egg problem otherwise. IPv6 was the same at the start.
Because you don't know what you're talking about and are engaging in "what if"-isms? There is no business case to solve with jumbos frames over the Internet. I've been in this business for 20 years. Seen this argument a dozen times. It never changes.