Sorry about the slightly-clickbaity title. I actually have at least a 10 GbE card (and switch) on the way to test those and see if I can get more out of it, but for _this_ test, I had a 4-interface Intel I340-T4, and I managed to get a maximum throughput of 3.06 Gbps when pumping bits through all 4 of those plus the built-in Gigabit interface on the Compute Module.
For some reason I couldn't break that barrier, even though all the interfaces can do ~940 Mbps on their own, and any three on the PCIe card can do ~2.8 Gbps. It seems like there's some sort of upper limit around 3 Gbps on the Pi CM4 (even when combining the internal interface) :-/
But maybe I'm missing something in the Pi OS / Debian/Linux kernel stack that is holding me back? Or is it a limitation of the SoC? I thought the Ethernet controller was separate from the PCIe lanes on it, but maybe there's something internal to the BCM2711 that's bottlenecking it.
It's a single-lane PCIe Gen2 interface. The theoretical max is 500 MB/sec, so you can't ever touch 10G with it. In reality, getting 75% of theoretical tends to be a rough upper limit on most PCIe interfaces, so the 3 Gbit you're seeing is pretty close to what one would expect.
edit: Oh, it's 3 Gbit across 5 interfaces, one of which isn't PCIe, so the PCIe side is probably only running at about 50%. It might be interesting to see if the CPUs are pegged (or just one of them). Even so, PCIe on the rpi isn't cache-coherent, so that is going to slow things down too.
So, this is sort of indicative of an RSS problem, but on the rpi it could be caused by other things. Check /proc/interrupts to make sure you have balanced MSIs, although that itself could be a problem too.
edit: run `perf top` to see if that gives you a better idea.
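A rough sketch of that interrupt check, if it helps (the interface names and the IRQ number are placeholders; adjust for your setup):

watch -n1 'grep -E "eth|enp" /proc/interrupts'
# if one core is taking all the hits, you can pin a queue's IRQ to another core,
# e.g. send IRQ 55 to CPU2 (mask 0x4):
echo 4 | sudo tee /proc/irq/55/smp_affinity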
>It might be interesting to see if the CPUs are pegged (or just one of them).
This is very likely the answer. I see a lot of people who think of the Pi as some kind of workhorse and are trying to use it for things that it simply can't do. The Pi is a great little piece of hardware, but it's not really made for this kind of thing. I'd never think about using a Raspberry Pi if I had to think about "saturating a NIC".
True true... though in my work trying to get a flexible 10 GbE network set up in my house, I've found that support for 2.5 and 5 GbE is iffy at best on many devices :(
Heh, I know that ~3 Gbps is the maximum you can get through the PCIe interface (x1, PCIe 2.0), so that is expected. But I was hoping the internal Ethernet interface was separate and could add 1 Gbps more... the CPU didn't seem to be maxed out and was also not overheating at the time (especially not with my 12" fan blasting on it).
I assume you saw the video with Plunkett that the RPF put out (https://youtu.be/yiHgmNBOzkc, specifically interesting at 10:45): he mentions he was testing 10 GbE fibre and reached 3.2 Gbit. Now he goes into absolutely zero detail on that, but I find it interesting you've both hit the same ceiling.
(He also mentioned 390MB/sec write speed to nvme, which is suspiciously close to the same ceiling)
Yeah, I think the PCIe link hits a ceiling around there.
Note that combining the internal interface with the 4 NIC interfaces, and overclocking to 2.147 GHz, got it up to 3.4 Gbps total. So IRQ handling looks like the main bottleneck for total network packet throughput.
Since you're also from the midwest, I'll put it in terms you'll understand: :-)
> I think the PCIe link hits a ceiling around there.
You're trying to shove 10 gallons of shit into a 5-gallon bucket!
--
I'm not sure how high you can set the MTU on those Pis (the Intels should handle 9000), but I'd set them as high as they'll go, if I were you. An MTU of 9000 basically means ~1/6th the interrupts.
You might be hitting the limits of the RAM. I think LPDDR3 maxes out at ~4.2Gbps, and running other bus masters like the HDMI and OS itself would be cutting into that.
LPDDR cannot sustain anywhere near the max speed of the interface. It's more of a hope that you can burst something out and go to sleep rather than trying to maintain that speed. In a lot of ways DRAM hasn't gotten faster in decades when you look at how latency clocks nearly always increase at the same rate of interface speed increases. And LPDDR is the niche where that shines the most, because it doesn't have oodles of dies to interleave to hide that issue.
Innumeracy strikes again. It's actually 4-5 Gbytes/s [1] plus whatever bandwidth the video scanout is stealing (~400 Mbytes/s?). That's only ~40% efficient which is simultaneously terrible and pretty much what you'd expect from Broadcom. However 4 Gbytes/s is 32 Gbits/s which leaves plenty of headroom to do 5 Gbits/s of network I/O.
Not in a holistic way AFAIK, and for sure not rigged up to the Raspbian kernel (since all of that lives on the videocore side), but I bet Broadcom or the RPi foundation has access to some undocumented perf counters on the DRAM controller that could illuminate this if they were the ones debugging it.
Technically it's not a lie—there are 5x1 Gbps of interfaces here. But I wanted to acknowledge that I used a technicality to get the title how I wanted it, because if I didn't do that, a lot of people wouldn't read it, and then we wouldn't get to have this enlightening discussion ;)
No, it is not. That NIC is a PCIe Gen2 NIC. By using only a single lane, you're limiting the bandwidth to ~500MB/sec theoretical. That's 4Gb/s theoretical, and getting 3Gb/s is ~75% of the theoretical bandwidth, which is pretty decent.
Can you run an lspci -vvv on the Intel NIC? I just re-read things, and it seems like 1 of those Gb/s is coming from the on-board NIC. I'm curious if maybe PCIe is running at Gen1
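Something like this should show the negotiated link (the 01:00.0 bus address is just a guess; check plain lspci output first for the NIC's address):

sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'
# LnkSta should read "Speed 5GT/s, Width x1" if it negotiated Gen2 x1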
So it's running Gen2 x1, which is good. I was afraid that it might have downshifted to Gen1. Other threads point to your CPU being pegged, and I would tend to agree with that.
What direction are you running the streams in? In general, sending is much more efficient than receiving ("it's better to give than to receive"). From your statement that ksoftirqd is pegged, I'm guessing you're receiving.
I'd first see what bandwidth you can send at with iperf when you run the test in reverse so this Pi is sending. Then, to eliminate memory bandwidth as a potential bottleneck, you could use sendfile. I don't think iperf ever supported sendfile (but it's been years since I've used it). I'd suggest installing netperf on this Pi, running netserver on its link partners, and running "netperf -t TCP_SENDFILE -H othermachine" to all 5 peers and see what happens.
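Rough sketch of driving all 5 links at once with sendfile (the peer addresses are placeholders for whatever your 5 link partners actually are):

for peer in 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2; do
  netperf -t TCP_SENDFILE -H "$peer" -l 30 &
done
wait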
Well, when a LAN is 1Gb/s they are actually not talking about real bits. It actually is 100MB/s max, not 125MB/s as one might expect. Back in the old days they used to call it baud.
This is wrong; 1 Gbps Ethernet is 125 MB/s (including headers/trailer and inter-packet gap so you only get ~117 in practice). Infiniband, SATA, and Fibre Channel cheat but Ethernet doesn't.
The 10:1 bits-to-bytes ratio common in some kinds of equipment is in fact 8b/10b line coding (a 5:4 overhead), used to make it easier to detect bit boundaries and to avoid various electrical problems with the signal.
Modems used to do this too. The 'cheat' is that they report Layer 1 bandwidth, which is a completely useless number to the end user. The bulk of the loss occurs between Layer 1 and Layer 2 (with dribs and drabs for packet headers and so forth)
I think I've found the bottleneck now that I have the setup up and running again today—ksoftirqd quickly hits 100% CPU and stays that way until the benchmark run completes.
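For anyone reproducing this: something like mpstat (from the sysstat package) makes the per-core softirq load visible while the benchmark runs:

# %soft is the softirq share per core; one pegged core points at IRQ/RSS imbalance
mpstat -P ALL 1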
You might want to try enabling jumbo frames by setting the MTU to something >1500 bytes. Doing so should reduce the number of IRQs per unit of time since each frame will be carrying more data and therefore there will be fewer of them.
According to the Intel 82580EB datasheet[1] it supports an MTU of "9.5KB." It's unclear if that means 9500 or 9728 bytes.
I looked briefly for a datasheet that includes the Ethernet specs of the BCM2711 but didn't immediately find anything.
Recent versions of iproute2 can output the maximum MTU of an interface via:
# Look for "maxmtu" in the output
ip -d link list
Barring that, you can try incrementally upping the MTU until you run into errors (see the sketch after these commands).
The MTU of an interface can be set via:
ip link set $interface mtu $mtu
Note that for symmetrical testing via direct crossover you'll want to have the MTU be the same on each interface pair.
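A quick-and-dirty way to probe that ceiling (interface name is a placeholder):

# bump the MTU until the driver rejects the value; the last "ok" is your maximum
for mtu in 1500 3000 6000 9000 9216 9500 9710; do
  sudo ip link set eth1 mtu "$mtu" && echo "mtu $mtu ok" || break
done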
I set the MTU to its max (just over 9000 on the Intel, heh), but that didn't make a difference. The one thing that did move the needle was overclocking the CPU to 2.147 GHz (from the base 1.5 GHz clock), and that got me to 3.4 Gbps. So it seems to be a CPU constraint at this point.
Oh shoot, completely forgot to do that, as I was testing a few things one after the other and it slipped my mind. I'll have to try and see if I can get a little more out.
You can also use ethtool -C on the NICs on both ends of the connection to rate-limit the IRQ handling, allowing you to optimize for throughput instead of latency.
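For example, something along these lines (the interface name and values are illustrative, not tuned):

# check what the driver currently uses, then raise the coalescing interval
ethtool -c eth1
sudo ethtool -C eth1 rx-usecs 100 tx-usecs 100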
I can get about 1.2-1.7 gigabit on the Pi 4 using a 2.5 GbE USB NIC (Realtek). Some other testing shows the vendor driver to be faster, but when I tested it on a much faster ARM board, I could get the full 2.5 GbE with the in-tree driver.
When you loop back Ethernet links in the same computer, you need to take care with the configuration, because normally the operating system will not route the packets through the external wires but will process them as if they were destined for localhost, so you will see a very large speed that has no relationship with the actual Ethernet speed.
How to force the packets through the external wires depends on the operating system. On Linux you must use namespaces and assign the two Ethernet interfaces that are looped on each other to two distinct namespaces, then set appropriate routes.
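A minimal sketch of that setup, assuming eth1 and eth2 are the two looped ports (interface names and addresses are placeholders):

sudo ip netns add ns0
sudo ip netns add ns1
sudo ip link set eth1 netns ns0
sudo ip link set eth2 netns ns1
sudo ip -n ns0 addr add 10.0.0.1/24 dev eth1
sudo ip -n ns1 addr add 10.0.0.2/24 dev eth2
sudo ip -n ns0 link set eth1 up
sudo ip -n ns1 link set eth2 up
# now the traffic has to cross the physical cable
sudo ip netns exec ns1 iperf3 -s &
sudo ip netns exec ns0 iperf3 -c 10.0.0.2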
Would that truly be able to test send / receive of a full (up to) gigabit of data to/from the interface? If it's loopback, it could test either sending 500 + receiving 500, or... sending 500 + receiving 500. It's like sending data through localhost, it doesn't seem to reflect a more real-world scenario (but could be especially helpful just for testing).
I think maybe they meant linking Port 1 to Port 2, and Port 3 to Port 4? Also I believe gigabit ethernet can be full duplex, so you should be able to send 1000 and receive 1000 on a single interface at the same time if it's in full duplex mode.
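If you have a reasonably recent iperf3 (3.7 or later, I believe), you can exercise both directions of a single link at the same time:

iperf3 -c 10.0.0.2 --bidir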
No, you still have to send the data over the PCIe link, but DPDK basically takes the kernel out of the picture and lets you stream packets to the NIC from userspace. The kernel won't need to deal with IRQs or timing or anything like that.
I might be making things up, but I believe you can also run code on some DPDK-capable NICs? i.e. beyond straight networking offload. If that's the case, you could try compressing the data before you DMA it to the NIC. This would make no sense normally, but if your bottleneck is in fact the PCIe x1 link and you want to saturate the network, it would be something worth trying.
I mean, really, the whole thing is at most a fun exercise, as the NIC costs more than the Pi.
It seems like there are two limits: the PCIe bus tops out around 3.2 Gbps, and total network bandwidth around 3 Gbps. So the total net bandwidth limits the 4x card, and also limits any combination of card interfaces and the built-in interface (I tested many combos).
Overclocking can get the total net throughput to 3.4 Gbps.
Also... tons more detail here: https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...