Sorry about the slightly-clickbaity title. I actually have at least a 10 GbE card (and switch) on the way to test those and see if I can get more out of it, but for _this_ test, I had a 4-interface Intel I340-T4, and I managed to get a maximum throughput of 3.06 Gbps when pumping bits through all 4 of those plus the built-in Gigabit interface on the Compute Module.
For some reason I couldn't break that barrier, even though all the interfaces can do ~940 Mbps on their own, and any three on the PCIe card can do ~2.8 Gbps. It seems like there's some sort of upper limit around 3 Gbps on the Pi CM4 (even when combining the internal interface) :-/
But maybe I'm missing something in the Pi OS / Debian/Linux kernel stack that is holding me back? Or is it a limitation of the SoC? I thought the Ethernet controller was separate from the PCIe lanes on it, but maybe there's something internal to the BCM2711 that's bottlenecking it.
It's a single-lane PCIe Gen2 interface. The theoretical max is 500 MB/sec, so you can't ever touch 10G with it. In reality, getting 75% of theoretical tends to be a rough upper limit on most PCIe interfaces, so the 3 Gbit you're seeing is pretty close to what one would expect.
edit: Oh, it's 3 Gbit across 5 interfaces, one of which isn't PCIe, so the PCIe side is probably only running at about 50%. It might be interesting to see if the CPUs are pegged (or just one of them). Even so, PCIe on the rpi isn't cache-coherent, so that is going to slow things down too.
So, this is sort of indicative of an RSS problem, but on the rpi it could be caused by other things. Check /proc/interrupts to make sure you have balanced MSIs, although that itself could be a problem too.
edit: run `perf top` to see if that gives you a better idea.
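A rough sketch of that interrupt check, if it helps (the interface names and the IRQ number are placeholders; adjust for your setup):

watch -n1 'grep -E "eth|enp" /proc/interrupts'
# if one core is taking all the hits, you can pin a queue's IRQ to another core,
# e.g. send IRQ 55 to CPU2 (mask 0x4):
echo 4 | sudo tee /proc/irq/55/smp_affinity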
>It might be interesting to see if the CPUs are pegged (or just one of them).
This is very likely the answer. I see a lot of people who think of the Pi as some kind of workhorse and are trying to use it for things that it simply can't do. The Pi is a great little piece of hardware, but it's not really made for this kind of thing. I'd never think about using a Raspberry Pi if I had to think about "saturating a NIC".
True true... though in my work trying to get a flexible 10 GbE network set up in my house, I've found that support for 2.5 and 5 GbE is iffy at best on many devices :(
Heh, I know that ~3 Gbps is the maximum you can get through the PCIe interface (x1, PCIe 2.0), so that is expected. But I was hoping the internal Ethernet interface was separate and could add 1 Gbps more... the CPU didn't seem to be maxed out and was also not overheating at the time (especially not with my 12" fan blasting on it).
I assume you saw the video with Plunkett that the RPF put out (https://youtu.be/yiHgmNBOzkc, specifically interesting at 10:45): he mentions he was testing 10 GbE fibre and reached 3.2 Gbit. Now he goes into absolutely zero detail on that, but I find it interesting you've both hit the same ceiling.
(He also mentioned 390MB/sec write speed to nvme, which is suspiciously close to the same ceiling)
Yeah, I think the PCIe link hits a ceiling around there.
Note that combining the internal interface with the 4 NIC interfaces, and overclocking to 2.147 GHz, got it up to 3.4 Gbps total. So IRQ handling looks like the main bottleneck for total network packet throughput.
Since you're also from the midwest, I'll put it in terms you'll understand: :-)
> I think the PCIe link hits a ceiling around there.
You're trying to shove 10 gallons of shit into a 5-gallon bucket!
--
I'm not sure how high you can set the MTU on those Pis (the Intels should handle 9000), but I'd set them as high as they'll go, if I were you. An MTU of 9000 basically means ~1/6th the interrupts.
You might be hitting the limits of the RAM. I think LPDDR3 maxes out at ~4.2Gbps, and running other bus masters like the HDMI and OS itself would be cutting into that.
LPDDR cannot sustain anywhere near the max speed of the interface. It's more of a hope that you can burst something out and go to sleep rather than trying to maintain that speed. In a lot of ways DRAM hasn't gotten faster in decades when you look at how latency clocks nearly always increase at the same rate of interface speed increases. And LPDDR is the niche where that shines the most, because it doesn't have oodles of dies to interleave to hide that issue.
Innumeracy strikes again. It's actually 4-5 Gbytes/s [1] plus whatever bandwidth the video scanout is stealing (~400 Mbytes/s?). That's only ~40% efficient which is simultaneously terrible and pretty much what you'd expect from Broadcom. However 4 Gbytes/s is 32 Gbits/s which leaves plenty of headroom to do 5 Gbits/s of network I/O.
Not in a holistic way AFAIK, and for sure not rigged up to the Raspbian kernel (since all of that lives on the videocore side), but I bet Broadcom or the RPi foundation has access to some undocumented perf counters on the DRAM controller that could illuminate this if they were the ones debugging it.
Technically it's not a lie—there are 5x1 Gbps of interfaces here. But I wanted to acknowledge that I used a technicality to get the title how I wanted it, because if I didn't do that, a lot of people wouldn't read it, and then we wouldn't get to have this enlightening discussion ;)
No, it is not. That NIC is a PCIe Gen2 NIC. By using only a single lane, you're limiting the bandwidth to ~500MB/sec theoretical. That's 4Gb/s theoretical, and getting 3Gb/s is ~75% of the theoretical bandwidth, which is pretty decent.
Can you run an lspci -vvv on the Intel NIC? I just re-read things, and it seems like 1 of those Gb/s is coming from the on-board NIC. I'm curious if maybe PCIe is running at Gen1
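Something like this should show the negotiated link (the 01:00.0 bus address is just a guess; check plain lspci output first for the NIC's address):

sudo lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'
# LnkSta should read "Speed 5GT/s, Width x1" if it negotiated Gen2 x1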
So it's running Gen2 x1, which is good. I was afraid that it might have downshifted to Gen1. Other threads point to your CPU being pegged, and I would tend to agree with that.
What direction are you running the streams in? In general, sending is much more efficient than receiving ("it's better to give than to receive"). From your statement that ksoftirqd is pegged, I'm guessing you're receiving.
I'd first see what bandwidth you can send at with iperf when you run the test in reverse so this Pi is sending. Then, to eliminate memory bandwidth as a potential bottleneck, you could use sendfile. I don't think iperf ever supported sendfile (but it's been years since I've used it). I'd suggest installing netperf on this Pi, running netserver on its link partners, and running "netperf -t TCP_SENDFILE -H othermachine" to all 5 peers and see what happens.
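Rough sketch of driving all 5 links at once with sendfile (the peer addresses are placeholders for whatever your 5 link partners actually are):

for peer in 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2; do
  netperf -t TCP_SENDFILE -H "$peer" -l 30 &
done
wait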
Well, when a LAN is 1Gb/s they are actually not talking about real bits. It actually is 100MB/s max, not 125MB/s as one might expect. Back in the old days they used to call it baud.
This is wrong; 1 Gbps Ethernet is 125 MB/s (including headers/trailer and inter-packet gap so you only get ~117 in practice). Infiniband, SATA, and Fibre Channel cheat but Ethernet doesn't.
The 10:1 bits-to-bytes ratio common in some kinds of equipment is in fact 8b/10b line coding (a 5:4 overhead), used to make it easier to detect bit boundaries and to avoid various electrical problems with the signal.
Modems used to do this too. The 'cheat' is that they report Layer 1 bandwidth, which is a completely useless number to the end user. The bulk of the loss occurs between Layer 1 and Layer 2 (with dribs and drabs for packet headers and so forth)
I think I've found the bottleneck now that I have the setup up and running again today—ksoftirqd quickly hits 100% CPU and stays that way until the benchmark run completes.
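For anyone reproducing this: something like mpstat (from the sysstat package) makes the per-core softirq load visible while the benchmark runs:

# %soft is the softirq share per core; one pegged core points at IRQ/RSS imbalance
mpstat -P ALL 1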
You might want to try enabling jumbo frames by setting the MTU to something >1500 bytes. Doing so should reduce the number of IRQs per unit of time since each frame will be carrying more data and therefore there will be fewer of them.
According to the Intel 82580EB datasheet[1] it supports an MTU of "9.5KB." It's unclear if that means 9500 or 9728 bytes.
I looked briefly for a datasheet that includes the Ethernet specs of the BCM2711 but didn't immediately find anything.
Recent versions of iproute2 can output the maximum MTU of an interface via:
# Look for "maxmtu" in the output
ip -d link list
Barring that, you can try incrementally upping the MTU until you run into errors (see the sketch after these commands).
The MTU of an interface can be set via:
ip link set $interface mtu $mtu
Note that for symmetrical testing via direct crossover you'll want to have the MTU be the same on each interface pair.
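A quick-and-dirty way to probe that ceiling (interface name is a placeholder):

# bump the MTU until the driver rejects the value; the last "ok" is your maximum
for mtu in 1500 3000 6000 9000 9216 9500 9710; do
  sudo ip link set eth1 mtu "$mtu" && echo "mtu $mtu ok" || break
done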
I set the MTU to its max (just over 9000 on the Intel, heh), but that didn't make a difference. The one thing that did move the needle was overclocking the CPU to 2.147 GHz (from the base 1.5 GHz clock), and that got me to 3.4 Gbps. So it seems to be a CPU constraint at this point.
Oh shoot, completely forgot to do that, as I was testing a few things one after the other and it slipped my mind. I'll have to try and see if I can get a little more out.
You can also use ethtool -C on the NICs on both ends of the connection to rate-limit the IRQ handling, allowing you to optimize for throughput instead of latency.
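For example, something along these lines (the interface name and values are illustrative, not tuned):

# check what the driver currently uses, then raise the coalescing interval
ethtool -c eth1
sudo ethtool -C eth1 rx-usecs 100 tx-usecs 100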
I can get about 1.2-1.7 gigabit on the Pi 4 using a 2.5 GbE USB NIC (Realtek). Some other testing shows the vendor driver to be faster, but when I tested it on a much faster ARM board, I could get the full 2.5 GbE with the in-tree driver.
When you loop back Ethernet links in the same computer, you need to take care with the configuration, because normally the operating system will not route the packets through the external wires but will process them as if they were destined for localhost, so you will see a very large speed that has no relationship with the actual Ethernet speed.
How to force the packets through the external wires depends on the operating system. On Linux you must use namespaces and assign the two Ethernet interfaces that are looped on each other to two distinct namespaces, then set appropriate routes.
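A minimal sketch of that setup, assuming eth1 and eth2 are the two looped ports (interface names and addresses are placeholders):

sudo ip netns add ns0
sudo ip netns add ns1
sudo ip link set eth1 netns ns0
sudo ip link set eth2 netns ns1
sudo ip -n ns0 addr add 10.0.0.1/24 dev eth1
sudo ip -n ns1 addr add 10.0.0.2/24 dev eth2
sudo ip -n ns0 link set eth1 up
sudo ip -n ns1 link set eth2 up
# now the traffic has to cross the physical cable
sudo ip netns exec ns1 iperf3 -s &
sudo ip netns exec ns0 iperf3 -c 10.0.0.2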
Would that truly be able to test send / receive of a full (up to) gigabit of data to/from the interface? If it's loopback, it could test either sending 500 + receiving 500, or... sending 500 + receiving 500. It's like sending data through localhost, it doesn't seem to reflect a more real-world scenario (but could be especially helpful just for testing).
I think maybe they meant linking Port 1 to Port 2, and Port 3 to Port 4? Also I believe gigabit ethernet can be full duplex, so you should be able to send 1000 and receive 1000 on a single interface at the same time if it's in full duplex mode.
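If you have a reasonably recent iperf3 (3.7 or later, I believe), you can exercise both directions of a single link at the same time:

iperf3 -c 10.0.0.2 --bidir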
No, you still have to send the data over the PCIe link, but DPDK basically takes the kernel out of the picture and lets you stream packets to the NIC from userspace. The kernel won't need to deal with IRQs or timing or anything like that.
I might be making things up, but I believe you can also run code on some DPDK-capable NICs? i.e. beyond straight networking offload. If that's the case, you could try compressing the data before you DMA it to the NIC. This would make no sense normally, but if your bottleneck is in fact the PCIe x1 link and you want to saturate the network, it would be something worth trying.
I mean, really, the whole thing is at most a fun exercise, as the NIC costs more than the Pi.
It seems like there are two limits: the PCIe bus tops out around 3.2 Gbps, and total network bandwidth around 3 Gbps. So the total net bandwidth limits the 4x card, and also limits any combination of card interfaces and the built-in interface (I tested many combos).
Overclocking can get the total net throughput to 3.4 Gbps.
Also... tons more detail here: https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...