Kernel-Bypass Networking (godaddy.com)
81 points by bbowen on Dec 10, 2019 | 55 comments



Couldn't read that in Hong Kong because GoDaddy automatically redirects me to the hk.godaddy.com domain, which doesn't have that article.

Update: If you are like me and can't see the article at your localised GoDaddy.com website, you can (hopefully) select United States at the bottom of the page to force GoDaddy to serve you the US site.


How important is it to bypass the kernel if the kernel doesn't get to see/handle each packet individually?

As soon as you get the hardware to handle TCP reassembly and just wake the kernel up once per few megabytes of data sent/received, things scale well again.

There's work to do, though - there are no systems around today that I'm aware of which can send data from an SSD to a TCP socket (a common use case for a cache server) without the data itself going through the CPU, despite most chipsets allowing the network card to DMA data directly from a PCIe-connected SSD.


> As soon as you get the hardware to handle TCP reassembly and just wake the kernel up once per few megabytes of data sent/received, things scale well again.

Not a generic solution, because it increases latency.

Which you could avoid by doing everything in user or kernel space.


> As soon as you get the hardware to handle TCP reassembly and just wake the kernel up once per few megabytes of data sent/received, things scale well again.

Not everything in this world is TCP. TFA mentions DNS, for instance (which as of this time is still mostly UDP based).

> There's work to do though - there are no systems around today that I'm aware of which can send data from SSD to a TCP socket (common use case for cache server) without the data itself going through the CPU (despite most chipsets allowing the network card to be sent data directly from a PCIE-connected SSD).

They certainly exist, they are just not available or known to the general public (hint: look for hyperscalers that manufacture custom network cards).


I suppose the CPU would schedule a DMA transfer from the SSD to RAM, and then from RAM to the NIC.

For NICs that have memory-mapped buffers, it could be just one transfer.

What am I missing?


Nothing. That's how it would work. But today's Linux kernel doesn't do that (even when you use the sendfile() API).


> But today's Linux kernel doesn't do that (even when you use the sendfile() API)

Oh? I'm quite disappointed - I thought that was the whole point of sendfile()! What does it do, if not that?


A simple data transfer is: SSD -> kernel -> user space -> kernel -> NIC

sendfile() allows for: SSD -> kernel -> NIC

which is already a major improvement
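
For the curious, a minimal sketch of that path (assuming sock is an already-connected TCP socket; error handling trimmed):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Serve a file over a connected TCP socket with sendfile(2):
       the payload goes page cache -> socket without entering user space. */
    ssize_t serve_file(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            /* the kernel copies page-cache pages straight to the socket */
            ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
            if (n <= 0)
                break;
        }
        close(fd);
        return off;  /* bytes actually handed to the socket */
    }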


Right, silly me, I had forgotten userspace :). Thanks!


sendfile() has done zero-copy networking for a while (the data still goes through main memory and it gets DMAed twice, but the cores don't touch it). More recently, NVMe controller memory buffers should allow data to be DMAed from the SSD to the NIC without hitting RAM and it looks like Linux is adding support for it using the p2pdma subsystem.


It's pretty effective for things like Intrusion Detection Systems (IDS), at least in my experience. I don't know enough about all of the glue in-between, but things like Suricata scream when combined with dedicated kernel bypass cards.


How would your “SSD to TCP” solution handle TLS, which is pretty ubiquitous? What about the SSD being encrypted, or using a complex file system (ZFS, etc.) where data is stored in non-contiguous blocks?


Pass TLS session key to hardware. Hardware does encryption.

Where the data to be sent is non-contiguous in storage, the OS would probably have to make two requests (although those two requests could still be made simultaneously, so not requiring an extra wake-up/interrupt)


> Typically, an application using BSD sockets uses system calls to read and write data to a socket. Those system calls have overhead due to context switching and other impacts.

On the other hand the kernel isn't standing still, overhead reductions have been trickling in over decades. sendfile, epoll, recv-/sendmmsg, all the multi-queue stuff, kTLS with hardware offload, io_uring, p2p-dma. The C10k problem was tackled in 1999, userspace APIs can get you much further today.
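
To make one of those concrete: handing a negotiated TLS 1.2 AES-128-GCM key to kTLS is a couple of setsockopt() calls, after which plain write()/sendfile() on the socket is encrypted in the kernel (or on a capable NIC). A sketch only - the key material here is a placeholder that in reality comes out of your TLS library's handshake:

    #include <linux/tls.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef TCP_ULP            /* older libc headers may lack these */
    #define TCP_ULP 31
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    int enable_ktls_tx(int sock,
                       const unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE],
                       const unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE],
                       const unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE],
                       const unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE])
    {
        /* attach the "tls" upper-layer protocol to the TCP socket */
        if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci;
        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, sizeof(ci.key));
        memcpy(ci.iv, iv, sizeof(ci.iv));
        memcpy(ci.salt, salt, sizeof(ci.salt));
        memcpy(ci.rec_seq, rec_seq, sizeof(ci.rec_seq));

        /* from here on, writes to this socket go out as TLS records */
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }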


The current approach is fundamentally not going to work in the long term. 100Gbps at line rate means single digits[1] of nanoseconds between frames. At that frequency, a cache miss is pretty bad.

This is all not to mention locks, or that there are competing functions running in most distros (turn off irqbalance completely and watch your forwarding rate increase).

The low-hanging fruit seems to have been picked as well - NAPI polling, interrupt coalescing, RSS + multi-queue NICs + SMP, etc. are already out there, and we're still struggling to do 10G line rate in the kernel... and data centers are moving quickly to 25/100G.

[1] Edited for terrible math - 10Gbps at line rate is 67ns per packet, 100Gbps is 6.7ns
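
A runnable version of that arithmetic (a minimum-size frame is 64B, or 84B on the wire once you count preamble and inter-frame gap):

    #include <stdio.h>

    int main(void)
    {
        const double wire_bits = 84 * 8;         /* 672 bits per min-size frame */
        const double rates[] = { 10e9, 100e9 };  /* 10G and 100G */

        for (int i = 0; i < 2; i++) {
            double pps = rates[i] / wire_bits;
            printf("%.0fG: %.1f Mpps, %.1f ns per frame\n",
                   rates[i] / 1e9, pps / 1e6, 1e9 / pps);
        }
        return 0;
    }

which prints 14.9 Mpps / 67.2 ns for 10G and 148.8 Mpps / 6.7 ns for 100G.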


We are not struggling to do line-rate 10G in the kernel. Modern 100GbE NICs (Mellanox, Solarflare) have happily done line rate with a stock upstream kernel for a while now (definitely since 4.x). You only need to tune your IRQ balancing, and you can probably get away with not even doing that. If you are buying 100GbE NICs you are also buying server-class (Xeon, Rome) processors that can keep up.

Source: I operate a CDN with thousands of 100GbE NICs on a stock upstream LTS kernel, with minimal kernel tuning.


You're saying you can forward 100Gbps at line rate (148MPPS) through a stock kernel?


You can get within a few percentage points, yes.

I just tested this with two hosts running a 4.14.127 upstream kernel with the upstream mlx5 driver and Mellanox ConnectX-5 cards, using 16 iperf threads:

[SUM] 0.0-10.0 sec 85.1 Gbits/sec

That's pretty close with no tuning, and well beyond the 10Gb/s we mentioned earlier.


Wrong. You’re like an order of magnitude wrong - rofl, there is no fucking way the stock Linux kernel will even do 40Mpps at 64-byte packets. It chokes way before that. This is partly why things like DPDK exist.


16 iperf threads...sending at what packet size? Do you understand the notion of line rate? 85Gbps at 1500B is only 7MPPS, which is half of 10Gbps at line rate.


Where are you getting your definitions? I have never seen "line rate" used to refer to packets per second.



"How do you fill a 100GBps pipe with small packets?"

"achieve 10 Gbps line rate at 60B frames"

"reaching line rate on all packet sizes"

Line rate is just bits per second. You have to add in a qualifier about packet size before you're talking about packets per second.


You're both right. It's an older term from the early '90s, when a router's selling point was being able to hit "line rate" at the smallest possible packet size - for example, how many tiny datagrams can you forward to fill that link. Back then, people were still doing lots of routing on general-purpose machines and Cisco/Juniper were just starting to get into the high-performance game.

These days line-rate just means sending enough traffic to fill the link at whatever rate you want. That's generally good enough for server folk since they just want to get you the cat pics ASAP.

That's not good enough for people running transit networks, since they care more about packets per second performance. Sending huge amounts of data is easy for them; what they really care about is PPS.

Aside: the next generations of router NPUs are trash in terms of PPS performance. I take that back, they're not trash. They're the trash in the dumpster fire. That's how bad they are. We're fairly screwed there.

My guess is GoDaddy was looking at increased PPS performance either for DNS or maybe building their own DDoS mitigation framework (Arbor gear is pricey).


Nope, I'm sorry you're not quite getting it here. Minimum Ethernet frame is 84B on the wire - it's simple enough from there.


I've never heard this weird qualification for the definition of "line rate" that it somehow requires minimum packet size, so I looked it up. The first three sources for a quoted big-g search all imply or directly state that it's the same as bandwidth:

https://blog.ipspace.net/2009/03/line-rate-and-bit-rate.html

https://www.reddit.com/r/networking/comments/4tk2to/bandwidt...

https://www.fmad.io/blog-what-is-10g-line-rate.html

Also, for gigabit networks, ethernet packets are padded to at least 512 bytes because of a bigger slot size: https://www.cse.wustl.edu/~jain/cis788-97/ftp/gigabit_ethern...


Line rate does imply pps at the smallest sized frames in the context of networking equipment performance. Vendors use it extensively in their docs.

64B is the minimum frame size in Ethernet; including the interframe gap and preamble, it's 84B on the wire. It is the same with Ethernet, Gigabit Ethernet, and even 100Gbit Ethernet, so that source is not correct.

https://kb.juniper.net/InfoCenter/index?page=content&id=KB14...


No, line rate does not "imply pps at the smallest sized frames."

Network hardware vendors always quote PPS using the smallest sizes. And this makes sense for things like route and switch processors. Perhaps that is what you are confusing.

You should reread your link a little more carefully. From your link:

">However it is also important to make sure that the device has the capacity or the ability to switch/route as many packets as required to achieve wire rate performance."

The key phrase there is "as required." Almost nobody needs to sustain forwarding Ethernet frames with empty TCP segments or empty UDP datagrams in them. In fact, many vendors will spec for an average size. Since packet size × PPS gives you your throughput, if the average packet size is larger you need much less PPS to achieve line rate.


Line rate doesn't imply small packets. But most userspace benchmarking uses 64B packets. That being said, the "imix" packet size, which is supposed to represent internet traffic, is around 500B.


There are numerous imix definitions floating around now; it really depends on who you ask. The more realistic ones define a pattern of different-sized packets. And performance varies greatly depending on which imix flavour you use, which probably goes back to the earlier poster's 'dumpster fire' comment (although I don't know exactly which NPU generation they are referring to).


Just like the sibling, I admit I’ve never heard this definition of line rate...


No, PPS and bandwidth are two distinct metrics. Although there can be a linear relationship between them, line rate does not always imply a payload equal or close to the interface MTU. You see this with network vendors and some of the higher-end gear. Network vendors always give specs for their gear by quoting both metrics. And there is some gear that is capable of doing line rate even with small packets, for example the Cisco ASR 1000:

"For example, because one of the newest Cisco routers, the Cisco ASR 1000 Series Router, is capable of forwarding packets at up to 16 Mp/s with services enabled, it can support the processing of the equivalent of 10 Gb/s of traffic at line rate, with services, even for small packets."[1]

[1] https://tools.cisco.com/security/center/resources/network_pe...


Does line rate imply smallest packets? For CDN style use cases, you want to use the whole pipe, and it's going to be mostly larger packets.


Yes.

The point of discussion here is that the Linux kernel struggles to do line rate 10Gbps. This was misinterpreted as "the Linux kernel struggles to do 10Gbps".


You’re sending 1500 byte MTU (1538 bytes on the wire) or maybe larger (9000 byte MTU) packets.

1538 bytes is 12,304 bits. 10,000,000,000 bits/sec / 12,304 bits/packet is 812,744 packets per second.

Now try it with 64 byte packets, which are 84 bytes on the wire.

14,880,952 packets per second.

And this is just 10Gbps.


You briefly mentioned Receive Side Scaling (RSS) with multiqueue network cards, but didn't mention Receive Flow Steering.

From the kernel docs: https://www.kernel.org/doc/Documentation/networking/scaling....

""" While RPS steers packets solely based on hash, and thus generally provides good load distribution, it does not take into account application locality. This is accomplished by Receive Flow Steering (RFS). The goal of RFS is to increase datacache hitrate by steering kernel processing of packets to the CPU where the application thread consuming the packet is running. RFS relies on the same RPS mechanisms to enqueue packets onto the backlog of another CPU and to wake up that CPU. """

I have zero problem doing 25G and 40G Ethernet in the Linux kernel in RHEL7 (I run Kubernetes clusters on them, in fact). Non-Ethernet (InfiniBand) 100G+ line rate is also totally doable, but IB is an entirely different beast. I agree with you it isn't going to work in the long term, but it should be fine in the medium term. The long-term Linux change is likely to just rewrite more and more of the existing filtering code on top of the eBPF VM, which is fast as holy hell and is replacing large swaths of existing filtering mechanisms for a reason.


What if you have no application and you're purely forwarding traffic? That's what I'm measuring here (and that's actually typically faster than passing packets up to user space apps to consume through syscalls).

25G line rate is 37MPPS - are you saying you have zero problem forwarding at that line rate? I'd be very surprised if that's being done in the kernel. I'd be more surprised if you said that you're consuming bytes from the network at that speed with user space apps and no kernel bypass.

XDP (the forwarding approach built on top of eBPF) is limited to ~20MPPS, as well: https://www.netronome.com/blog/bpf-ebpf-xdp-and-bpfilter-wha...
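
For reference, an XDP program is just restricted C compiled to eBPF and run at the driver level. A toy example (the port number is made up; compile with clang -target bpf, attach with iproute2 or libbpf):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int drop_udp_9999(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* every header access must be bounds-checked for the verifier */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        struct udphdr *udp = (void *)ip + ip->ihl * 4;
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        /* drop before the kernel stack ever sees the packet */
        return udp->dest == bpf_htons(9999) ? XDP_DROP : XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";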


For routing, most people prefer to use... routers. Linux isn't great for forwarding and IMO that's fine. As the article notes, there's always VPP when you need to do software routing for some reason.


Yes, and the current user space approaches are spinloop based so the machine isn't really general purpose at that point. You might as well just call it a router if all your cores are spinning on RSS/etc queues.


With one thread, maybe. But using multiple threads it's not even that hard. I've hit 100Gbps using the stock TCP stack and ~10 threads in Rust's Hyper without much trouble.

Another example, you can saturate 100Gbps with just 4 iPerf3 processes.


The kernel isn't made for non-interrupting networking. On BSD you can use network cards that, together with kernel bypass, deliver packets without requiring interrupts. Then add highly optimized SIMD parsing of many packets at once and you get the significant performance improvements necessary for high-throughput networking systems.


Disclaimer: I work on VPP.

The typical use cases are virtual network functions: think virtual switches/routers used to interconnect VMs or containers, containerized VPN gateways, etc. It is also used for high-performance L3-L4 load balancers and the like.

As pointed out by others, what is hard is moving small packets. TCP with iperf is not relevant for these kinds of workloads. It is easy to max out 100GbE with 1500-byte packets, but with 200-byte packets, not so much. This is why they communicate about PPS, not bandwidth.

Their results seem low, but it is hard to tell without knowing the platform or configuration. VPP can sustain 20+ Mpps per core (2 hyperthreads) on Skylake @ 2.5GHz (no turbo boost).


VPP is amazing - you made my work life much easier... for free :-). Great work. More people should dig into high-perf open source networking.

Thank you!!


I wonder what motivated GoDaddy to research this at all, since at the end they say that it isn't necessary to pursue their research any further. Driving tech-minded traffic with blog posts?


Have you seen how much a router costs? And probably they don't need to route any faster.


This was already an issue back at the beginning of the century at CERN, to handle high data rates.

Here is a relatively recent paper of the kind of work being done in this area,

https://iopscience.iop.org/article/10.1088/1748-0221/8/12/C1...


What are the disadvantages of kernel bypass?


You also tend to end up dedicating cores. That is a side effect of spinning in a loop checking for the next packet or queue to process, rather than waiting for an interrupt.

There is also the additional overhead of VFIO/IOMMUs, which can easily eat double-digit percentages of performance versus doing the processing in a trusted kernel context.

So with a normal machine the cores etc. are balanced between processes; with DPDK and friends that's hard, as some of the cores will be 100% consumed in those spinloops even if you're experiencing 5% line rate.
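
The shape of that spinloop, for the curious - a sketch only, with all of DPDK's EAL/port/queue setup omitted and port 0 / queue 0 assumed:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void rx_loop(uint16_t port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {  /* this core is now 100% busy, even at 5% line rate */
            uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);
            for (uint16_t i = 0; i < n; i++) {
                /* ... look at the packet ... */
                rte_pktmbuf_free(pkts[i]);
            }
        }
    }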


You don't get kernel features anymore, and have to re-implement the ones you need yourself.


Correct - things you take for granted like ARP and TCP/IP are completely up to you to take care of. Further, the kernel has little visibility into most kernel-bypass stacks, so /proc or iproute are often blind to what's happening.

DPDK does have "kernel interfaces", so you can direct packets to the kernel.


I wonder why this sort of thing seems to be thought radical. InfiniBand RDMA is well established, with low latency and high bandwidth. (There was DMA between the micro-kernel-ish systems we used in the 1980s which got basically Ethernet line speed on <1 MIPS systems, as I recall; I assume it wasn't a new idea then.)


And Fibre Channel, and various other protocols too. The point being that Ethernet+TCP/IP is uniquely poor/difficult at offload, and the minimum packet sizes are tiny.

TCP is genius for a WAN, but unlike most things designed in the past 25 or so years, robustness precedes performance.


Kernel bypass isn't the same thing as offload. I don't understand "minimum packet sizes are tiny". The 1980s system was driving Ethernet, just not with Unix/sockets.


Berkeley not Berkely




