> 1. Data Plane Development Kit, which lets you skip the kernel IP stack (which takes thousands of CPU cycles per packet) and do packet processing in userland in just tens to hundreds of cycles per packet. http://dpdk.org/
I wish OS developers saw this as a problem. There is no reason kernel stacks should be so slow for tasks where all processing is done in the kernel. (For packets destined to userspace, you've got the syscall overhead to deal with.)
I recently tested the Linux network stack's PPS performance with an Intel X520 10GbE NIC. I used Debian testing, with the 3.12 kernel. My destination machine was an i7-3930K at stock speed. I wrote a simple kernel module adding an NF_IP_PRE_ROUTING hook returning NF_DROP with no processing, which would be the simplest possible code path. For a packet generator, I used another older machine with another X520, using the "pfsend" tool included with PF_RING, and the card in PF_RING DNA mode. That was easily able to saturate the link at line rate (14.8M PPS).
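The hook module described above is only a few lines. A minimal sketch, written against a 3.x-era kernel (the `nf_hookfn` signature and registration functions have changed in later kernels, e.g. `nf_register_hook` was later replaced by `nf_register_net_hook`):

```c
/* drop_all.c - drop every IPv4 packet at PRE_ROUTING with no processing.
 * Sketch for a 3.x-era kernel; hook signatures differ on newer kernels. */
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int drop_hook(unsigned int hooknum,
                              struct sk_buff *skb,
                              const struct net_device *in,
                              const struct net_device *out,
                              int (*okfn)(struct sk_buff *))
{
        return NF_DROP;                  /* no processing at all */
}

static struct nf_hook_ops drop_ops = {
        .hook     = drop_hook,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING, /* same slot as legacy NF_IP_PRE_ROUTING */
        .priority = NF_IP_PRI_FIRST,     /* run before anything else */
};

static int __init drop_init(void)
{
        return nf_register_hook(&drop_ops);
}

static void __exit drop_exit(void)
{
        nf_unregister_hook(&drop_ops);
}

module_init(drop_init);
module_exit(drop_exit);
MODULE_LICENSE("GPL");
```

Even though this is the shortest path through the stack that a module can arrange, every packet has still gone through the driver's skb allocation and the netfilter traversal machinery before it reaches the hook.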
The result: the kernel was only able to sustain about 2.8M PPS.
I then loaded the DNA driver on the destination machine and used the included "pfcount" tool: no packet drops - it was receiving the full 14M PPS.
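For reference, the PF_RING DNA workflow looks roughly like this (interface name and exact flags are assumptions and may differ by PF_RING version; check `pfsend -h` and `pfcount -h`):

```shell
# On the generator machine (assumes the DNA-bound port appears as dna0):
# replay a capture of minimum-size packets at full rate
./pfsend -i dna0 -f 64byte_packets.pcap

# On the destination machine: count packets straight off the NIC,
# bypassing the kernel stack entirely
./pfcount -i dna0
```

Because both tools talk to the NIC through DNA's memory-mapped rings, neither side touches the kernel network stack, which is why the generator can hit line rate and the receiver can keep up with it.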
I tested DPDK recently and had similar results.
I also modified the Linux ixgbe driver's ixgbe_clean_rx_irq() function, adding a step between the "fetch packets from RX ring" and "put packet in SKB and send to network stack" steps. Even when I added a bunch of useless comparisons per packet, I was able to get ~12-13M PPS. I could get line rate by just dropping the packets and doing no processing at all.
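That driver experiment amounts to a branch like the following inside the per-packet receive loop. This is a sketch from memory of the 3.x ixgbe source: `ixgbe_fetch_rx_buffer()` and `ixgbe_rx_skb()` are the real fetch/deliver functions of that era, but `early_drop` is a hypothetical module parameter added for the test.

```c
/* Inside ixgbe_clean_rx_irq()'s per-packet loop, between pulling the
 * descriptor off the RX ring and handing the skb to the stack.
 * Sketch only: approximates the 3.x ixgbe driver, not a verbatim patch. */
skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);
if (!skb)
        break;

if (early_drop) {               /* hypothetical module parameter */
        dev_kfree_skb_any(skb); /* free before the stack ever sees it */
        total_rx_packets++;
        continue;               /* ~line rate: no netfilter, no IP stack */
}

/* ... normal path: checksum offload results, VLAN, GRO ... */
ixgbe_rx_skb(q_vector, skb);
```

The gap between ~14M PPS here and ~2.8M PPS through the netfilter hook is the cost of the stack itself, since in both cases the packet is dropped as early as the respective layer allows.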
Definitely. I am not sure on what metrics Solarflare beats others (presumably including Intel), besides software support (OpenOnload implements the BSD sockets API and doesn't require software changes, while DPDK and others implement their own APIs). In my testing, the X520s have been capable of both sending and receiving at line rate.
It's possible the Solarflare NICs could be better at some metric like latency, by a matter of microseconds (only speculating here), but I can't see how they'd consider that worth the extra money for HTTP servers.