This must be the 10th blog post to land on HN on the same topic, and they all walk through the same steps on the same hardware (ixgbe), which, by the way, is a hard prerequisite for many of these strategies to be effective.
In any case, stop reinventing the wheel; just use a purpose-made library:
I am a Snabb hacker and I see things differently. Ethernet I/O is fundamentally a simple problem, DPDK is taking the industry in the wrong direction, and application developers should fight back.
Ethernet I/O is simple at heart. You have an array of pointer+length packets that you want to send, an array of pointer+length buffers where you want to receive, and some configuration like "hash across these 10 rings" or "pick a ring based on VLAN-ID." This should not be more work than, say, a JSON parser. (However, if you aren't vigilant you could easily make it as complex as a C++ parser.)
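To make the "array of pointer+length" point concrete, here is roughly what that interface boils down to. The struct layout and names below are illustrative, not any real NIC's descriptor format:

    /* Hypothetical sketch of the descriptor-ring model described above;
       field names are illustrative, not a real vendor layout. */
    #include <stdint.h>

    struct pkt_desc {
        uint64_t addr;   /* DMA address of the packet buffer        */
        uint16_t len;    /* bytes to send, or bytes received        */
        uint16_t flags;  /* e.g. a "descriptor done" bit set by NIC */
    };

    struct tx_ring {
        struct pkt_desc desc[1024]; /* descriptors shared with the NIC */
        uint32_t tail;              /* next slot software will fill    */
    };

    /* Post one packet: fill the next descriptor, then bump the tail
       register so the NIC knows new work is available. */
    static void tx_post(struct tx_ring *r, uint64_t dma_addr, uint16_t len,
                        volatile uint32_t *tail_reg)
    {
        struct pkt_desc *d = &r->desc[r->tail];
        d->addr  = dma_addr;
        d->len   = len;
        d->flags = 0;
        r->tail  = (r->tail + 1) % 1024;
        *tail_reg = r->tail;  /* MMIO write handing the slot to the NIC */
    }

Receive is the mirror image: pre-post empty buffers and read back the descriptors the NIC marks as done. The rest is configuration.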
DPDK has created a direct vector for hardware vendors to ship code into applications. Hardware vendors have specific interests: they want to differentiate themselves with complicated features, they want to get their product out the door quickly even if that means throwing bodies at a complicated implementation, and they want to optimize for the narrow cases that will look good on their marketing literature. They are happy for their complicated proprietary interfaces to propagate throughout the software ecosystem. They also focus their support on their big customers via account teams and aren't really bothered about independent developers or people on non-mainstream platforms.
Case in point: We want to run Snabb on Mellanox NICs. If we adopt the vendor ecosystem then we are buying into four (!) large software ecosystems: the Linux kernel (mlx5 driver), Mellanox OFED (control plane), DPDK (data plane built on OFED + kernel), and the Mellanox firmware tools (mostly non-open-source, strangely licensed, distributed as binaries that only work on a few distros). In practice it will be our problem to make sure these all play nice together, and that will be a challenge, e.g. in a container environment where we don't have control over which kernel is used. We also have to accept the engineering trade-offs that the vendor engineering team has made, which in this case seem to include special optimizations to game benchmarks [1].
I say forget that for a joke.
Instead we have done a bunch more work up front: first to successfully lobby the vendor to release their driver API [2], and then to write a stand-alone driver of our own [3] that does not depend on anything else (kernel, OFED, DPDK, etc.). This is around 1 KLOC of Lua code when all is said and done.
I would love to hear from other people who want to join the ranks of self-sufficient application developers. Honestly our ConnectX driver has been a lot of work but it should be much easier for the next guy/gal to build on our experience. If you needed a JSON parser you would not look for a 100 KLOC implementation full of weird vendor extensions, so why do that for an ethernet driver?
> DPDK is taking the industry in the wrong direction, and application developers should fight back.
DPDK is doing the exact same work you did: making hardware vendors release their driver APIs and abstracting them away so that application developers can stay independent of them.
You "successfully lobbyied" for one API to be released. Now do that for any number of hardware, NICs versions, and in the end you will have to release a generic API, which is effectively a new DPDK.
Completely independent applications will only go so far. You are left with vendor lock-in, with a very high upfront cost if you ever need to evolve your hardware.
I understand your perspective. If you are satisfied with using a vendor-provided software stack to interface with hardware then you are well catered for by DPDK and do not have to care what is under the hood.
I feel that the hardware-software interface is fundamental and that vendors should not control the software. I see an analogy to CPUs. I am really happy that CPU vendors document their instruction sets and support independent compiler developers. I would be disappointed if they started keeping their instruction sets confidential, available only under NDA, and told everybody to just use their LLVM backend without understanding it.
That is effectively the case. See for example DDIO with Intel, which can only be enabled for specific devices with full cooperation between Intel and that particular vendor.
You cannot compete with a DDIO-enabled device, which of course all Intel devices are.
See also the Intel multi-buffer crypto library, which was specialized and tuned for Intel CPUs. No one else could write code at that level of optimization, because we do not have the internal design documents and simulator that Intel works with.
So yeah, you are talking to sophisticated hardware which will have firmware blobs and undocumented features. If you rely only on the general instruction sets you will only get so far. When we are talking about nanoseconds of latency at these levels of bandwidth, those details are what separate one stack from another.
The push for smart NICs will increasingly blur the line between the software and hardware layers. We can either direct our efforts toward one shared abstraction layer on top of them, or end up rewriting one for each vendor-specific API (OFED is but one example; there will be others).
I disagree with this characterization of DDIO but I don't think a Hacker News comment thread is the best venue for such low-level discussions. Hope to chat with you about it in some more suitable forum some time :) that would be fun.
My understanding is that DDIO is an internal feature of the processor and works transparently with all PCI devices. Basically Intel extended the processor "uncore" to serve PCIe DMA requests via the L3 cache rather than directly to memory.
I think you're confusing DDIO with DCA. DDIO is Intel's mechanism of allocating L3 cache ways to DMA, and works for any vendor's card. DCA is an older set of steering hints that cause per-TLP steering hint flags to influence whether or not a DMA write ends up in the CPU cache. DCA is highly targeted, and much more effective in realistic workloads because you can be smart, and cache just descriptors and packet header DMA writes (eg, metadata). With DDIO, you end up caching everything, and with a limited number of cache ways, you end up often caching nothing, because later DMAs push earlier ones out of cache before the host can use the data.
At a previous employer, we figured out the DCA steering hints and implemented them in our NIC. Thankfully enough of our PCIe implementation was programmable to allow us to do this.
Depending on what you are doing, with latencies (or throughput) in that range, sticking a black-box library in there right away might not always be the best idea. Doing what the author did is also a way to learn how things work. Eventually the library might be the answer, but if I had to do what they did, I would do it by hand first as well.
It also supports even lower-level drivers for a bunch of cards (some VM-virtualised, such as the Intel em), as well as AF_PACKET, oh, and pcap.
Lots of the low-latency options require driver and hardware cooperation: busy polling, BQL, essentially all of the ethtool options, even IRQ affinity.
Intel has been a driving force behind many kernel networking improvements but they naturally don't care for other manufacturers, so they implement a little bit of kernel infrastructure and put the rest into their drivers.
There's still not a lot of oomph left over to do anything with the traffic...or is that not the point of the exercise? You're not going to be comparing it, or writing it to disk at these levels.
This traffic is a bit above the levels I've dealt with, but I've seen cloud-datacenter levels of traffic that, as far as I know, you can't practically log/monitor/IPS/SIEM... or am I misinformed?
> you can't practically log/monitor/IPS/SIEM...or am I misinformed?
It depends on the hardware you're using (specifically the router), but using netflow / sflow / ipfix [0] you can get pretty high visibility even for high-bandwidth networks. This only gets you "metadata" and not a full packet capture - but for monitoring and the like, the metadata can be far more useful.
I'm not entirely sure what level of traffic you're talking about, but I know it's possible with the right hardware to use netflow with 100GbE links without having to sample (i.e. recording flows for every packet, not 1 in every n packets).
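For a sense of what that metadata contains, a flow record is roughly the following; the exact fields depend on the netflow/ipfix template in use, so treat this as an illustration rather than any particular export format:

    /* Rough sketch of the per-flow "metadata" a netflow/ipfix exporter
       keeps instead of full packet payloads; fields are illustrative,
       loosely modelled on common NetFlow v5 fields. */
    #include <stdint.h>

    struct flow_record {
        uint32_t src_ip, dst_ip;        /* IPv4 addresses          */
        uint16_t src_port, dst_port;    /* L4 ports                */
        uint8_t  protocol;              /* TCP, UDP, ...           */
        uint8_t  tcp_flags;             /* OR of TCP flags seen    */
        uint64_t packets, bytes;        /* counters for the flow   */
        uint64_t first_seen, last_seen; /* timestamps              */
    };

A few dozen bytes per flow instead of every payload byte is what makes unsampled monitoring of fast links feasible.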
Not necessarily. One can capture 10G to disk using "only" a RAID-0 with 8-10 mechanical disks: it does the job both in bandwidth and space, and you can use regular filesystems such as XFS.
40G is a little bit more difficult: you need a huge RAID (simple, direct scaling: 32-40 disks) with mechanical disks to achieve the necessary bandwidth, and if you want to use SSDs you will need a lot of them too in order to have enough space to save any meaningful amount of traffic.
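A quick sanity check on those disk counts, assuming roughly 150 MB/s of sustained sequential write per mechanical disk (that per-disk figure is an assumption; real disks vary):

    /* Back-of-the-envelope check of the disk counts above, assuming
       ~150 MB/s sustained sequential write per mechanical disk. */
    #include <stdio.h>

    int main(void)
    {
        double disk_mb_s = 150.0;             /* assumed per-disk write rate */
        double rates_gbps[] = { 10.0, 40.0 };

        for (int i = 0; i < 2; i++) {
            double mb_s = rates_gbps[i] * 1000.0 / 8.0;  /* line rate, MB/s */
            printf("%2.0fG link: %4.0f MB/s -> ~%2.0f disks in RAID-0\n",
                   rates_gbps[i], mb_s, mb_s / disk_mb_s);
        }
        return 0;
    }

That gives roughly 8-9 disks for 10G and ~33 for 40G, in line with the ranges above.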
I remember seeing papers on on-the-fly compression for network traffic, but IIRC the results were not very impressive and the performance cost was noticeable.
> but how do you then get that many packets to disk
it may be possible to do disk i/o at that high rate, e.g. with pci-e or a dedicated appliance for dumping the entire stream, but you would run out of storage pretty fast.
for example, a quick back-of-the-envelope calculation, where you dump the packet stream from 4x10gbps cards with minimal 84-byte frames (on ethernet), shows that you would exhaust the storage in approx. 4.5 minutes :)
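Working through that number, and assuming about 1 TB of storage (the comment above doesn't say how much) and that only the 64-byte frame, not the preamble and inter-frame gap, is written to disk:

    /* Back-of-the-envelope: how long until ~1 TB of storage is full when
       capturing minimal-size frames from 4x10G links. The 1 TB figure is
       an assumption for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double line_rate_bps = 4 * 10e9;    /* 4x10G links               */
        double wire_bytes    = 84.0;        /* minimal frame on the wire */
        double frame_bytes   = 64.0;        /* bytes actually captured   */
        double storage_bytes = 1e12;        /* assumed: ~1 TB            */

        double capture_Bps = line_rate_bps / 8.0 * (frame_bytes / wire_bytes);
        double seconds     = storage_bytes / capture_Bps;

        printf("capture rate %.2f GB/s, storage full in %.1f minutes\n",
               capture_Bps / 1e9, seconds / 60.0);  /* ~3.8 GB/s, ~4.4 min */
        return 0;
    }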
> While I don't know the exact overhead of 10GigE, there is likely still some overhead.
on 10gige pipes, at max ethernet mtu (1500 bytes) etc, approx. 94% of the bandwidth is available for user data (accounting for things like the inter-frame gap, crc checksums etc). with jumbo frames that number goes to 99%.
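For reference, those percentages fall out of the per-frame overhead; the sketch below assumes TCP over IPv4 with no options, so 40 bytes of L3/L4 headers per frame are counted as overhead along with the Ethernet framing:

    /* Where the ~94% / ~99% figures come from: Ethernet framing overhead
       (preamble, SFD, inter-frame gap, MAC header, CRC) plus IP and TCP
       headers, per frame. Assumes TCP/IPv4 without options. */
    #include <stdio.h>

    static double efficiency(double mtu)
    {
        double framing   = 7 + 1 + 12 + 14 + 4; /* preamble, SFD, IFG,
                                                   MAC header, CRC     */
        double user_data = mtu - 20 - 20;       /* minus IP + TCP hdrs */
        return user_data / (mtu + framing);
    }

    int main(void)
    {
        printf("MTU 1500: %.1f%%\n", efficiency(1500) * 100); /* ~94.9% */
        printf("MTU 9000: %.1f%%\n", efficiency(9000) * 100); /* ~99.1% */
        return 0;
    }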
Of course you're going to try to avoid storing all traffic, but to decide what's interesting it at least has to be captured first. And on big sites, 40 Gb/s is still only an already random-sampled or pre-filtered subset of all traffic.
The inability to analyze traffic at this rate is a serious problem. How do you study it to see how protocols can be improved? A lab environment cannot compare to real world traffic. How do you detect attacks (not DoS!) if it's hidden in a link operating at this capacity?
Even if you can capture the traffic at wire speed, the CPU doesn't have the power to analyse the stream. I thought traffic analysers had to be built with FPGAs/ASICs for that reason.
My manager did his thesis on this. Endace NICs split the traffic up and send it to a cluster of IDS servers. That allows you to actually do line-rate analysis. No need for an FPGA/ASIC.
When monitoring a network, and faced with choices of where to tap it, a tap which captures a wider view can be advantageous. For instance, capturing at a WAN link can provide a better view of attackers.
Sorry, I haven't found the actual slides yet; that's why it's a photo from someone who took it while attending the talk.