Imagine, he said, that you are running an Internet service provider (ISP) in the year 2000 — the distant past. To set up your business you need to arrange for bandwidth, core routers, and some DSL hardware, and the job is done. This is an idealized picture, he acknowledged, but the fact remains that the core business in that era was simply providing access to the Internet.
Fast forward to 2005, and an aspiring ISP must do all of the above, plus it must buy some boxes to provide voice-over-IP service. By 2010, that ISP must also handle television, video on demand, and protection against denial-of-service attacks, while coping with an increasingly constrained IPv4 address space. But, over this period, during which the job of running an ISP has gotten harder, the basic subscriber fee has remained about the same.
That is the trend he has been seeing in the ISP area: providers have to do more with the same budget. "Doing more" means putting a bunch of expensive boxes in their racks; each function requires a specialized box with a high price. This problem doesn't just apply to ISPs; it pops up in many other networking environments. It seems like there should be a better way to handle this problem.
While all this was happening, he said, there was another trend: commodity hardware caught up with the fancy networking boxes. It's possible to buy a dual-socket, Xeon-based server with twelve cores or more per socket; this machine can then be equipped with many high-speed PCIe network interfaces. The result is hardware that can handle data rates of up to 200Gb/second — if each core/interface pair can handle up to 15 million packets per second. That gives a processing-time budget of about 70ns per packet.
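The 70ns figure follows directly from the packet rate. As a rough check, the sketch below computes the per-packet time budget for a given packet rate and the number of CPU cycles that budget buys. The function names are made up for this illustration, and the clock speed passed to cycles_per_packet() is whatever the reader assumes; neither comes from the talk. The line-rate helper uses the standard Ethernet per-frame overhead of 20 bytes (preamble, start-of-frame delimiter, and inter-frame gap).

```c
#include <assert.h>

/* Time available to process one packet, in nanoseconds, at a given
 * packet rate (packets per second). */
double per_packet_budget_ns(double packets_per_second)
{
    return 1e9 / packets_per_second;
}

/* CPU cycles available per packet; clock_ghz is an assumed clock speed,
 * not a number from the talk. */
double cycles_per_packet(double packets_per_second, double clock_ghz)
{
    return per_packet_budget_ns(packets_per_second) * clock_ghz;
}

/* Maximum packet rate on an Ethernet link for a given frame size.
 * Each frame costs an extra 20 bytes on the wire: 8 bytes of preamble
 * and start-of-frame delimiter plus a 12-byte inter-frame gap. */
double line_rate_pps(double link_bits_per_second, double frame_bytes)
{
    return link_bits_per_second / ((frame_bytes + 20.0) * 8.0);
}
```

At 15 million packets per second, per_packet_budget_ns() gives just under 67ns, which the talk rounds to 70ns. A 10Gb/second link carrying minimum-size (64-byte) frames works out to about 14.88 million packets per second, which is where rates in the 10-15 million range come from.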
What is the software on such a system going to look like? The conventional wisdom is that Linux is taking over, so one would expect an ISP's racks to be full of Linux servers. That turns out to be true, but not in the way that one might expect. The Linux kernel is not ready to handle rates of 10-15 million packets per second; the networking stack is too heavy. Despite its weight, the networking stack does not normally do everything that is needed, so there must be a user-space application running as well. The split between kernel and user space adds another barrier and slows things down further.
User-space networking
The way to actually reach the desired level of performance, he said, is to remove the kernel from the picture and do the entire networking job in user space. A simple user-space program can map the interface's control registers into its address space, set up a ring buffer for transmission and reception of packets, and do whatever simple processing is required. At the end you have a user-space network driver. There are a number of toolkits to help with writing this kind of driver. One of them is Snabb, which was started in 2012. Others include DPDK, also started in 2012, and VPP, which got going in 2016.
Network operators have been trying to regain some control over the systems they have to buy to provide services. For example, Deutsche Telekom's TeraStream architecture is intended to move network functions into software rather than keeping those functions in separate physical machines. Instead of buying a box to provide a certain function, they want to buy a virtual machine that can be installed on a commodity server.
These functions can be implemented with a system like Snabb. The idea behind Snabb, he said, is "rewritable software" — as in "I could rewrite that in a weekend". The hard part is finding elegant hacks; the implementation should then be easy.
A Snabb program consists of a graph of apps, connected by directional links. The basic processing cycle is called a "breath"; during each breath, a batch of packets will be processed. The whole thing is written in the Lua language. To illustrate how it works, he put up a slide with a simple program:
    -- Load the two app classes used below.
    local Intel82599 = require("apps.intel.intel_app").Intel82599
    local PcapFilter = require("apps.packet_filter.pcap_filter").PcapFilter

    -- Build a configuration containing two apps.
    local c = config.new()
    config.app(c, "nic", Intel82599, {pciaddr="82:00.0"})
    config.app(c, "filter", PcapFilter, {filter="tcp port 80"})

    -- Connect the apps with two directional links.
    config.link(c, "nic.tx -> filter.input")
    config.link(c, "filter.output -> nic.rx")

    engine.configure(c)

    while true do engine.breathe() end

This program starts by importing two modules that will implement the apps; the Intel82599 module drives the network interface, while PcapFilter allows the expression of packet filters using the tcpdump language. Two apps are instantiated with configurations telling them what to do; the nic app is given the PCI address of the interface, and filter is told to accept packets addressed to TCP port 80. The two links route packets from the interface, through the filter, and back out the interface again.
The final line actually runs this program. Each breath consists of two phases. In the first phase, each app "inhales" the packets that are available to it; that is driven by a call to each app's pull() function. A pull() function will typically bring in a maximum of 100 packets in a single invocation. In the second phase, each app processes those packets and pushes them into its outbound link when directed to do so by a call to its push() function.
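The two-phase structure of a breath can be sketched in a few lines. What follows is not Snabb's engine, which is written in Lua; it is a minimal C illustration in which packets are reduced to counters. The struct app layout, the field names, and the helper functions are all hypothetical; only the pull-then-push ordering and the 100-packet batch limit are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define PULL_BATCH 100  /* per-call pull limit described in the text */

/* A hypothetical app: two optional callbacks plus packet counters. */
struct app {
    void (*pull)(struct app *self);  /* bring packets in; may be NULL */
    void (*push)(struct app *self);  /* move packets onward; may be NULL */
    int source_backlog;  /* packets waiting at an imaginary source */
    int inbox;           /* packets inhaled but not yet pushed */
    int sent;            /* packets pushed so far */
};

/* Inhale at most PULL_BATCH packets from the imaginary source. */
void source_pull(struct app *self)
{
    int n = self->source_backlog < PULL_BATCH ? self->source_backlog
                                              : PULL_BATCH;
    self->source_backlog -= n;
    self->inbox += n;
}

/* Push everything that was inhaled during this breath. */
void forward_push(struct app *self)
{
    self->sent += self->inbox;
    self->inbox = 0;
}

/* One breath: every app pulls first, then every app pushes. */
void breathe(struct app **apps, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (apps[i]->pull)
            apps[i]->pull(apps[i]);
    for (size_t i = 0; i < count; i++)
        if (apps[i]->push)
            apps[i]->push(apps[i]);
}
```

Because each pull is capped, a backlog of 250 packets takes three breaths to drain in this sketch; batching of this kind bounds the work done per breath.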
The definition of a packet in Snabb might be surprising to people who have worked on networking in the kernel, he said; it looks like this:
    struct packet {
        uint16_t length;
        unsigned char data[10*1024];
    };
There is none of the overhead that accompanies the kernel's SKB structure. "This must be a relief", he said. No attempt is made to keep packets on the device; Snabb relies on the device transferring the packet and getting it into the L3 cache. That lets it avoid a lot of complexity around tracking packets in different locations. It does put some headroom at the beginning of the packet so that headers can be prepended without copying if needed. A link is not much more complicated; it is a simple circular buffer. And that, he said, is all there is.
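To show how little is involved, here is a minimal sketch of a link as a circular buffer of packet pointers. It is an illustration in C, not Snabb's actual link code; the LINK_SLOTS capacity, the field names, and the link_*() functions are assumptions made for this example. One slot is always left empty so that a full ring can be told apart from an empty one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct packet {
    uint16_t length;
    unsigned char data[10 * 1024];
};

#define LINK_SLOTS 1024  /* assumed capacity; must be a power of two */

struct link {
    struct packet *slots[LINK_SLOTS];
    unsigned read;   /* index of the next packet to receive */
    unsigned write;  /* index of the next free slot */
};

bool link_empty(const struct link *l)
{
    return l->read == l->write;
}

bool link_full(const struct link *l)
{
    return ((l->write + 1) & (LINK_SLOTS - 1)) == l->read;
}

/* Queue a packet on the link; returns false if the link is full. */
bool link_transmit(struct link *l, struct packet *p)
{
    assert(p != NULL);
    if (link_full(l))
        return false;
    l->slots[l->write] = p;
    l->write = (l->write + 1) & (LINK_SLOTS - 1);
    return true;
}

/* Take the oldest packet off the link; returns NULL if it is empty. */
struct packet *link_receive(struct link *l)
{
    if (link_empty(l))
        return NULL;
    struct packet *p = l->slots[l->read];
    l->read = (l->read + 1) & (LINK_SLOTS - 1);
    return p;
}
```

A zero-initialized struct link is an empty link. With a single producer and a single consumer, each index is only ever advanced by one side, which is part of what keeps a structure like this simple.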
Design principles
Snabb was built around a set of three simple design principles:
- Simple > Complex
- Small > Large
- Commodity > Proprietary
With regard to "simple", Snabb is built around the ability to compose network functions from small parts; the apps can be independently developed, and they all connect together easily with links. Snabb can be thought of as an implementation of the Unix pipeline metaphor. The simple packet and link data structures are also an expression of this design goal. One could make these structures more complicated in an attempt to optimize things, and it might even lead to better results on some benchmarks, but there would be a cost to pay in the ability to understand and change the system as a whole.
For small: the original Snabb implementation had a code budget. Snabb as a whole was meant to be less than 10,000 lines of code and build in less than one minute. These constraints, it was hoped, would lead to problems being solved in a creative way. They got a lot of help from their use of LuaJIT, which makes it easy to write code at a high level of abstraction that still performs well. The Snabb project also worked to minimize its dependencies, and those that are needed (such as LuaJIT itself) are included with the source and must fit within the build-time constraint.
To stay small, Snabb also avoids depending on big projects. Rather than use the DPDK drivers, Snabb's developers have written their own. The DPDK drivers have some appeal; there are a lot of them, and the project has a great deal of vendor participation. But Snabb wants to own the entire data plane, including the drivers, so that things can be changed at any point.
Snabb's drivers are typically less than 1,000 lines of code, much lighter than a typical, abstraction-heavy vendor driver. Writing the driver for the Intel 82599 was easy, since there is a good data sheet available. They refused to write drivers for the Mellanox ConnectX-4 interfaces until Mellanox provided an open data sheet — which Mellanox eventually did in response to customers wanting Snabb support.
The approach to drivers shows Snabb's adherence to the "commodity" principle. The project seeks simple drivers that are easily interchangeable; it is preferable to do work in the CPU rather than in the interface whenever possible. TCP checksum offloading comes up every couple of years on LWN, he said, but they don't bother with it; they write a simple checksum routine in Lua and move on. When their offload features are unused, network interfaces become commodities.
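A software checksum routine of the kind described really is short. The sketch below is a C version of the standard Internet checksum (the ones'-complement sum specified in RFC 1071), shown only to illustrate the scale of the task; it is not Snabb's Lua routine, and the function name is made up here. A complete TCP checksum would also need to cover the pseudo-header, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): the ones'-complement of the
 * ones'-complement sum of the data, taken as big-endian 16-bit words. */
uint16_t internet_checksum(const unsigned char *data, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    /* Add up complete 16-bit words. */
    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len)
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the example bytes used in RFC 1071 (00 01 f2 03 f4 f5 f6 f7), this returns 0x220d; running the routine over data that already includes its correct checksum yields zero, which is how a receiver verifies a packet.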
Present and future
The project has gotten patches from 27 authors since 2012, and it has been deployed in "a dozen sites or so". Some of the biggest programs so far are an NFV virtual switch, an lwAFTR IPv6 transition router, and a virtual private network at SWITCH.ch. New work includes control-plane integration, support for running as a virtualized guest, better multi-process support, and more.
Igalia (Wingo's employer) developed the lwAFTR router mentioned above. This router is the central component of a "lightweight 4-over-6" transitional system. It can be thought of as a big network-address translation (NAT) box. If an ISP deploys a box like this, it will be carrying all of that ISP's IPv4 traffic, which is "a bit of stress". The goal was to carry 10Gb/second, using two interfaces.
They were able to reach the speed goal, partly because LuaJIT does a good job of making things fast. The "graph of apps" architecture plays well with the LuaJIT optimizer, since it causes a program to be composed of a number of small loops. LuaJIT's trace optimization helps to optimize those loops further. Using LuaJIT's FFI mechanisms to define data structures using C syntax provides exact control over how things are laid out, and the result is easily accessible from within Lua. It is important to avoid data-dependency chains, like those found in linked lists or hash tables; even a single cache miss will put a big dent in the processing-time budget.
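The point about data-dependency chains can be made concrete with a small comparison. In the sketch below (plain C with hypothetical names, not code from Snabb), both functions add up a set of packet lengths. In the list version, the address of each node is not known until the previous node has been loaded, so a cache miss stalls everything behind it. In the array version, every address can be computed from the index alone, so the hardware can fetch ahead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    uint32_t length;
    struct node *next;
};

/* Linked list: each load of n->next depends on the previous load. */
uint64_t sum_list(const struct node *n)
{
    uint64_t total = 0;
    for (; n != NULL; n = n->next)
        total += n->length;
    return total;
}

/* Contiguous array: addresses depend only on the loop index. */
uint64_t sum_array(const uint32_t *lengths, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += lengths[i];
    return total;
}
```

The two functions compute the same result; the difference is only in how the memory accesses depend on one another, and that is what matters when the whole budget is a few tens of nanoseconds.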
The project's latency goals were met by avoiding memory allocations (another thing that LuaJIT's optimizer helps with) and avoiding system calls whenever possible. Running on reserved CPUs eliminates preemption, which would otherwise be another source of latency.
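One common way to keep memory allocation off the fast path is to allocate all packet buffers once and recycle them through a free list. The sketch below shows that general technique in C; the pool size, the structure, and the pool_*() names are assumptions for the sake of illustration and are not taken from Snabb's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct packet {
    uint16_t length;
    unsigned char data[10 * 1024];
};

#define POOL_SIZE 256  /* hypothetical number of preallocated packets */

struct packet_pool {
    struct packet storage[POOL_SIZE];  /* all buffers, allocated up front */
    struct packet *free[POOL_SIZE];    /* stack of currently unused buffers */
    size_t nfree;
};

/* Put every buffer on the free stack; called once at startup. */
void pool_init(struct packet_pool *pool)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        pool->free[i] = &pool->storage[i];
    pool->nfree = POOL_SIZE;
}

/* Take a buffer without calling the allocator; NULL if none are left. */
struct packet *pool_get(struct packet_pool *pool)
{
    if (pool->nfree == 0)
        return NULL;
    struct packet *p = pool->free[--pool->nfree];
    p->length = 0;
    return p;
}

/* Return a buffer to the pool for reuse. */
void pool_put(struct packet_pool *pool, struct packet *p)
{
    assert(pool->nfree < POOL_SIZE);
    pool->free[pool->nfree++] = p;
}
```

Getting and returning a packet are each a couple of array operations, with no system call and no allocator lock involved. The pool itself is large (about 2.5MB at this size), so it belongs in static or heap storage rather than on the stack.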
He concluded by noting that scalability work is ongoing. 2017 is "the year of 100G in production" with Snabb. To get there, Snabb will need to support multiple processes servicing the same interface. The interface cards themselves will have to get a little better to hit that goal. There is work toward supporting horizontal scaling via the BGP and ECMP protocols. Many other projects are underway as well.
The video of this talk is available on YouTube.
[Your editor would like to thank linux.conf.au and the Linux Foundation for assisting with his travel to the event.]