Wednesday, December 7, 2011

Requiem for Jumbo Frames

This weekend Greg Ferro published an article about jumbo frames, pointing to recent measurements that show no real benefit from large frames. Some years ago I worked on NIC designs, and at the time we talked about jumbo frames a lot. It was always a tradeoff: improve performance by sacrificing compatibility, or live with the performance until hardware designs could make the 1500 byte MTU as efficient as jumbo frames. The latter school of thought won out, and hardware designers delivered on it. Jumbo frames no longer offer a significant performance advantage.

Roughly speaking, software overhead for a networking protocol stack can be divided into two chunks (a toy cost model follows the list):

  • Per-byte overhead, which increases with each byte of data sent. Data copies, encryption, and checksums make up this kind of overhead.
  • Per-packet overhead, which is incurred for every packet regardless of its size. Interrupts, socket buffer manipulation, protocol control block lookups, and context switches are examples of this kind of overhead.
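
A toy cost model makes the split concrete. This is a sketch, not a measurement; the constants are invented for illustration and the real values depended entirely on the CPU, NIC, and OS of the era:

```c
#include <stdio.h>

/* Invented constants for illustration only. */
#define PER_PACKET_NSEC 10000.0  /* interrupt, PCB lookup, buffer handling */
#define PER_BYTE_NSEC      20.0  /* copies and checksum, per byte          */

/* Cost to send 'bytes' of data carved into 'mtu'-sized packets
 * (header overhead ignored to keep the model simple). */
static double send_cost_usec(double bytes, double mtu) {
    double packets = bytes / mtu;
    return (packets * PER_PACKET_NSEC + bytes * PER_BYTE_NSEC) / 1000.0;
}

int main(void) {
    double bytes = 1 << 20;  /* a 1 MByte transfer */
    printf("1500 MTU: %.0f usec\n", send_cost_usec(bytes, 1500.0));
    printf("9000 MTU: %.0f usec\n", send_cost_usec(bytes, 9000.0));
    return 0;
}
```

With per-byte costs that high, moving from a 1500 byte to a 9000 byte MTU only trims the smaller per-packet term. Drive the per-byte term toward zero, as the hardware improvements described below did, and the per-packet term becomes the whole story.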


Wayback Machine to 1992

I'm going to talk about the evolution of operating systems and NICs starting from the 1990s, focusing on Unix systems. DOS and MacOS 6.x were far more common back then, but modern operating systems evolved along lines much closer to Unix than to those environments.

[Figure: address spaces in user space, kernel, and NIC hardware]

Let's consider a typical processing path for sending a packet in a Unix system in the early 1990s:

  1. Application calls write(). System copies a chunk of data into the kernel, to mbufs/mblks/etc.
  2. Kernel buffers handed to TCP/IP stack, which looks up the protocol control block (PCB) for the socket.
  3. Stack calculates a TCP checksum and populates the TCP, IP, and Ethernet headers.
  4. Ethernet driver copies kernel buffers out to the hardware. Programmed I/O using the CPU to copy was quite common in 1992.
  5. Hardware interrupts when the transmission is complete, allowing the driver to send another packet.

Altogether the data was copied two and a half times: from user space to kernel, from kernel to NIC, plus a pass over the data to calculate the TCP checksum. There were additionally per packet overheads in looking up the PCB, populating headers, and handling interrupts.
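
As a concrete (and heavily simplified) sketch of that path, here is roughly what those steps amount to. All the names are invented; a real stack works on mbuf chains and device registers rather than flat arrays:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t kernel_buf[2048];  /* stand-in for an mbuf/mblk chain */
static uint8_t nic_fifo[2048];    /* stand-in for NIC packet memory  */

/* Ones-complement sum used by the TCP checksum. */
static uint16_t cksum(const uint8_t *p, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (i & 1) ? p[i] : (uint32_t)p[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

static void send_packet_circa_1992(const uint8_t *user_data, size_t len) {
    /* 1. write(): copy from user space into kernel buffers.  Copy #1. */
    memcpy(kernel_buf, user_data, len);

    /* 2-3. PCB lookup and header population elided; then a separate
     * pass over the payload for the TCP checksum.  The "half" copy. */
    uint16_t sum = cksum(kernel_buf, len);
    (void)sum;  /* would be written into the TCP header */

    /* 4. Programmed I/O: the CPU copies every byte to the NIC.  On
     * real hardware each store hit a device register.  Copy #2. */
    for (size_t i = 0; i < len; i++)
        nic_fifo[i] = kernel_buf[i];

    /* 5. The NIC interrupts on completion and the driver moves on
     * to the next packet (not modeled here). */
}

int main(void) {
    uint8_t payload[1460] = { 0 };
    send_packet_circa_1992(payload, sizeof payload);
    printf("one segment sent: two and a half passes over the data\n");
    return 0;
}
```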

The receive path was similar, with a NIC interrupt kicking off the processing of each packet and two and a half copies up to the receiving application. There was more per-packet overhead on receive: where transmit could look up the PCB once and process a sizable chunk of data from the application in one swoop, receive always handled one packet at a time.

Jumbo frames were a performance advantage in this timeframe, but not a huge one. Larger frames reduced the per-packet overhead, but the per-byte overheads were significant enough to dominate the performance numbers.



Wayback Machine to 1999

An early optimization was elimination of the separate pass over the data for the TCP checksum. It could be folded into one of the data copies, and NICs also quickly added hardware support. [Aside: the separate copy and checksum passes in 4.4BSD allowed years of academic papers to be written, putting whatever cruft they liked into the protocol, yet still portraying it as a performance improvement by incidentally folding the checksum into a copy.] NICs also evolved to be DMA devices; the memory subsystem still had to bear the overhead of the copy to hardware, but the CPU load was alleviated. Finally, operating systems got smarter about leaving gaps for headers when copying data into the kernel, eliminating a bunch of memory allocation overhead to hold the TCP/IP/Ethernet headers.
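
The folded copy-and-checksum looks something like this byte-wise sketch. Real kernels do it word-at-a-time in carefully tuned assembly, so treat this as an illustration of the idea rather than any actual kernel routine:

```c
#include <stddef.h>
#include <stdint.h>

/* One pass over the payload instead of a copy pass followed by a
 * separate checksum pass. */
uint16_t copy_and_cksum(uint8_t *dst, const uint8_t *src, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];                                  /* the copy */
        sum += (i & 1) ? src[i] : (uint32_t)src[i] << 8;  /* the checksum */
    }
    while (sum >> 16)          /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Each byte is touched once, while it is already in a register; the separate passes touched every byte twice.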

[Figure: packet size versus throughput in 2000; roughly 2.5x higher throughput for a 9180 byte MTU versus 1500]

I have data on packet size versus throughput in this timeframe, collected in the last months of 2000 for a presentation at LCN 2000. It used an OC-12 ATM interface, where LAN emulation allowed MTUs of up to 18 KBytes. I had to find an old system to run these tests on, because the modern systems of the time could almost max out the OC-12 link with 1500 byte packets; I recall it being a SPARCstation 20. The ATM NIC supported TCP checksums in hardware and used DMA.

1999 was roughly the peak of when jumbo frames would have been most beneficial. Considerable work had been done by that point to reduce per-byte overheads, eliminating the separate checksumming pass and offloading data movement from the CPU. Less work had been done to reduce the per-packet overhead. After 1999, additional hardware focused on reducing the per-packet overhead, and jumbo frames gradually became less of a win.



LSO/LRO

[Figure: protocol stack handing a chunk of data to the NIC]

Large Segment Offload (LSO), referred to as TCP Segmentation Offload (TSO) in Linux circles, is a technique where the protocol stack copies a large chunk of data from the application process and hands it as-is to the NIC. The stack generates a single set of Ethernet+TCP+IP headers to use as a template, and the NIC handles the details of incrementing the sequence number and calculating fresh checksums for the header prepended to each packet. Chunks of 32K and 64K are common, so the NIC transmits roughly 21 or 42 TCP segments without further intervention from the protocol stack.
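
A sketch of that segmentation step, with invented names. Real hardware also rewrites the IP length and ID fields, recomputes both checksums, and clears PSH/FIN on all but the last segment:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical template the stack hands the NIC along with the chunk. */
struct lso_template {
    uint32_t seq;  /* starting TCP sequence number */
    uint16_t mss;  /* maximum segment size, e.g. 1460 for a 1500 MTU */
};

static void lso_segment(const struct lso_template *t, uint32_t chunk_len) {
    for (uint32_t off = 0; off < chunk_len; ) {
        uint32_t seg = chunk_len - off;
        if (seg > t->mss)
            seg = t->mss;
        /* Emit one wire packet: the template headers with the sequence
         * number advanced, fresh checksums, then 'seg' payload bytes. */
        printf("segment seq=%u len=%u\n", t->seq + off, seg);
        off += seg;
    }
}

int main(void) {
    struct lso_template t = { .seq = 1000, .mss = 1460 };
    lso_segment(&t, 65536);  /* one 64K chunk becomes ~45 wire packets */
    return 0;
}
```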

The interesting thing about LSO is that jumbo frames no longer make a difference. The CPU only gets involved once per large chunk of data; the overhead is the same whether that chunk turns into 1500 byte or 9000 byte packets on the wire. The main impact of the frame size is the number of ACKs coming back, as most TCP implementations generate an ACK for every other frame. Transmitting jumbo frames would reduce the number of ACKs, but that kind of overhead is below the noise floor. We just don't care.

There is a similar technique for received packets called, imaginatively enough, Large Receive Offload (LRO). With LSO, the NIC and protocol software control when data is sent; with LRO, packets simply arrive whenever they arrive, and the NIC has to gather the packets of each flow to hand up in a chunk. It's quite a bit more complex, and doesn't tend to work as well as LSO. As modern web application servers tend to send far more data than they receive, LSO has been of much greater importance than LRO.

Large Segment Offload mostly removed the justification for jumbo frames. Nonetheless support for larger frame sizes is almost universal in modern networking gear, and customers who were already using jumbo frames have generally carried on using them. Moderately larger frame support is also helpful for carriers who want to encapsulate customer traffic within their own headers. I expect hardware designs to continue to accommodate it.



TCP Calcification

There has been a big downside to the pervasive use of LSO: it has become the immune response preventing changes in protocols. NIC designs vary widely in their implementation of the technique, and some of them are very rigid. Here "rigid" is a euphemism for "mostly crap." There are NICs which hard-code how to handle the protocols as they existed in the early part of this century: Ethernet header, optional VLAN header, IPv4/IPv6, TCP. Add any new option, or any new header, and some portion of existing NICs will not cope with it. Making changes to existing protocols or adding new headers is vastly harder now, as any change is likely to throw traffic back into the slow zone and render moot whatever benefit the change brings.
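
As an illustration of that rigidity, here is a hypothetical sketch of the fixed header walk such an offload engine might implement; it is not any particular vendor's logic:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_VLAN 0x8100
#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_IPV6 0x86dd

/* Can this frame take the offload fast path?  A rigid engine
 * recognizes exactly one shape of packet and punts on all else. */
static bool can_offload(const uint8_t *f, size_t len) {
    if (len < 54) return false;          /* minimal Eth+IPv4+TCP */
    size_t off = 12;                     /* skip dst/src MACs */
    uint16_t etype = (f[off] << 8) | f[off + 1];
    if (etype == ETHERTYPE_VLAN) {       /* exactly one VLAN tag */
        off += 4;
        etype = (f[off] << 8) | f[off + 1];
    }
    off += 2;                            /* now at the IP header */
    if (etype == ETHERTYPE_IPV4) {
        if ((f[off] >> 4) != 4) return false;
        if ((f[off] & 0x0f) != 5) return false;  /* any IP option: punt */
        return f[off + 9] == 6;                  /* protocol must be TCP */
    }
    if (etype == ETHERTYPE_IPV6) {
        /* Next Header must be TCP directly; any extension header
         * knocks the frame off the fast path. */
        return f[off + 6] == 6;
    }
    return false;                        /* unknown EtherType: slow path */
}
```

Every early return is a place where a new option or encapsulation silently lands traffic back in software.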

It used to be that any new TCP extension had to be carefully negotiated between sender and receiver in the SYN/SYN+ACK exchange to make sure both sides supported the option. Nowadays, due to LSO and to the pervasive use of middleboxes, we basically cannot add options to TCP at all.

I guess the moral is, "be careful what you wish for."