Friday, September 16, 2011

The Care and Feeding of VXLAN

This is the first of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.


Modern datacenter networks are designed around the requirements of virtualization. There are a number of physical servers connected to the network, each hosting a larger number of virtual servers. The Gentle Reader may be familiar with VMware on the desktop, using NAT to let VMs communicate with the Internet. Datacenter VMs don't work that way. Each VM has its own IP and MAC address, with software beneath the VMs functioning as a switch. On the Ethernet we see a huge number of MAC addresses, from all of the VMs throughout the facility.

To rebalance server load it is necessary to support moving a virtual machine from a heavily loaded server to another more lightly loaded one. It is essentially impossible to change the IP address of a VM as part of this move. Though modern server OSes can change IP address without rebooting, doing so interrupts the service as client connections are closed and caches spread far and wide have to time out. To be useful, VM moves have to be mostly invisible and not interrupt services.

Datacenter network: moving a VM from one physical server to another.

The most straightforward way to accomplish this is to have the servers sit in a single subnet and single broadcast domain, which leads to some truly enormous L2 networks. This is putting pressure on switch designs to support tens of thousands of MAC addresses (thankfully via hash engines, rather than CAMs). Everything we'd learned about networking to this point drove towards Layer 3 networks for scalability, but it is all being rewritten for virtualized datacenters.
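
To make the lookup concrete, here is a minimal sketch, in Python with invented names, of the hash-based MAC table a switch maintains. A real switch does this in silicon, but the data structure is the same idea.

    mac_table = {}  # destination MAC -> egress port; a hash table, not a CAM

    def learn(src_mac, ingress_port):
        # Learn the source address of every frame we see.
        mac_table[src_mac] = ingress_port

    def forward(dst_mac):
        # Known unicast goes out a single port; unknown unicast floods the
        # entire broadcast domain, which is why these tables must hold
        # every VM MAC in the facility.
        return mac_table.get(dst_mac, "flood")

    learn("02:00:00:00:01:01", 7)
    print(forward("02:00:00:00:01:01"))  # 7
    print(forward("02:00:00:00:99:99"))  # flood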



Enter VXLAN

At VMworld several weeks ago there was an announcement of VXLAN, Virtual eXtensible LANs. I paid little attention at the time, but should have paid more. VXLAN is an encapsulation scheme to carry L2 frames atop L3 networks. Described like that it doesn't sound very interesting, but the details are well thought out. Additionally, the draft RFC is authored by representatives from major virtualization vendors VMware, Citrix, and Red Hat, and from datacenter switch vendors Cisco and Arista. It will appear in real networks relatively quickly.

The bulk of the VXLAN implementation is handled via a tunneling endpoint. This will generally reside within the server, in software running under the VMs. A new component called the VXLAN Tunnel End Point (VTEP) encapsulates frames inside an L3 tunnel. There can be 2^24 (about 16 million) VXLANs, identified by a 24-bit VXLAN Network Identifier (VNI). The VTEP maintains a table of known destination MAC addresses, and for each stores the IP address of the remote VTEP to tunnel to. Unicast frames between VMs are sent directly to the unicast L3 address of the remote VTEP. Multicast and broadcast frames from VMs are sent to a multicast IP group associated with the VNI. The spec is vague on how VNIs map to a multicast IP address, merely saying that a management plane configures the mapping along with VM membership in the VXLAN. Multicast distribution in most networks is something of an afterthought; making the address configurable allows VXLAN to cope with whatever facilities exist.

Overlay network connecting VMs through an L3 switch.
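
To make the VTEP's decision concrete, here is a minimal sketch of its forwarding state in Python. The VNI, addresses, and names are all invented, and the VNI-to-group mapping is shown as plain configuration because the spec leaves it to the management plane.

    vni_to_group = {5000: "239.1.1.5"}    # configured, not derived from the VNI
    mac_to_vtep = {(5000, "02:00:00:00:01:01"): "10.0.0.2"}

    def outer_destination(vni, inner_dst_mac):
        first_octet = int(inner_dst_mac.split(":")[0], 16)
        if first_octet & 1:
            # Group bit set: broadcast or multicast, send to the VNI's group.
            return vni_to_group[vni]
        # Known unicast goes straight to the remote VTEP's unicast address;
        # unknown unicast floods via the multicast group as well.
        return mac_to_vtep.get((vni, inner_dst_mac), vni_to_group[vni])

    print(outer_destination(5000, "02:00:00:00:01:01"))  # 10.0.0.2
    print(outer_destination(5000, "ff:ff:ff:ff:ff:ff"))  # 239.1.1.5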


A Maze of Twisty Little Passages

VXLAN encapsulates L2 packets from the VMs within an Outer IP header to send across an IP network. The receiving VTEP decapsulates the packet, and consults the Inner headers to figure out how to deliver it to its destination.

Outer MAC | Outer IP | Outer UDP | VXLAN | Inner MAC | Inner Payload | Outer FCS
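
The VXLAN header itself is eight bytes: a flags octet in which only the VNI-valid bit (0x08) is defined, 24 reserved bits, the 24-bit VNI, and a final reserved byte. Here is a minimal sketch in Python of building the UDP payload; the function name is invented, and 4789 is the IANA-assigned port, though implementations have used others.

    import struct

    VXLAN_UDP_PORT = 4789  # IANA-assigned; early implementations used other ports

    def vxlan_encapsulate(vni, inner_frame):
        # Flags octet (only 0x08, VNI valid, is defined) plus 3 reserved bytes.
        header = struct.pack("!B3x", 0x08)
        # 24-bit VNI in the top three bytes, final byte reserved.
        header += struct.pack("!I", (vni & 0xFFFFFF) << 8)
        # The result is the UDP payload; the outer MAC, IP, and UDP headers
        # are added by the sending host's stack. The inner frame starts at
        # its MAC header and carries no FCS.
        return header + inner_frame

    payload = vxlan_encapsulate(5000, b"inner ethernet frame bytes")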

The encapsulated packet retains its Inner MAC header and optional Inner VLAN, but has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC; one is added by the physical NIC when the packet leaves the server. As the VTEP is a software component within the server, running before any NIC is involved, the frames have no CRC when the VTEP gets them. Therefore there is no end-to-end integrity protection from the originating VM to the receiving VM. This is another case where, even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.
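
Decapsulation makes the gap easy to see. A sketch of the receiving side, with the same invented names: the VTEP strips the eight-byte header and hands the inner frame to the destination VM, and there is no inner FCS for it to verify.

    import struct

    def vxlan_decapsulate(udp_payload):
        flags = udp_payload[0]
        assert flags & 0x08, "VNI-valid flag must be set"
        vni = struct.unpack("!I", udp_payload[4:8])[0] >> 8
        # Everything past the 8-byte header is the inner frame, starting at
        # its MAC header. Nothing here can detect corruption that happened
        # inside either server, before encapsulation or after this point.
        return vni, udp_payload[8:]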

Next article: UDP


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.