This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.
I strongly recommend reading the first article before this one, to provide background.
UDP Encapsulation
In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.
The UDP header serves an interesting purpose: it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in which fields it can feed into that hash: only the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.
Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.
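To make the source port trick concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names are hypothetical, CRC32 over the inner Ethernet header is an arbitrary stand-in for whatever hash a real VTEP uses, and the UDP checksum is simply left at zero.

```python
import struct
import zlib


def vxlan_source_port(inner_frame: bytes) -> int:
    """Derive a 16-bit UDP source port from the inner Ethernet header."""
    # Hash the inner destination MAC, source MAC, and EtherType (14 bytes).
    # CRC32 is an assumption for this sketch, not a mandated algorithm.
    digest = zlib.crc32(inner_frame[:14])
    return digest & 0xFFFF


def outer_udp_header(inner_frame: bytes, dst_port: int, payload_len: int) -> bytes:
    """Pack the 8-byte outer UDP header: source, destination, length, checksum."""
    src_port = vxlan_source_port(inner_frame)
    length = 8 + payload_len  # the UDP length field counts the UDP header itself
    return struct.pack("!HHHH", src_port, dst_port, length, 0)
```

Frames of the same inner flow always produce the same source port and so stay on one link, while different flows spread across the available links, and the switch only ever looks at the outer headers.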

New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing; it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well-known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN, the firewall question becomes more interesting. I don't think it's a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.
VTEP Learning
The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.
When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.
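The learning behavior described above can be sketched as a small table. This is my own illustration rather than anything taken from the specification: the class and method names are hypothetical, and aging of entries is left out entirely.

```python
class VtepTable:
    """MAC-to-OuterIP table for a single VNI, populated by learning."""

    def __init__(self, multicast_group: str):
        self.multicast_group = multicast_group
        self.mac_to_vtep = {}

    def learn(self, inner_src_mac: str, outer_src_ip: str) -> None:
        # Called for every received encapsulated frame. A later frame from
        # a different VTEP overwrites the entry, which is how a MAC move
        # is picked up.
        self.mac_to_vtep[inner_src_mac] = outer_src_ip

    def outer_destination(self, inner_dst_mac: str) -> str:
        # Unknown destinations (and broadcasts, which are never learned as
        # a source) go to the multicast group for the VNI.
        return self.mac_to_vtep.get(inner_dst_mac, self.multicast_group)


table = VtepTable("239.1.1.1")
table.learn("00:00:5e:00:01:01", "10.0.0.1")  # VRRP MAC first seen behind one VTEP
table.learn("00:00:5e:00:01:01", "10.0.0.2")  # gratuitous ARP after failover
```

After the second `learn` call the table points at the new VTEP, which is exactly what the gratuitous ARP buys you.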
VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. It's possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.
Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.
Suggestions
On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.
- If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low-numbered port with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, and it makes VXLAN packets look like they come from the ephemeral ports used by UDP client applications.
- When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already had to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.
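Both suggestions are small enough to sketch in a few lines of Python. The function names and the set/dict arguments are hypothetical, chosen only to illustrate the idea.

```python
def high_range_source_port(inner_hash: int) -> int:
    """First suggestion: keep 15 bits of the hash, force the top bit on."""
    # The result always falls in 32768..65535, never on a low-numbered port.
    return 0x8000 | (inner_hash & 0x7FFF)


def should_flood(src_mac, dst_mac, local_macs, remote_table) -> bool:
    """Second suggestion: flood when a local source MAC is seen for the first time."""
    first_seen = src_mac not in local_macs
    local_macs.add(src_mac)
    # Flood if the source is new locally, even when the destination's VTEP
    # is already known; otherwise flood only for unknown destinations.
    return first_seen or dst_mac not in remote_table
```

A hash that would have landed on port 53 comes out as 32821 instead, and a freshly migrated VM announces itself to the multicast group with its first frame.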
Next article: A few final VXLAN topics.
footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Not quite two years ago in this space
For a number of years switch and router manufacturers competed on protocol support, implementing various extensions to OSPF/BGP/SpanningTree/etc in their software. QFabric is almost completely silent about protocols. In part this is a marketing philosophy: positioning the QFabric as a distributed switch instead of a network means that the protocols running within the fabric are an implementation detail, not something to talk about. I don't know what protocols are run between the nodes of the QFabric, but I'm sure it's not Spanning Tree and OSPF.
The flow control is for the whole link. For a switch with multiple downstream ports, there is no way to signal back to the sender that only some of the ports are congested. In this diagram, a single congested port requires the upstream to be flow controlled, even though the other port could accept more packets. Ethernet flow control suffers from head-of-line blocking: a single congested port will choke off traffic to other uncongested ports.
A common spec for switch silicon in the 1 Gbps generation is 48 x 1 Gbps ports plus 4 x 10 Gbps. Depending on the product requirements, the 10 Gbps ports can be used for server attachment or as uplinks to build into a larger switch. At first glance the chassis application appears to be somewhat oversubscribed, with 48 Gbps of downlink but only 40 Gbps of uplink. In reality, when used in a chassis the uplink ports will run at 12.5 Gbps to get 50 Gbps of uplink bandwidth.
QFabric consists of edge nodes wired to two or four extremely large
Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, the most troublesome of which are the consistency issues. There is a delay between when a packet is processed by the ingress line card and the fabric, and between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and it's very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.
Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the
Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the
Because I brought it up earlier, we'll conclude with a discussion of page coloring. I am not satisfied with the
Before fetching a value from memory the CPU consults its cache. The least significant bits of the desired address are an offset into the cache line, generally 4, 5, or 6 bits for a 16/32/64 byte cache line.
Separately, the CPU defines a page size for the virtual memory system. 4 and 8 Kilobytes are common. The least significant bits of the address are the offset within the page, 12 or 13 bits for 4 or 8 K respectively. The most significant bits are a page number, used by the CPU cache as a tag. The hardware fetches the tag of the selected cache lines to check against the upper bits of the desired address. If they match, it is a cache hit and no access to DRAM is needed.
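As a rough illustration of how those bit fields fall out of an address, here is a sketch in Python. The function name and the default sizes are my own choices, and it assumes both sizes are powers of two.

```python
def split_address(addr: int, line_size: int = 64, page_size: int = 4096) -> dict:
    """Split an address into cache line offset, page offset, and page number."""
    line_bits = line_size.bit_length() - 1  # 6 bits for a 64 byte cache line
    page_bits = page_size.bit_length() - 1  # 12 bits for a 4 K page
    return {
        "line_offset_bits": line_bits,
        "page_offset_bits": page_bits,
        "line_offset": addr & (line_size - 1),
        "page_offset": addr & (page_size - 1),
        "page_number": addr >> page_bits,
    }
```

For example, `split_address(0x12345678)` gives a page number of 0x12345 and a page offset of 0x678, with the low 6 bits of the address selecting the byte within the cache line.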
Consider a network using traditional L3 routing: you give each subscriber an IP address on their own IP subnet. You need to have a router address on the same subnet, and you need a broadcast address. Needing 3 IPs per subscriber means a /30. That's 4 IP addresses allocated per customer.
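The /30 arithmetic is easy to check with Python's standard `ipaddress` module. The helper name is hypothetical, and 192.0.2.0/30 in the example is just a prefix from the documentation range, not one from any real deployment.

```python
import ipaddress


def describe_subscriber_subnet(cidr: str) -> dict:
    """Break a subscriber subnet into its network, usable, and broadcast addresses."""
    net = ipaddress.ip_network(cidr)
    return {
        "network": str(net.network_address),
        "usable": [str(host) for host in net.hosts()],  # subscriber and router
        "broadcast": str(net.broadcast_address),
        "total": net.num_addresses,
    }
```

Calling `describe_subscriber_subnet("192.0.2.0/30")` reports two usable addresses, one for the subscriber and one for the router, plus the network and broadcast addresses, for four addresses consumed in total.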
My current project relies on a large number of