This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.
I strongly recommend reading the first article before this one, to provide background.
In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.
The UDP header serves an interesting purpose, it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in how it computes the hash: the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.
Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.
New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing, it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN the firewall question becomes more interesting. I don't think its a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.
The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.
When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.
VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. Its possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.
Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.
On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.
- If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low numbered port number with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, as this makes VXLAN packets look like ephemeral ports used by UDP client applications.
- When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already had to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.
Next article: A few final VXLAN topics.
footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.