Coding Relic: September 2011

Monday, September 26, 2011

Peer Review Terminology

Many companies now include peer feedback as part of the annual review process. Each employee nominates several of their colleagues to write a review of their work from the previous year. If you are new to this process, here is some terminology which may be useful.

peerrihelion [peer-ree-heel-yun] (noun)	The point at which 50% of requested peer reviews are complete.
peersuasion [peer-swey-zhuhn] (noun)	A particularly glowing peer review.
peerjury [peer-juh-ree] (noun)	An astonishingly glowing peer review.
peerplexed [peer-plekst] (adjective)	What exactly did they work on, anyway?
peerrational [peer-rash-uh-nl] (adjective)	Why am I reviewing this person?
peergatory [peer-guh-tohr-ee] (noun)	The set of peer reviews which may have to be declined due to lack of time.
peerfectionist [peer-fek-shuh-nist] (noun)	I spent a long time obsessing over wording.
peeriodic [peer-ee-od-ik] (adjective)	Maybe I'll just copy some of what I wrote last year.
peerritation [peer-ree-tay-shun] (noun)	Unreasonable hostility felt toward the subject of the last peer review left to be written.
peersecute [peer-seh-kyoot] (verb)	How nice, everyone asked for my review.
peersona non grata [peer-soh-nah nohn grah-tah] (noun)	Nobody asked for my review?
peeregular [peer-reg-gyu-ler] (adjective)	An incomplete peer review, submitted anyway, just before the deadline.

Sunday, September 18, 2011

VXLAN Conclusion

This is the third and final article in a series about VXLAN. I recommend reading the first and second articles before this one. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

Foreign Gateways

Though I've consistently described VXLAN communications as occurring between VMs, many datacenters have a mix of virtual servers with single-instance physical servers. Something has to provide the VTEP function for all nodes on the network, but it doesn't have to be the server itself. A Gateway function can bridge to physical L2 networks, and with representatives of several switch companies as authors of the RFC this seems likely to materialize within the networking gear itself. The Gateway can also be provided by a server sitting within the same L2 domain as the servers it handles.

Gateway to communicate with other physical servers on an L2 segment.

Even if the datacenter consists entirely of VMs, a Gateway function is still needed in the switch. To communicate with the Internet (or anything else outside of their subnet) the VMs will ARP for their next hop router. This router has to have a VTEP.

Transition Strategy

Mixture of VTEP-enabled servers with non requires a gateway function somewhere I'm tempted to say there isn't a transition strategy. Thats a bit too harsh in that the Gateway function just described can serve as a proxy, but its not far from the mark. As described in the RFC, the VTEP assumes that all destination L2 addresses will be served by a remote VTEP somewhere. If the VTEP doesn't know the L3 address of the remote node to send to, it floods the packet to all VTEPs using multicast. There is no provision for direct L2 communication to nodes which have no VTEP. It is assumed that an existing installation of VMs on a VLAN will be taken out of service, and all nodes reconfigured to use VXLAN. VLANs can be converted individually, but there is no provision operation with a mixed set of VTEP-enabled and non-VTEP-enabled nodes on an existing VLAN.

For an existing datacenter which desires to avoid scheduling downtime for an entire VLAN, one transition strategy would use a VTEP Gateway as the first step. When the first server is upgraded to use VXLAN and have its own VTEP, all of its packets to other servers will go through this VTEP Gateway. As additional servers are upgraded they will begin communicating directly between VTEPs, and rely on the Gateway to maintain communication with the rest of their subnet.

Where would the Gateway function go? During the transition, which could be lengthy, the Gateway VTEP will be absolutely essential for operation. It shouldn't be a single point of failure, and this should trigger the network engineer's spidey sense about adding a new critical piece of infrastructure. It will need to be monitored, people will need to be trained in what to do if it fails, etc. Therefore it seems far more likely that customers will choose to upgrade their switches to include the VTEP Gateway function, so as not to add a new critical bit of infrastructure.

Controller to the Rescue?

Mixture of VTEP-enabled servers with non requires a gateway function somewhere What makes this transition strategy difficult to accept is that VMs have to be configured to be part of a VXLAN. They have to be assigned to a particular VNI, and that VNI has to be given an IP multicast address to use for flooding. Therefore something, somewhere knows the complete list of VMs which should be part of the VXLAN. In Rumsfeldian terms, there are only known unknown addresses and no unknown unknowns. That is, the VTEP can know the complete list of destination MAC addresses it is supposed to be able to reach via VXLAN. The only unknown is the L3 address of the remote VTEP. If the VTEP encounters a destination MAC address which it doesn't know about, it doesn't have to assume it is attached to a VTEP somewhere. It could know that some MAC addresses are reached directly, without VXLAN encapsulation.

The previous article in this series brought up the reliance on multicast for learning as an issue, and suggested that a VXLAN controller would be an important product to offer. That controller could also provide a better transition strategy, allowing VTEPs to know that some L2 addresses should be sent directly to the wire without a VXLAN tunnel. This doesn't make the controller part of the dataplane: it is only involved when stations are added or removed from the VXLAN. During normal forwarding, the controller is not involved.

It is safe to say that the transition strategy for existing, brownfield datacenter networks is the part of the VXLAN proposal which I like the least.

Other miscellaneous notes

VXLAN prepends 42 bytes of headers to the original packet. To avoid IP fragmentation the L3 network needs to handle a slightly larger frame size than standard Ethernet. Support for Jumbo frames is almost universal in networking gear at this point, this should not be an issue.

There is only a single multicast group per VNI. All broadcast and multicast frames in that VXLAN will be sent to that one IP multicast group and delivered to all VTEPs. The VTEP would likely run an IGMP Snooping function locally to determine whether to deliver multicast frames to its VMs. VXLAN as currently defined can't prune the delivery tree, all VTEPs must receive all frames. It would be nice to be able to prune delivery within the network, and not deliver to VTEPs which have no subscribing VMs. This would require multiple IP multicast groups per VNI, which would complicate the proposal.

Conclusion

I like the VXLAN proposal. I view the trend toward having enormous L2 networks in datacenters as disturbing, and see VXLAN as a way to give the VMs the network they want without tying it to the underlying physical infrastructure. It virtualizes the network to meet the needs of the virtual servers.

After beginning to publish these articles on VXLAN I became aware of another proposal, NVGRE. There appear to be some similarities, including the use of IP multicast to fan out L2 broadcast/multicast frames, and the two proposals even share an author in common. NVGRE uses GRE encapsulation instead of the UDP+VXLAN header, with multiple L2 addresses to provide load balancing across LACP/ECMP links. It will take a while to digest, but I expect to write some thoughts about NVGRE in the future.

Many thanks to Ken Duda, whose patient explanations of VXLAN on Google+ made this writeup possible.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Saturday, September 17, 2011

VXLAN Part Deux

This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

I strongly recommend reading the first article before this one, to provide background.

UDP Encapsulation

In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.

Four paths between routers, hashing headers chooses one. The UDP header serves an interesting purpose, it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in how it computes the hash: the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.

Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.

VTEP calculates hash of inner packet headers, places it in the UDP source port.

New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing, it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN the firewall question becomes more interesting. I don't think its a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.

VTEP Learning

VTEP Table of MAC:OuterIP mappings. The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.

When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.

VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. Its possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.

Multicast frame delivered to 3 VTEPs but dropped before reaching one. Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.

Suggestions

On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.

If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low numbered port number with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, as this makes VXLAN packets look like ephemeral ports used by UDP client applications.
When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already had to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.

Next article: A few final VXLAN topics.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Friday, September 16, 2011

The Care and Feeding of VXLAN

This is the first of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

Modern datacenter networks are designed around the requirements for virtualization. There are a number of physical servers connected to the network, each hosting a larger number of virtual servers. The Gentle Reader may be familiar with VMware on the desktop, with NAT to let VMs communicate with the Internet. Datacenter VMs don't work that way. Each VM has its own IP and MAC address, with software beneath the VMs to function as a switch. On the Ethernet we see a huge number of MAC addresses, from all of the VMs throughout the facility.

To rebalance server load it is necessary to support moving a virtual machine from a heavily loaded server to another more lightly loaded one. It is essentially impossible to change the IP address of a VM as part of this move. Though modern server OSes can change IP address without rebooting, doing so interrupts the service as client connections are closed and caches spread far and wide have to time out. To be useful, VM moves have to be mostly invisible and not interrupt services.

Datacenter networkmoving a VM from one physical server to another.

The most straightforward way to accomplish this is to have the servers sit in a single subnet and single broadcast domain, which leads to some truly enormous L2 networks. It is putting pressure on switch designs to support tens of thousands of MAC addresses (thankfully via hash engines, rather than CAMs). Everything we'd learned about networking to this point drove towards Layer3 networks for scalability, but it is all being rewritten for virtualized datacenters.

Enter VXLAN

At VMWorld several weeks ago there was an announcement of VXLAN, Virtual eXtensible LANs. I paid little attention at the time, but should have paid more. VXLAN is an encapsulation scheme to carry L2 frames atop L3 networks. Described like that it doesn't sound very interesting, but the details are well thought out. Additionally the RFC is authored by representatives from major virtualization vendors VMware, Citrix, and Red Hat, and by datacenter switch vendors Cisco and Arista. It will appear in real networks relatively quickly.

The bulk of the VXLAN implementation is handled via a tunneling endpoint. This will generally reside within the virtual server, in software running under the VMs. A new component called the VXLAN Tunnel End Point (VTEP) encapsulates frames inside an L3 tunnel. There can be 224 VXLANs, identified by a 24 bit VXLAN Network Identifier (VNI). The VTEP maintains a table of known destination MAC addresses, and stores the IP address of the tunnel to the remote VTEP to use for each. Unicast frames between VMs are sent directly to the unicast L3 address of the remote VTEP. Multicast and broadcast frames from VMs are sent to a multicast IP group associated with the VNI. The spec is vague on how VNIs map to a multicast IP address, merely saying that a management plane configures it along with VM membership in the VXLAN. Multicast distribution in most networks is something of an afterthought, making the address configurable allows VXLAN to cope with whatever facilities exist.

Overlay network connecting VMs through an L3 switch.

A Maze of Twisty Little Passages

VXLAN encapsulates L2 packets from the VMs within an Outer IP header to send across an IP network. The receiving VTEP decapsulates the packet, and consults the Inner headers to figure out how to deliver it to its destination.

The encapsulated packet retains its Inner MAC header and optional Inner VLAN, but has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC, one is added by a physical NIC when the packet leaves the server. As the VTEP is a software component within the server, prior to hitting any NIC, the frames have no CRC when the VTEP gets them. Therefore there is no integrity protection end to end, from the originating VM to receiving. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

Next article: UDP

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, September 4, 2011

Telling Strangers Where You Are

foursquare Ten One Hundred badge, for a thousand checkins. Not quite two years ago in this space I wrote about how I use foursquare. I've continued using the service since then, passing 1,000 checkins several months ago.

The first generation of location based services like foursquare have paid a lot of attention to privacy concerns. Explicit connection to other users is required in order to allow them to see your checkins. To do otherwise would have been perceived as creepy, the go-to label for vague privacy concerns. For those who do want to make their checkins public, Foursquare has an option to publish checkins to Twitter.

Yet social norms evolve, even in the span of just two years. Facebook Places and Google+ both offer checkins as a feature of their respective services. I've been periodically checking in on Google+ for several months. For routine trips I check in to a very limited circle of people, not so much out of concern about privacy as to not be spammy. For well-known venues I've been checking in publicly, and something fascinating happens: well-known venues are really well-known. Lots of people have been there, and they chime in with commentary and suggestions of things to see and do. Our trip to the Monterey Bay Aquarium was much improved by real-time suggestions from Google+ users, and pictures from the trip in turn made a couple other people think about going back.

Jeff Jarvis has long made the argument about the benefits of publicness, and that overemphasizing concerns about privacy undermines the benefits we could get by being connected. We use nebulous terms in justifying privacy like creepy, and stifle discussion of the value of openness. Our brains are really good at concocting (unlikely) scenarios of the bad things which could happen from sharing information, and not so good at seeing the good which can come of it. I'm definitely seeing that effect with public checkins, it seems scary but yet there is tremendous value in sharing them widely.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the social label.