Friday, December 30, 2011


Earlier this week Sam Biddle of Gizmodo published How the Hashtag Is Ruining the English Language, decrying the use of hashtags to add color or meaning to text. Quoth the article, "The hashtag is a vulgar crutch, a lazy reach for substance in the personal void – written clipart." #getoffhislawn

Written communication has never been as effective as in-person conversation, nor even as simple audio via telephone. Presented with plain text, we lack a huge array of additional channels for meaning: posture, facial expression, tone, cadence, gestures, etc. Smileys can be seen as an early attempt to add emotional context to online communication, albeit a limited one. #deathtosmileys

Yet language evolves to suit our needs and to fit advances in communications technology. A specific example: in the US we commonly say "Hello" as a greeting. It's considered polite, and it has always been the common practice... except that it hasn't. The greeting Hello came into common use in English in the late 19th century with the invention of the telephone. The custom until that time of speaking only after a proper introduction simply didn't work on the telephone: it wasn't practical over the distances involved to coordinate so many people. Use of Hello spread from the telephone into all areas of interaction. I suspect there were people at the time who bemoaned and berated the verbal crutch of the "hello" as they watched it push aside the more finely crafted greetings of the time. #getofftheirlawn

So now we have hashtags. Spawned by the space-constrained medium of the tweet, they are now spreading to other written forms. That they find traction in longer form media is an indication that they fill a need. They supply context, overlay emotional meaning, and convey intent, all lacking in current practice. It's easy to label hashtags as lazy or somehow vulgar. "[W]hy the need for metadata when regular words have been working so well?" questions the Gizmodo piece. Yet the sad reality is that regular words haven't been working so well. Even in the spoken word there is an enormous difference between oratory and casual conversation. A moving speech, filled with meaning in every phrase, takes a long time to prepare and rehearse. It's a rare event, not the norm day to day. The same holds true in the written word. "I apologize that this letter is so long - I lacked the time to make it short," quipped Blaise Pascal in the 17th century.


Gizmodo even elicited a response from Noam Chomsky, probably via email, "Don't use Twitter, almost never see it."

What I find most interesting about Chomsky's response is that it so perfectly illustrates the problem which emotive hashtags try to solve: his phrasing is slightly ambiguous. It could be interpreted as Chomsky saying he doesn't use Twitter and so never sees hashtags, or that anyone bothered by hashtags shouldn't use Twitter so they won't see them. He probably means the former, but in an in-person conversation there would be no ambiguity. Facial expression would convey his unfamiliarity with Twitter.

For Chomsky, adding a hashtag would require extra thought and effort which could instead have gone into rewording the sentence. That, I think, is the key. For those to whom hashtags are extra work, it all seems silly and even stupid. For those whose main form of communication is short texts, it doesn't. #getoffmylawntoo

Thursday, December 22, 2011

Refactoring Is Everywhere

Large Ditch Witch

The utilities used to run from poles, now they are underground. The functionality is unchanged, but the implementation is cleaner.

Friday, December 16, 2011

The Ada Initiative 2012

Donate to the Ada Initiative

Earlier this year I donated seed funding to the Ada Initiative, a non-profit organization dedicated to increasing participation of women in open technology and culture. One of their early efforts was development of an example anti-harassment policy for conference organizers, attempting to counter a number of high profile incidents of sexual harassment at events. Lacking any sort of plan for what to do after such an incident, conference organizers often did not respond effectively. This creates an incredibly hostile environment, and makes it even harder for women in technology to advance their careers through networking. Developing a coherent, written policy is a first step toward solving the problem.

The Ada Initiative is now raising funds for 2012 activities, including:

  • Ada’s Advice: a guide to resources for helping women in open tech/culture
  • Ada’s Careers: a career development community for women in open tech/culture
  • First Patch Week: help women write and submit a patch in a week
  • AdaCamp and AdaCon: (un)conferences for women in open tech/culture
  • Women in Open Source Survey: annual survey of women in open source


For me personally

There are many barriers discouraging women from participating in the technology field. Donating to the Ada Initiative is one thing I'm doing to try to change that. I'm posting this to ask other people to join me in supporting this effort.

My daughter is 6. The status quo is unacceptable. Time is short.

My daughter wearing Google hat

Monday, December 12, 2011

Go Go Gadget Google Currents!

Last week Google introduced Currents, a publishing and distribution platform for smartphones and tablets. I decided to publish this blog as an edition, and wanted to walk through how it works.


Publishing an Edition

Google Currents producer screenshot

Setting up the publisher side of Google Currents was straightforward. I entered data in a few tabs of the interface:

Edition settings: Entered the name for the blog, and the Google Analytics ID used on the web page.

Sections: added a "Blog" section, sourced from the RSS feed for this blog. I use Feedburner to post-process the raw RSS feed coming from Blogger. However I saw no difference in the layout of the articles in Google Currents between Feedburner and the Blogger feed. As Currents provides statistics using Google Analytics, I didn't want to have double counting by having the same users show up in the Feedburner analytics. I went with the RSS feed from Blogger.

Sections->Blog: After adding the Blog section I customized its CSS slightly, to use the paper tape image from the blog masthead as a header. I uploaded a 400x50 version of the image to the Media Library, and modified the CSS like so:

.customHeader {
  background-color: #f5f5f5;
  display: -webkit-box;
  background-image: url('attachment/CAAqBggKMNPYLDDD3Qc-GoogleCurrentsLogo.jpg');
  background-repeat: repeat-x;
  height: 50px;
  -webkit-box-flex: 0;
  -webkit-box-orient: horizontal;
  -webkit-box-pack: center;
}

Manage Articles: I didn't do anything special here. Once the system has fetched content from RSS it is possible to tweak its presentation here, but I doubt I will do that. There is a limit to the amount of time I'll spend futzing.

Media Library: I uploaded the header graphic to use in the Sections tab.

Grant access: anyone can read this blog.

Distribute: I had to click to verify content ownership. As I had already gone through the verification process for Google Webmaster Tools, the Producer verification went through without additional effort. I then clicked "Distribute" and voila!


The Point?

iPad screenshot of this site in Google Currents

Much of the publisher interface concerns formatting and presentation of articles. RSS feeds generally require significant work on the formatting to look reasonable, a service performed by Feedburner and by tools like Flipboard and Google Currents. Nonetheless, I don't think the formatting is the main point; presentation is a means to an end. RSS is a reasonable transport protocol, but people have pressed it into service as the supplier of presentation and layout as well by wrapping a UI around it. It's not very good at it. Publishing tools have to expend effort on presentation and layout to make it usable.

Nonetheless, for me at least, the main point of publishing to Google Currents is discoverability. I'm hopeful it will evolve into a service which doesn't just show me material I already know I'm interested in, but also becomes good at suggesting new material which fits my interests.


Community Trumps Content

A concern has been expressed that content distribution tools like this, which use web protocols but are not a web page, will kill off the blog comments which motivate many smaller sites to continue publishing. The thing is, in my experience at least, blog comments all but died long ago. Presentation of the content had nothing to do with it: Community trumps Content. That is, people motivated to leave comments tend to gravitate to an online community where they can interact. They don't confine themselves to material from a single site. Only the most massive blogs have the gravitational attraction to hold a community together. The rest quickly lose their atmosphere to Reddit/Facebook/Google+/etc. I am grateful when people leave comments on the blog, but I get just as much edification from a comment on a social site, and just as much consternation if the sentiment is negative, as if it is here. It is somewhat more difficult for me to find comments left on social sites, but let me be perfectly clear: that is my problem, and my job to stay on top of.


The Mobile Web

One other finding from setting up Currents: the Blogger mobile templates are quite good. The formatting of this site in a mobile browser is very nice, and similar to the formatting which Currents comes up with. To me Currents is mostly about discoverability, not just presentation.

Wednesday, December 7, 2011

Requiem for Jumbo Frames

This weekend Greg Ferro published an article about jumbo frames. He points to recent measurements showing no real benefit with large frames. Some years ago I worked on NIC designs, and at the time we talked about Jumbo frames a lot. It was always a tradeoff: improve performance by sacrificing compatibility, or live with the performance until hardware designs could make the 1500 byte MTU be as efficient as jumbo frames. The latter school of thought won out, and they delivered on it. Jumbo frames no longer offer a significant performance advantage.

Roughly speaking, software overhead for a networking protocol stack can be divided into two chunks:

  • Per-byte, which increases with each byte of data sent. Data copies, encryption, checksums, etc make up this kind of overhead.
  • Per-packet, which increases with each packet regardless of how big the packet is. Interrupts, socket buffer manipulation, protocol control block lookups, and context switches are examples of this kind of overhead.
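The two kinds of overhead can be put into a toy cost model. This is a minimal sketch with made-up cost units, not a measurement: the function name, the 40 bytes of assumed TCP/IP header, and the cost constants are all hypothetical.

```python
import math

def send_cost(payload_bytes, mtu, per_byte_cost, per_packet_cost, header_bytes=40):
    """Estimate the CPU cost of sending payload_bytes at a given MTU.

    Cost units are arbitrary and purely illustrative.
    """
    mss = mtu - header_bytes                     # payload carried by each packet
    packets = math.ceil(payload_bytes / mss)     # per-packet work scales with this
    return per_byte_cost * payload_bytes + per_packet_cost * packets

# Hypothetical constants: compare a 1500 byte MTU against a 9000 byte jumbo MTU.
payload = 1_000_000
standard = send_cost(payload, 1500, per_byte_cost=1.0, per_packet_cost=500.0)
jumbo = send_cost(payload, 9000, per_byte_cost=1.0, per_packet_cost=500.0)
print(standard, jumbo)
```

With these assumed constants the per-byte term is identical in both cases, and only the per-packet term shrinks with the larger MTU. How much jumbo frames help therefore depends entirely on the ratio between the two constants.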


Wayback machine to 1992

I'm going to talk about the evolution of operating systems and NICs starting from the 1990s, but will focus on Unix systems. DOS and MacOS 6.x were far more common back then, but modern operating systems evolved more similarly to Unix than to those environments.

Address spaces in user space, kernel, and NIC hardware

Let's consider a typical processing path for sending a packet in a Unix system in the early 1990s:

  1. Application calls write(). System copies a chunk of data into the kernel, to mbufs/mblks/etc.
  2. Kernel buffers handed to TCP/IP stack, which looks up the protocol control block (PCB) for the socket.
  3. Stack calculates a TCP checksum and populates the TCP, IP, and Ethernet headers.
  4. Ethernet driver copies kernel buffers out to the hardware. Programmed I/O using the CPU to copy was quite common in 1992.
  5. Hardware interrupts when the transmission is complete, allowing the driver to send another packet.

Altogether the data was copied two and a half times: from user space to kernel, from kernel to NIC, plus a pass over the data to calculate the TCP checksum. There were additionally per-packet overheads in looking up the PCB, populating headers, and handling interrupts.

The receive path was similar, with a NIC interrupt kicking off processing of each packet and two and a half copies up to the receiving application. There was more per-packet overhead for receive: where transmit could look up the PCB once and process a sizable chunk of data from the application in one swoop, RX always gets one packet at a time.

Jumbo frames were a performance advantage in this timeframe, but not a huge one. Larger frames reduced the per-packet overhead, but the per-byte overheads were significant enough to dominate the performance numbers.


Wayback Machine to 1999

An early optimization was elimination of the separate pass over the data for the TCP checksum. It could be folded into one of the data copies, and NICs also quickly added hardware support. [Aside: the separate copy and checksum passes in 4.4BSD allowed years of academic papers to be written, putting whatever cruft they liked into the protocol, yet still portraying it as a performance improvement by incidentally folding the checksum into a copy.] NICs also evolved to be DMA devices; the memory subsystem still had to bear the overhead of the copy to hardware, but the CPU load was alleviated. Finally, operating systems got smarter about leaving gaps for headers when copying data into the kernel, eliminating a bunch of memory allocation overhead to hold the TCP/IP/Ethernet headers.
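Folding the checksum into a copy can be sketched as follows. This is an illustration of the idea only, using the standard Internet checksum (RFC 1071); the function name is made up, and a real kernel would do this in optimized C or assembly rather than byte by byte.

```python
def copy_and_checksum(src: bytes, dst: bytearray) -> int:
    """Copy src into dst and compute the Internet checksum in the same pass.

    The data is touched once instead of twice (one pass to copy, one to sum).
    """
    total = 0
    n = len(src)
    for i in range(0, n - 1, 2):
        hi, lo = src[i], src[i + 1]
        dst[i] = hi                    # the copy...
        dst[i + 1] = lo
        total += (hi << 8) | lo        # ...and the 16-bit sum, together
    if n % 2:                          # odd trailing byte is padded with zero
        dst[n - 1] = src[n - 1]
        total += src[n - 1] << 8
    while total >> 16:                 # fold carries back in (ones' complement)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

data = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
buf = bytearray(len(data))
checksum = copy_and_checksum(data, buf)
```

The sample bytes are the worked example from RFC 1071, so the result can be checked against the value given there.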

Packet size vs throughput in 2000, 2.5x for 9180 byte vs 1500

I have data on packet size versus throughput in this timeframe, collected in the last months of 2000. It was gathered for a presentation at LCN 2000. It used an OC-12 ATM interface, where LAN emulation allowed MTUs up to 18 KBytes. I had to find an old system to run these, as the modern systems of the time could almost max out the OC-12 link with 1500 byte packets. I recall it being a Sparcstation-20. The ATM NIC supported TCP checksums in hardware and used DMA.

Roughly the year 1999 was the peak of when jumbo frames would have been most beneficial. Considerable work had been done by that point to reduce per-byte overheads, eliminating the separate checksumming pass and offloading data movement from the CPU. Some work had been done to reduce the per-packet overhead, but not as much. After 1999 additional hardware focussed on reducing the per-packet overhead, and jumbo frames gradually became less of a win.



Protocol stack handing a chunk of data to NIC

Large Segment Offload (LSO), referred to as TCP Segmentation Offload (TSO) in Linux circles, is a technique to copy a large chunk of data from the application process and hand it as-is to the NIC. The protocol stack generates a single set of Ethernet+TCP+IP headers to use as a template, and the NIC handles the details of incrementing the sequence number and calculating fresh checksums for a new header prepended to each packet. Chunks of 32K and 64K are common, so at a 1460 byte MSS the NIC transmits roughly 22 or 45 TCP segments without further intervention from the protocol stack.
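The segmentation step can be sketched in a few lines. This is a simplified model under stated assumptions: the TcpHeaderTemplate class and segment function are hypothetical names, the header carries only ports and a sequence number, and the per-segment checksum that real hardware recomputes is left out.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TcpHeaderTemplate:
    """Stand-in for the header template the stack hands to the NIC."""
    src_port: int
    dst_port: int
    seq: int

def segment(template, chunk: bytes, mss: int = 1460):
    """Split one large chunk into (header, payload) pairs of at most mss bytes."""
    segments = []
    for offset in range(0, len(chunk), mss):
        payload = chunk[offset:offset + mss]
        # Each segment gets a copy of the template with the sequence number
        # advanced by its offset into the chunk (32-bit wraparound).
        header = replace(template, seq=(template.seq + offset) % 2**32)
        segments.append((header, payload))
    return segments

template = TcpHeaderTemplate(src_port=49152, dst_port=80, seq=1000)
segs_32k = segment(template, bytes(32 * 1024))
segs_64k = segment(template, bytes(64 * 1024))
print(len(segs_32k), len(segs_64k))
```

With the assumed 1460 byte MSS, a 32 KB chunk becomes 23 segments (22 full plus one partial) and a 64 KB chunk becomes 45. The protocol stack is involved once per chunk either way.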

The interesting thing about LSO and Jumbo frames is that Jumbo frames no longer make a difference. The CPU only gets involved once for every large chunk of data, so the overhead is the same whether that chunk turns into 1500 byte or 9000 byte packets on the wire. The main impact of the frame size is the number of ACKs coming back, as most TCP implementations generate an ACK for every other frame. Transmitting jumbo frames would reduce the number of ACKs, but that kind of overhead is below the noise floor. We just don't care.

There is a similar technique for received packets called, imaginatively enough, Large Receive Offload (LRO). For LSO the NIC and protocol software are in control of when data is sent. For LRO, packets just arrive whenever they arrive. The NIC has to gather packets from each flow to hand up in a chunk. It's quite a bit more complex, and doesn't tend to work as well as LSO. As modern web application servers tend to send far more data than they receive, LSO has been of much greater importance than LRO.
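The gathering that LRO has to do can be sketched as below. This is a deliberately naive model, not how any particular NIC works: the coalesce function and its tuple layout are hypothetical, and timers, TCP flags, and sequence number wraparound are all ignored.

```python
def coalesce(packets, max_chunk=65536):
    """Merge consecutive in-order segments of each flow into larger chunks.

    packets is an iterable of (flow_id, seq, payload) in arrival order.
    Returns (flow_id, start_seq, data) chunks in the order they are handed up.
    """
    pending = {}   # flow_id -> (start_seq, bytearray of gathered data)
    handed_up = []
    for flow, seq, payload in packets:
        if flow in pending:
            start, buf = pending[flow]
            in_order = seq == start + len(buf)
            if in_order and len(buf) + len(payload) <= max_chunk:
                buf.extend(payload)            # extend the chunk being gathered
                continue
            # Gap or full chunk: hand up what has been gathered so far.
            handed_up.append((flow, start, bytes(buf)))
        pending[flow] = (seq, bytearray(payload))
    for flow, (start, buf) in pending.items():  # flush whatever is left
        handed_up.append((flow, start, bytes(buf)))
    return handed_up

arrivals = [("a", 0, b"xx"), ("b", 100, b"yy"), ("a", 2, b"zz"), ("a", 10, b"qq")]
chunks = coalesce(arrivals)
```

Even this toy version has to track state per flow and decide when to give up waiting, which hints at why the receive side is harder than the transmit side.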

Large Segment Offload mostly removed the justification for jumbo frames. Nonetheless support for larger frame sizes is almost universal in modern networking gear, and customers who were already using jumbo frames have generally carried on using them. Moderately larger frame support is also helpful for carriers who want to encapsulate customer traffic within their own headers. I expect hardware designs to continue to accommodate it.


TCP Calcification

There has been a big downside of pervasive use of LSO: it has become the immune response preventing changes in protocols. NIC designs vary widely in their implementation of the technique, and some of them are very rigid. Here "rigid" is a euphemism for "mostly crap." There are NICs which hard-code how to handle protocols as they existed in the early part of this century: Ethernet header, optional VLAN header, IPv4/IPv6, TCP. Add any new option, or any new header, and some portion of existing NICs will not cope with it. Making changes to existing protocols or adding new headers is vastly harder now, as changes are likely to throw the protocol back into the slow zone and render moot any of the benefits it brings.

It used to be that any new TCP extension had to carefully negotiate between sender and receiver in the SYN/SYN+ACK to make sure both sides would support an option. Nowadays due to LSO and to the pervasive use of middleboxes, we basically cannot add options to TCP at all.

I guess the moral is, "be careful what you wish for."

Monday, November 28, 2011

QFabric Followup

In August this site published a series of posts about the Juniper QFabric. Since then Juniper has released hardware documentation for the QFabric components, so it's time for a followup.

QF edge Nodes, Interconnects, and Directors

QFabric consists of Nodes at the edges wired to large Interconnect switches in the core. The whole collection is monitored and managed by out of band Directors. Juniper emphasizes that the QFabric should be thought of as a single distributed switch, not as a network of individual switches. The entire QFabric is managed as one entity.

Control header prepended to frame

The fundamental distinction between QFabric and conventional switches is in the forwarding decision. In a conventional switch topology each layer of switching looks at the L2/L3 headers to figure out what to do. The edge switch sends the packet to the distribution switch, which examines the headers again before sending the packet on towards the core (which examines the headers again). QFabric does not work this way. QFabric functions much more like the collection of switch chips inside a modular chassis: the forwarding decision is made by the ingress switch and is conveyed through the rest of the fabric by prepending control headers. The Interconnect and egress Node forward the packet according to its control header, not via another set of L2/L3 lookups.
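The general pattern of deciding once at ingress and forwarding on a prepended header can be sketched as follows. Everything here is hypothetical: the ControlHeader fields, the table layouts, and the function names are invented for illustration and say nothing about Juniper's actual header format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ControlHeader:
    """Hypothetical control header: where the frame should leave the fabric."""
    egress_node: str
    egress_port: int

def ingress(frame: dict, forwarding_table: dict):
    """The only place the frame's own headers are examined."""
    node, port = forwarding_table[frame["dst"]]
    return ControlHeader(node, port), frame

def interconnect(tagged, links: dict) -> str:
    """Pick the link toward the egress Node using the control header alone."""
    header, _frame = tagged
    return links[header.egress_node]

def egress(tagged):
    """Strip the control header and send the frame out the chosen port."""
    header, frame = tagged
    return header.egress_port, frame

table = {"00:11:22:33:44:55": ("node7", 12)}   # made-up L2 forwarding entry
links = {"node7": "interconnect-link-3"}       # made-up fabric wiring
frame = {"dst": "00:11:22:33:44:55", "payload": b"hello"}

tagged = ingress(frame, table)
link = interconnect(tagged, links)
out_port, out_frame = egress(tagged)
```

Note that only ingress() consults the forwarding table; the other two stages never look at frame["dst"].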


Node Groups

The Hardware Documentation describes two kinds of Node Groups, Server and Network, which gather multiple edge Nodes together for common purposes.

  • Server Node Groups are straightforward: normally the edge Nodes are independent, connecting servers and storage to the fabric. Pairs of edge switches can be configured as Server Node Groups for redundancy, allowing LAGs to span the two switches.
  • Network Node Groups configure up to eight edge Nodes to interconnect with remote networks. Routing protocols like BGP or OSPF run on the Director systems, so the entire Group shares a common Routing Information Base and other data.

Why have Groups? It's somewhat easier to understand the purpose of the Network Node Group: routing processes have to be spun up on the Directors, and perhaps those processes have to point to some distinct entity to operate with. Why have Server Node Groups, though? Redundant server connections are certainly beneficial, but why require an additional fabric configuration to allow it?

Ingress fanout to four LAG member ports

I don't know the answer, but I suspect it has to do with Link Aggregation (LAG). Server Node Groups allow a LAG to be configured using ports spanning the two Nodes. In a chassis switch, LAG is handled by the ingress chip. It looks up the destination address to find the destination port. Every chip knows the membership of all LAGs in the chassis. The ingress chip computes a hash of the packet to pick which LAG member port to send the packet to. This is how LAG member ports can be on different line cards: the ingress chip sends the packet to the correct card.

Ingress fanout to four LAG member ports

The downside of implementing LAG at ingress is that every chip has to know the membership of all LAGs in the system. Whenever a LAG member port goes down, all chips have to be updated to stop using it. With QFabric, where ingress chips are distributed across a network and the largest fabric could have thousands of server LAG connections, updating all of the Nodes whenever a link goes down could take a really long time. LAG failure is supposed to be quick, with minimal packet loss when a link fails. Therefore I wonder if Juniper has implemented LAG a bit differently, perhaps by handling member port selection in the Interconnect, in order to minimize the time to handle a member port failure.
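The conventional ingress-side LAG selection described above can be sketched like this. It is a generic illustration, not QFabric's implementation: the class and function names, the port names, and the choice of CRC32 as the hash are all assumptions.

```python
import zlib

def pick_member(flow, members):
    """Hash the flow identifier to choose one LAG member port.

    The same flow always maps to the same member, keeping its packets in order.
    """
    key = "|".join(str(field) for field in flow).encode()
    return members[zlib.crc32(key) % len(members)]

class IngressChip:
    """Each ingress chip holds its own copy of every LAG's membership."""
    def __init__(self, lags):
        self.lags = {name: list(ports) for name, ports in lags.items()}

    def forward(self, lag_name, flow):
        return pick_member(flow, self.lags[lag_name])

def member_down(chips, lag_name, port):
    """A failed member port has to be removed from every chip's copy."""
    for chip in chips:
        chip.lags[lag_name].remove(port)

lags = {"lag1": ["card1/p1", "card1/p2", "card2/p1", "card2/p2"]}
chips = [IngressChip(lags) for _ in range(3)]
flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 80)   # src, dst, protocol, ports

before = chips[0].forward("lag1", flow)
member_down(chips, "lag1", "card2/p1")
after = chips[0].forward("lag1", flow)
```

The loop in member_down() is the cost in question: it is cheap across a handful of chips in one chassis, and much less so when the "chips" are Nodes spread across a network.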

I feel compelled to emphasize again: I'm making this up. I don't know how QFabric is implemented nor why Juniper made the choices they made. It's just fun to speculate.


Virtualized Junos

Regarding the Director software, the Hardware Documentation says, "[Director devices] run the Junos operating system (Junos OS) on top of a CentOS foundation." Now that is an interesting choice. Way, way back in the mists of time, Junos started from NetBSD as its base OS. NetBSD is still a viable project and runs on modern x86 machines, yet Juniper chose to hoist Junos atop a Linux base instead.

I suspect that in the intervening time, the Junos kernel and platform support diverged so far from NetBSD development that it became impractical to integrate recent work from the public project. Juniper would have faced a substantial effort to handle modern x86 hardware, and chose instead to virtualize the Junos kernel in a VM whose hardware was easier to support. I'll bet the CentOS on the Director is the host for a Xen hypervisor.

Update: in the comments, Brandon Bennett and Julien Goodwin both note that Junos used FreeBSD as its base OS, not NetBSD.

Aside: with network OSes developed in the last few years, companies have tended to put effort into keeping the code portable enough to run on a regular x86 server. The development, training, QA, and testing benefits of being able to run on a regular server are substantial. That means implementing a proper hardware abstraction layer to handle running on a platform which doesn't have the fancy switching silicon. In the 1990s when Junos started, running on x86 was not common practice. We tended to do development on Sparcstations, DECstations, or some other fancy RISC+Unix machine and didn't think much about Intel. The RISC systems were so expensive that one would never outfit a rack of them for QA; it was cheaper to build a bunch of switches instead.

Aside, redux: Junosphere also runs Junos as a virtual machine. In a company the size of Juniper these are likely to have been separate efforts, which might not even have known about each other at first. Nonetheless the timing of the two products is close enough that there may have been some cross-group pollination and shared underpinnings.


Misc Notes

  • The Director communicates with the Interconnects and Nodes via a separate control network, handled by Juniper's previous generation EX4200. This is an example of using a simpler network to bootstrap and control a more complex one.
  • QFX3500 has four QSFPs for 40 gig Ethernet. These can each be broken out into four 10G Ethernet ports, except the first one which supports only three 10G ports. That is fascinating. I wonder what the fourth one does?

That's all for now. We may return to QFabric as it becomes more widely deployed or as additional details surface.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, November 23, 2011

Unnatural BGP

Last week Martin Casado published some thoughts about using OpenFlow and Software Defined Networking for simple forwarding. That is, does SDN help in distributing shortest path routes for IP prefixes? BGP/OSPF/IS-IS/etc are pretty good for this, with the added benefit of being fully distributed and thoroughly debugged.

The full article is worth a read. The summary (which Martin himself supplied) is "I find it very difficult to argue that SDN has value when it comes to providing simple connectivity." Existing routing protocols are quite good at distributing shortest path prefix routes, the real value of SDN is in handling more complex behaviors.

To expand on this a bit, there have been various efforts over the years to tailor forwarding behavior using more esoteric cost functions. The monetary cost of using a link is a common one to optimize for, as it provides justification for spending on a development effort and also because the business arrangements driving the pricing tend not to distill down to simple weights on a link. Providers may want to keep their customer traffic off of competing networks who are in a position to steal the customer. Transit fees may kick in if a peer delivers significantly more traffic than it receives, providing an incentive to preferentially send traffic through a peer in order to keep the business arrangement equitable. Many of these examples are covered in slides from a course by Jennifer Rexford, who spent several years working on such topics at AT&T Research.

BGP peering between routers at low weight, from each router to controller at high weight

Until quite recently these systems had to be constructed using a standard routing protocol, because that is what the routers would support. BGP is a reasonable choice for this because its interoperability between modern implementations is excellent. The optimization system would peer with the routers, periodically recompute the desired behavior, and export those choices as the best route to destinations. To avoid having the Optimizer be a single point of failure able to bring down the entire network, the routers would retain peering connections with each other at a low weight as a fallback. The fallback routes would never be used so long as the Optimizer routes are present.
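The fallback behavior comes down to a preference comparison, which can be sketched as below. This is a simplification: real BGP best-path selection has many more steps, and the prefixes, next hops, weights, and the best_routes function are all made up for illustration.

```python
def best_routes(routes):
    """Pick the highest-weight route for each prefix.

    routes is a list of (prefix, next_hop, weight, source) tuples.
    Returns {prefix: (next_hop, source)}.
    """
    best = {}
    for prefix, next_hop, weight, source in routes:
        if prefix not in best or weight > best[prefix][1]:
            best[prefix] = (next_hop, weight, source)
    return {prefix: (nh, src) for prefix, (nh, _w, src) in best.items()}

# Hypothetical routes: the Optimizer advertises at high weight, peers at low.
routes = [
    ("192.0.2.0/24", "10.1.1.1", 10, "peer"),
    ("192.0.2.0/24", "10.2.2.2", 200, "optimizer"),
    ("198.51.100.0/24", "10.1.1.1", 10, "peer"),
]
normal = best_routes(routes)

# If the Optimizer fails its routes are withdrawn, leaving the fallback.
surviving = [r for r in routes if r[3] != "optimizer"]
fallback = best_routes(surviving)
```

While the Optimizer is up its route wins for 192.0.2.0/24; once its routes are withdrawn the low-weight peer route takes over without any other change.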

This works. It solves real problems. However it is hard to ignore the fact that BGP adds no value in the implementation of the optimization system. It's just an obstacle in the way of getting entries into the forwarding tables of the switch fabric. It also constrains the forwarding behaviors to those which BGP can express, generally some combination of destination address and QoS.

BGP peering between routers, SDN to controller

Product support for software defined networking is now appearing in the market. These are generally parallel control paths alongside the existing routing protocols. SDN deposits routes into the same forwarding tables as BGP and OSPF, with some priority or precedence mechanism to control arbitration.

By using an SDN protocol these optimization systems are no longer constrained to what BGP can express; they can operate on any information which the hardware supports. Yet even here there is an awkward interaction with the other protocols. It's useful to keep the peering connections with other routers as a fallback in case of controller failure, but they are not well integrated. We can only set precedences between SDN and BGP and hope for the best.

I do wonder if the existing implementation of routing protocols needs a more significant rethink. There is great value in retaining compatibility with the external interfaces: being able to peer with existing BGP/OSPF/etc nodes is a huge benefit. In contrast, there is little value to retaining the internal implementation choices inside the router. The existing protocols could be made to cooperate more flexibly with other inputs. More speculatively, extensions to the protocol itself could label routes which are expected to be overridden by another source, and only present as a fallback path.

Monday, November 14, 2011

The Computer is the Network

Modern commodity switch fabric chips are amazingly capable, but their functionality is not infinite. In particular their parsing engines are generally fixed function, extracting information from the set of headers they were designed to process. Similarly the ability to modify packets is constrained to specifically designed-in protocols, not an infinitely programmable rewrite engine.

Software defined networks are a wonderful thing, but development of an SDN agent to drive an existing ASIC does not suddenly make it capable of packet handling it wasn't already designed to do. At best, it might expose functions of which the hardware was always capable but had not been utilized by the older software. Yet even that is questionable: once a platform goes into production, the expertise necessary to thoroughly test and develop bug workarounds for ASIC functionality rapidly disperses to work on new designs. If part of the functionality isn't ready at introduction it is often removed from the documentation and retargeted as a feature of the next chip.


Decisions at the Edge

MPLS networks have an interesting philosophy: the switching elements at the core are conceptually simple, driven by a label stack prepended to the packet. Decisions are made at the edge of the network wherever possible. The core switches may have complex functionality dealing with fast reroutes or congestion management, but they avoid having to re-parse the payloads and make new forwarding decisions.
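The core's job can be sketched as a single table lookup on the top label. This is a minimal model of label swapping: the table layout, label values, and core_forward function are invented for illustration, and features such as fast reroute are left out.

```python
def core_forward(label_stack, label_table):
    """Forward based only on the top label; the payload is never parsed.

    label_table maps an incoming label to (action, new_label, out_port).
    Returns the updated label stack and the output port.
    """
    action, new_label, out_port = label_table[label_stack[0]]
    if action == "swap":
        return [new_label] + label_stack[1:], out_port
    if action == "pop":
        return label_stack[1:], out_port
    raise ValueError(f"unknown action {action!r}")

# Hypothetical table for one core switch.
label_table = {
    100: ("swap", 200, 3),
    200: ("pop", None, 1),
}

# The edge already decided the path and pushed labels 100 (outer) and 7 (inner).
stack, port = core_forward([100, 7], label_table)
stack2, port2 = core_forward(stack, label_table)
```

All of the classification happened at the edge when the labels were pushed; the core only swaps or pops.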

Ethernet switches have mostly not followed this philosophy; in fact we've essentially followed the opposite path. We've tended to design in features and capacity at the same time. Larger switch fabrics with more capacity also tend to have more features. Initially this happened because a chip with more ports required a larger silicon die to have room for all of the pins. Thus, there was more room for digital logic. Vendors have accentuated this in their marketing plans, omitting software support for features in "low end" edge switches even if they use the same chipset as the more featureful aggregation products.

This leaves software defined networking in a bit of a quandary. The MPLS model is simpler to reason about for large collections of switches: you don't have a combinatorial explosion of decision-making at each hop in the forwarding. Yet non-MPLS Ethernet switches have mostly not evolved in that way, and the edge switches don't have the capability to make all of the decisions for behaviors we might want.


Software Switches to the Rescue

A number of market segments have gradually moved to a model where the first network element to touch the packet is implemented mostly in software. This allows the hope of substantially increasing their capability. A few examples:

vswitch running in the Hypervisor

Datacenters: The first hop is a software switch running in the Hypervisor, like the VMware vSwitch or Cisco Nexus 1000v.

WAN Optimizer with 4 CPUs

Wide Area Networks: WAN optimizers have become quite popular because they save money by reducing the amount of traffic sent over the WAN. These are mostly software products at this point, implementing protocol-specific compression and deduplication. Forthcoming 10 Gig products from Infineta appear to be the first products containing significant amounts of custom hardware.

Wifi AP with CPU, Wifi MAC, and Ethernet MAC

Wifi Access Points: Traditional, thick APs as seen in the consumer and carrier-provided equipment market are a CPU with Ethernet and Wifi, forwarding packets in software.
Thin APs for Enterprise use, as deployed by Aruba/Airespace/etc, are rather different: the real forwarding happens in hardware back at a central controller.

Cable modem with DOCSIS and Ethernet

Carrier Network Access Units: Like Wifi APs, access gear for DSL and DOCSIS networks is usually a CPU with the appropriate peripherals and forwards frames in software.

Enterprise switch with CPU handling all packets, and a big red X through it

Enterprise: Just kidding, the Enterprise is still firmly in the "more hardware == more better" category. Most of the problems to be solved in Enterprise networking today deal with access control, security, and malware containment. Though CPU forwarding at the edge is one solution to that (attempted by ConSentry and Nevis, among others), the industry mostly settled on out of band approaches.


The Computer is the Network

The Sun Microsystems tagline through most of the 1980s was The Network is the Computer. At the time it referred to client-server computing like NFS and RPC, though the modern web has made this a reality for many people who spend most of their computing time with social and communication applications via the web. It's a shame that Sun itself didn't live to see the day.

We're now entering an era where the Computer is the Network. We don't want to depend upon the end-station itself to mark its packets appropriately, mainly due to security and malware considerations, but we want the flexibility of having software touch every packet. Market segments which provide that capability, like datacenters, WAN connections, and even service providers, are going to be a lot more interesting in the next several years.

Friday, October 28, 2011

Tweetflection Point

Last week at the Web 2.0 Summit in San Francisco, Twitter CEO Dick Costolo talked about recent growth in the service and how iOS5 had caused a sudden 3x jump in signups. He also said daily Tweet volume had reached 250 million. There are many, many estimates of the volume of Tweets sent, but I know of only three which are verifiable as directly from Twitter:

Graphing these on a log scale shows the rate of growth in Tweet volume, almost tripling in one year.

Graph of average daily Tweet volume

This graph is misleading though, as we have so few data points. It is very likely that, like signups for the service, the rate of growth in tweet volume suddenly increased after iOS5 shipped. Let's assume the rate of growth also tripled for the few days after the iOS5 launch, and zoom in on the tail end of the graph. It is quite similar up until a sharp uptick at the end.

Speculative graph of average daily Tweet volume, knee of curve at iOS5 launch.

The reality is somewhere between those two graphs, but likely still steep enough to be terrifying to the engineers involved. iOS5 will absolutely have an impact on the daily volume of Tweets; it would be ludicrous to think otherwise. It probably isn't so abrupt a knee in the curve as shown here, but it has to be substantial. Tweet growth is on a new and steeper slope now. It used to triple in a bit over a year; now it will triple in far less than one year.


Why this matters

Even five months ago, the traffic to carry the Twitter Firehose was becoming a challenge to handle. At that time the average throughput was 35 Mbps, with spikes up to about 138 Mbps. Scaling those numbers to today would be 56 Mbps sustained with spikes to 223 Mbps, and about one year until the spikes exceed a gigabit.
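For anyone who wants to play with the extrapolation, here is a rough sketch of the arithmetic. It assumes simple compounding growth, and the function names and the annual growth factors are mine, picked for illustration rather than taken from any Twitter data. The main thing it shows is how sensitive the "when do we hit a gigabit" answer is to the growth rate you assume.

```python
import math

def project(rate_mbps, annual_factor, months):
    """Rate after `months` of compounding growth at `annual_factor` per year."""
    return rate_mbps * annual_factor ** (months / 12.0)

def months_until(rate_mbps, target_mbps, annual_factor):
    """Months of compounding growth needed to go from rate_mbps to target_mbps."""
    return 12.0 * math.log(target_mbps / rate_mbps) / math.log(annual_factor)

if __name__ == "__main__":
    # Five months of growth applied to the earlier average throughput.
    print("sustained now: %.1f Mbps" % project(35.0, 3.0, 5))
    # Time for 223 Mbps spikes to reach 1000 Mbps under two assumed growth rates.
    for factor in (3.0, 4.5):
        print("x%.1f per year: %.1f months" % (factor, months_until(223.0, 1000.0, factor)))
```

Plain tripling per year puts the gigabit mark somewhat more than a year out; a steeper post-iOS5 slope pulls it in closer.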

The indications I've seen are that the feed from Twitter is still sent uncompressed. Compressing using gzip (or Snappy) would gain some breathing room, but not solve the underlying problem. The underlying problem is that the volume of data is increasing way, way faster than the capacity of the network and computing elements tasked with handling it. Compression can reduce the absolute number of bits being sent (at the cost of even more CPU), but not reduce the rate of growth.

Fundamentally, there is a limit to how fast a single HTTP stream can go. As described in the post earlier this year, we've scaled network and CPU capacity by going horizontal and spreading load across more elements. Use of a single very fast TCP flow restricts the handling to a single network link and single CPU in a number of places. The network capacity has some headroom still, particularly by throwing money at it in the form of 10G Ethernet links. The capacity of a single CPU core to process the TCP stream is the more serious bottleneck. At some point relatively soon it will be more cost effective to split the Twitter firehose across multiple TCP streams, for easier scaling. The Tweet ID (or a new sequence number) could put tweets back into an absolute order when needed.
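To make the last point concrete, here is a minimal sketch of splitting a feed across several streams and putting it back in order. It assumes each tweet is a dict with an integer "id" that increases over time, and that sharding is done by ID modulo the number of streams; both are my assumptions for illustration, not a description of how Twitter would actually do it.

```python
import heapq

def shard(tweets, num_streams):
    """Split one ID-ordered feed into num_streams feeds, each still ID-ordered."""
    streams = [[] for _ in range(num_streams)]
    for tweet in tweets:
        streams[tweet["id"] % num_streams].append(tweet)
    return streams

def merge(streams):
    """Recombine per-stream feeds into a single feed ordered by tweet ID."""
    return list(heapq.merge(*streams, key=lambda t: t["id"]))
```

Each stream can then be handled by its own TCP connection and CPU core, and only consumers which care about absolute ordering pay for the merge.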

Unbalanced link aggregation with a single high speed HTTP firehose.

Update: My math was off. Even before the iOS5 announcement, the rate of growth was nearly tripling in one year. Corrected post.

Monday, October 24, 2011

Well Trodden Technology Paths

Modern CPUs are astonishingly complex, with huge numbers of caches, lookaside buffers, and other features to optimize performance. Hardware reset is generally insufficient to initialize these features for use: reset leaves them in a known state, but not a useful one. Extensive software initialization is required to get all of the subsystems working.

It's quite difficult to design such a complex CPU to handle its own initialization out of reset. The hardware verification wants to focus on the "normal" mode of operation, with all hardware functions active, but handling the boot case requires that a vast number of partially initialized CPU configurations also be verified. Any glitch in these partially initialized states results in a CPU which cannot boot, and is therefore useless.

Large CPU with many cores, and a small 68k CPU in the corner.

Many, and I'd hazard to guess most, complex CPU designs reduce their verification cost and design risk by relying on a far simpler CPU buried within the system to handle the earliest stages of initialization. For example, the Montalvo x86 CPU design contained a small 68000 core to handle many tasks of getting it running. The 68k was an existing, well proven logic design requiring minimal verification by the ASIC team. That small CPU went through the steps to initialize all the various hardware units of the larger CPUs around it before releasing them from reset. It ran an image fetched from flash which could be updated with new code as needed.


Warning: Sudden Segue Ahead

Networking today is at the cusp of a transition, one which other parts of the computing market have already undergone. We're quickly shifting from fixed-function switch fabrics to software defined networks. This shift bears remarkable similarities to the graphics industry shifting from fixed 3D pipelines to GPUs, and of CPUs shedding their coprocessors to focus on delivering more general purpose computing power.

Networking will also face some of the same issues as modern CPUs, where the optimal design for performance in normal operation is not suitable for handling its own control and maintenance. Last week's ruminations about L2 learning are one example: though we can make a case for software provisioning of MAC addresses, the result is a network which doesn't handle topology changes without software stepping in to reprovision.

The control network cannot, itself, seize up whenever there is a problem. The control network has to be robust in handling link failures and topology changes, to allow software to reconfigure the rest of the data-carrying network. This could mean an out of band control network, hooked to the ubiquitous Management ports on enterprise and datacenter switches. It might also be that a VLAN used for control operates in a very different fashion than one used for data, learning L2 addresses dynamically and using MSTP to handle link failures.

All in all, it's an exciting time to be in networking.

Sunday, October 23, 2011

Tornado HTTPClient Chunked Downloads

Tornado is an open source web server in Python. It was originally developed to power FriendFeed, and excels at non-blocking operations for real-time web services.

Tornado includes an HTTP client as well, to fetch files from other servers. I found a number of examples of how to use it, but all of them would fetch the entire item and return it in a callback. I plan to fetch some rather large multi-megabyte files, and don't see a reason to hold them entirely in memory. Here is an example of how to get partial updates as the download progresses: pass in a streaming_callback to the HTTPRequest().

The streaming_callback will be called for each chunk of data from the server. 4 KBytes is a common chunk size. The async_callback will be called when the file has been fully fetched; the response.body will be empty.


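import os
import tempfile
import tornado.httpclient
import tornado.ioloop

class HttpDownload(object):
  def __init__(self, url, ioloop):
    self.ioloop = ioloop
    self.tempfile = tempfile.NamedTemporaryFile(delete=False)
    req = tornado.httpclient.HTTPRequest(
        url = url,
        streaming_callback = self.streaming_callback)
    http_client = tornado.httpclient.AsyncHTTPClient()
    # Callback-style fetch, as supported by Tornado releases before 6.0.
    http_client.fetch(req, self.async_callback)

  def streaming_callback(self, data):
    # Called for each chunk: write it to disk instead of holding it in memory.
    self.tempfile.write(data)

  def async_callback(self, response):
    # Called once when the fetch finishes; response.body is empty here.
    self.tempfile.close()
    if response.error:
      print("Failed")
      os.unlink(self.tempfile.name)
    else:
      print("Success: %s" % self.tempfile.name)
    self.ioloop.stop()

def main():
  ioloop = tornado.ioloop.IOLoop.instance()
  dl = HttpDownload("", ioloop)
  ioloop.start()

if __name__ == '__main__':
  main()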

I'm mostly blogging this for my own future use, to be able to find how to do something I remember doing before. There you go, future me.

Wednesday, October 19, 2011

Layer 2 History

Why use L2 networks in datacenters?
Virtual machines need to move from one physical server to another, to balance load. To avoid disrupting service, their IP address cannot change as a result of this move. That means the servers need to be in the same L3 subnet, leading to enormous L2 networks.

Why are enormous L2 networks a problem?
A switch looks up the destination MAC address of the packet it is forwarding. If the switch knows what port that MAC address is on, it sends the packet to that port. If the switch does not know where the MAC address is, it floods the packet to all ports. The amount of flooding traffic tends to rise as the number of stations attached to the L2 network increases.

Transition from half duplexed Ethernet to L2 switching.

Why do L2 switches flood unknown address packets?
So they can learn where that address is. Flooding the packet to all ports means that if that destination exists, it should see the packet and respond. The source address in the response packet lets the switches learn where that address is.

Why do L2 switches need to learn addresses dynamically?
Because they replaced simpler repeaters (often called hubs). Repeaters required no configuration; they just repeated the packet they saw on one segment to all other segments. Requiring extensive configuration of MAC addresses for switches would have been an enormous drawback.
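The flood-and-learn behavior from the last few answers fits in a few lines of Python. This is only a sketch: the class and method names are made up, ports are plain integers, and real switches also age out table entries and handle VLANs, which is ignored here.

```python
class LearningSwitch(object):
    """Toy L2 switch: learn source addresses, flood unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port where it was last seen

    def forward(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is on the segment the frame came from: nothing to do.
            return []
        return [out_port]
```

The first frame to an unknown station is flooded; once the reply comes back, both directions are forwarded to a single port with no configuration at all.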

Why did repeaters send packets to all segments?
Repeaters were developed to scale up Ethernet networks. Ethernet at that time mostly used coaxial cable. Once attached to the cable, the station could see all packets from all other stations. Repeaters kept that same property.

How could all stations see all packets?
There were limits placed on the maximum cable length, propagation delay through a repeater, and the number of repeaters in an Ethernet network. The speed of light in the coaxial cable used for the original Ethernet networks is 0.77c, or 77% of the speed of light in a vacuum. Ethernet has a minimum packet size to allow sufficient time for the first bit of the packet to propagate all the way across the topology and back before the packet ends transmission.

So there you go. We build datacenter networks this way because of the speed of light in coaxial cable.
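The timing constraint in that last answer can be put into rough numbers. The sketch below assumes classic 10 Mbps Ethernet with its 64 byte minimum frame, and it ignores repeater and transceiver delays entirely, so the distance it computes is an upper bound on cable alone rather than the limit in the actual standard. The constant and function names are mine.

```python
C_VACUUM_M_PER_S = 299792458.0   # speed of light in a vacuum
VELOCITY_FACTOR = 0.77           # propagation speed in the coax, relative to c
BIT_RATE_BPS = 10e6              # assumption: original 10 Mbps Ethernet
MIN_FRAME_BITS = 64 * 8          # assumption: 64 byte minimum frame

def frame_time(bits=MIN_FRAME_BITS, bit_rate=BIT_RATE_BPS):
    """Seconds needed to transmit a frame of the given size."""
    return bits / bit_rate

def max_one_way_distance(bits=MIN_FRAME_BITS, bit_rate=BIT_RATE_BPS,
                         velocity_factor=VELOCITY_FACTOR):
    """Longest cable (meters) where the first bit can make a round trip
    before the sender finishes transmitting the frame."""
    round_trip_m = frame_time(bits, bit_rate) * velocity_factor * C_VACUUM_M_PER_S
    return round_trip_m / 2.0
```

A minimum size frame takes 51.2 microseconds to send, which allows for a bit under six kilometers of bare cable each way. The delay budget for repeaters eats up much of that.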

Monday, October 17, 2011

Complexity All the Way Down

Jean-Baptiste Queru recently wrote a brilliant essay titled Dizzying but invisible depth, a description of the sheer, unimaginable complexity at each layer of modern computing infrastructure. It is worth a read.

Wednesday, October 12, 2011

Dennis Ritchie, 1941-2011

Kernighan and Ritchie _The C Programming Language_

K&R C is the finest programming language book ever published. Its terseness is a hallmark of the work of Dennis Ritchie; it says exactly what needs to be said, and nothing more.

Rest in Peace, Dennis Ritchie.

The first generation of computer pioneers are already gone. We're beginning to lose the second generation.

Monday, October 10, 2011

In the last decade we have enjoyed a renaissance of programming language development. Clojure, Scala, Python, C#/F#/et al, Ruby (and Rails), Javascript, node.js, Haskell, Go, and the list goes on. Development of many of those languages started in the 1990s, but adoption accelerated in the 2000s.

Why now? There are probably a lot of reasons, but I want to opine on one.

HTTP is our program linker.

We no longer have to worry about linking to a gazillion libraries written in different languages, with all of the compatibility issues that entails. We no longer build large software systems by linking it all into ginormous binaries, and that loosens a straitjacket which made it difficult to stray too far from C. We dabbled with DCE/CORBA/SunRPC as a way to decouple systems, but RPC marshaling semantics still dragged in a bunch of assumptions about data types.

It took the web and the model of software as a service running on server farms to really decompose large systems into cooperating subsystems which could be implemented any way they like. Facebook can implement chat in Erlang, Akamai can use Clojure, Google can mix C++ with Java/Python/Go/etc. It is all connected together via HTTP, sometimes carrying SOAP or other RPCs, and sometimes with RESTful interfaces even inside the system.

Friday, October 7, 2011

Finding Ada, 2011

Ada Lovelace Day aims to raise the profile of women in science, technology, engineering and maths by encouraging people around the world to talk about the women whose work they admire. This international day of celebration helps people learn about the achievements of women in STEM, inspiring others and creating new role models for young and old alike.

For Ada Lovelace Day 2010 I analyzed a patent for a frequency hopping control system for guided torpedoes, granted to Hedy Lamarr and George Antheil. For Ada Lovelace Day this year I want to share a story from early in my career.

After graduation I worked on ASICs for a few years, mostly on Asynchronous Transfer Mode NICs for Sun workstations. In the 1990s Sun made large investments in ATM: designed its own Segmentation and Reassembly ASICs, wrote a Q.2931 signaling stack, adapted NetSNMP as an ILMI stack, wrote LAN Emulation and MPOA implementations, etc.

Yet ATM wasn't a great fit for carrying data traffic. Its overhead for cell headers was very high, it had an unnatural fondness for Sonet as its physical layer, and it required a signaling protocol far more complex than the simple ARP protocol of Ethernet.

Cell loss == packet loss.

Its most pernicious problem for data networking was in dealing with congestion. There was no mechanism for flow control, because ATM evolved out of a circuit switched world with predictable traffic patterns. Congestive problems come when you try to switch packets and deal with bursty traffic. In an ATM network the loss of a single cell would render the entire packet unusable, but the network would be further congested carrying the remaining cells of that packet's corpse.

Allyn Romanow at Sun Microsystems and Sally Floyd from the Lawrence Berkeley Labs conducted a series of simulations, ultimately resulting in a paper on how to deal with congestion. If a cell had to be dropped, drop the rest of the cells in that packet. Furthermore, deliberately dropping packets early as buffering approached capacity was even better, and brought ATM links up to the same efficiency for TCP transport as native packet links. Allyn was very generous with her time in explaining the issues and how to solve them, both in ATM congestion control and in a number of other aspects of making a network stable.
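The two policies described above, usually called partial packet discard and early packet discard, can be sketched as a simple queue. This is my own simplified illustration rather than code from the paper: the class name, the per-VC bookkeeping, and the threshold parameters are all assumptions, and it drops the final cell of a discarded packet along with the rest.

```python
class CellQueue(object):
    """Toy output queue applying partial and early packet discard."""

    def __init__(self, capacity, early_threshold):
        self.capacity = capacity                # hard limit on queued cells
        self.early_threshold = early_threshold  # start refusing new packets here
        self.queue = []
        self.dropping = set()     # VCs whose current packet is being discarded
        self.in_progress = set()  # VCs with a partly accepted packet

    def enqueue(self, vc, last_cell):
        """Offer one cell; return True if queued, False if dropped."""
        if vc in self.dropping:
            accepted = False  # rest of an already damaged packet
        elif vc not in self.in_progress and len(self.queue) >= self.early_threshold:
            self.dropping.add(vc)  # early discard: refuse the whole new packet
            accepted = False
        elif len(self.queue) >= self.capacity:
            self.dropping.add(vc)  # partial discard: overflow in mid-packet
            self.in_progress.discard(vc)
            accepted = False
        else:
            self.queue.append((vc, last_cell))
            self.in_progress.add(vc)
            accepted = True
        if last_cell:
            # Packet boundary: the next cell on this VC starts fresh.
            self.dropping.discard(vc)
            self.in_progress.discard(vc)
        return accepted

    def dequeue(self):
        """Remove and return the oldest queued cell, or None if empty."""
        return self.queue.pop(0) if self.queue else None
```

The point of the early threshold is that the buffer space above it is reserved for finishing packets already in flight, so fewer of them end up as corpses.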

ATM also had a very complex signaling stack for setting up connections, so complex that many ATM deployments simply gave up and permanently configured circuits everywhere they needed to go. PVCs only work up to a point: the network size is constrained by the number of available circuits. Renee Danson Sommerfeld took on the task of writing a Q.2931 signaling stack for Solaris, requiring painstaking care with specifications and interoperability testing. Sun's ATM products were never reliant on PVCs to operate; they could set up switched circuits on demand and close them when no longer needed.

In this industry we tend to celebrate engineers who spend massive effort putting out fires. What I learned from Allyn, Sally, and Renee is that the truly great engineers see the fire coming, and keep it from spreading in the first place.

Update: Dan McDonald worked at Sun in the same timeframe, and posted his own recollections of working with Allyn, Sally, and Renee. As he put it on Google+, "Good choices for people, poor choice for technology." (i.e. ATM Considered Harmful).

Wednesday, October 5, 2011

Non Uniform Network Access

Four CPUs in a ring, with RAM attached to each.

Non Uniform Memory Access is common in modern x86 servers. RAM is connected to each CPU, which connect to each other. Any CPU can access any location in RAM, but will incur additional latency if there are multiple hops along the way. This is the non-uniform part: some portions of memory take longer to access than others.

Yet the NUMA we use today is NUMA in the small. In the 1990s NUMA aimed to make very, very large systems commonplace. There were many levels of bridging, each adding yet more latency. RAM attached to the local CPU was fast, RAM for other CPUs on the same board was somewhat slower. RAM on boards in the same local grouping took longer still, while RAM on the other side of the chassis took forever. Nonetheless this was considered to be a huge advancement in system design because it allowed the software to access vast amounts of memory in the system with a uniform programming interface... except for performance.

Operating system schedulers which had previously run any task on any available CPU would randomly exhibit extremely bad behavior: a process running on distant combinations of CPU and RAM would run an order of magnitude slower. NUMA meant that all RAM was equal, but some was more equal than others. Operating systems desperately added notions of RAM affinity to go along with CPU and cache affinity, but reliably good performance was difficult to achieve.

As an industry we concluded that NUMA in moderation is good, but too much NUMA is bad. Those enormous NUMA systems have mostly lost out to smaller servers clustered together, where each server uses a bit of NUMA to improve its own scalability. The big jump in latency to get to another server is accompanied by a change in API, to use the network instead of memory pointers.


A Segue to Web Applications

Tuning knobs for CPU, Memory, Network.

Modern web applications can make tradeoffs between CPU utilization, memory footprint, and network bandwidth. Increase the amount of memory available for caching, and reduce the CPU required to recalculate results. Shard the data across more nodes to reduce the memory footprint on each at the cost of increasing network bandwidth. In many cases these tradeoffs don't need to be baked deep in the application, they can be tweaked via relatively simple changes. They can be adjusted to tune the application for RAM size, or for the availability of network bandwidth.


Further Segue To Overlay Networks

There is a lot of effort being put into overlay networks for virtualized datacenters, to create an L2 network atop an L3 infrastructure. This allows the infrastructure to run as an L3 network, which we are pretty good at scaling and managing, while the service provided to the VMs behaves as an L2 network.

Yet once the packets are carried in IP tunnels they can, through the magic of routing, be carried across a WAN to another facility. The datacenter network can be transparently extended to include resources in several locations. Transparently, except for performance. The round trip time across a WAN will inevitably be longer than the LAN; the speed of light demands it. Even for geographically close facilities the bandwidth available over a WAN will be far less than the bandwidth available within a datacenter, perhaps orders of magnitude less. Application tuning parameters set based on the performance within a single datacenter will be horribly wrong across the WAN.

I've no doubt that people will do it anyway. We will see L2 overlay networks being carried across VPNs to link datacenters together transparently (except for performance). Like the OS schedulers suddenly finding themselves in a NUMA world, software infrastructure within the datacenter will find itself in a network where some links are more equal than others. As an industry, we'll spend a couple years figuring out whether that was a good idea or not.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, October 2, 2011

NVGRE Musings

It is an interesting time to be involved in datacenter networking. There have been announcements recently of two competing proposals for running virtual L2 networks as an overlay atop an underlying IP network, VXLAN and NVGRE. Supporting an L2 service is important for virtualized servers, which need to be able to move from one physical server to another without changing their IP address or interrupting the services they provide. Having written about VXLAN in a series of three posts, now it is time for NVGRE. Ivan Pepelnjak has already posted about it on IOShints, which I recommend reading.

NVGRE encapsulates L2 frames inside tunnels to carry them across an L3 network. As its name implies, it uses GRE tunneling. GRE has been around for a very long time, and is well supported by networking gear and analysis tools. An NVGRE Endpoint uses the Key field in the GRE header to hold the Tenant Network Identifier (TNI), a 24 bit space of virtual LANs.
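As a rough illustration of what carrying the TNI in the GRE Key field looks like on the wire, here is a sketch that packs and unpacks a GRE header with the Key Present bit set. Several details are my assumptions rather than quotes from the draft: the protocol type 0x6558 (Transparent Ethernet Bridging) and the placement of the 24 bit identifier in the upper bits of the 32 bit Key, with the low 8 bits left reserved, follow the layout later published as RFC 7637 and may not match this early draft exactly. The function names are made up.

```python
import struct

GRE_KEY_PRESENT = 0x2000   # K bit in the first 16 bits of the GRE header
ETHERTYPE_TEB = 0x6558     # Transparent Ethernet Bridging (assumed protocol type)

def pack_nvgre_header(tni, reserved=0):
    """Build an 8 byte GRE header carrying a 24 bit tenant identifier."""
    if not 0 <= tni < (1 << 24):
        raise ValueError("TNI must fit in 24 bits")
    # Assumed layout: TNI in the upper 24 bits of the Key, low 8 bits reserved.
    key = (tni << 8) | (reserved & 0xFF)
    return struct.pack("!HHI", GRE_KEY_PRESENT, ETHERTYPE_TEB, key)

def unpack_tni(header):
    """Extract the tenant identifier from a GRE header built as above."""
    flags, proto, key = struct.unpack("!HHI", header[:8])
    if not flags & GRE_KEY_PRESENT:
        raise ValueError("GRE header has no Key field")
    return key >> 8
```

A 24 bit identifier gives about 16 million virtual networks, compared with 4094 usable VLAN IDs.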

Outer MAC, Outer IP, GRE, Inner MAC, Inner Payload, Outer FCS.

The encapsulated packet has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC; one is added by a physical NIC when the packet leaves the server. As the NVGRE Endpoint is likely to be a software component within the server, prior to hitting any NIC, the frames have no CRC. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

The NVGRE draft refers to IP addresses in the outer header as Provider Addresses, and the inner header as Customer Addresses. NVGRE can optionally also use an IP multicast group for each TNI to distribute L2 broadcast and multicast packets.


Not Quite Done

As befits its "draft" designation, a number of details in the NVGRE proposal are left to be determined in future iterations. One largish bit left unspecified is mapping of Customer Addresses to Provider. When an NVGRE Endpoint needs to send a packet to a remote VM, it must know the address of the remote NVGRE Endpoint. The mechanism to maintain this mapping is not yet defined, though it will be provisioned by a control function communicating with the Hypervisors and switches.


Optional Multicast?

The NVGRE draft calls out broadcast and multicast support as being optional, only if the network operator chooses to support it. To operate as a virtual Ethernet network a few broadcast protocols are essential, like ARP and IPv6 ND. Presumably if broadcast is not available, the NVGRE Endpoint would respond to these requests to its local VMs.

Yet I don't see how that can work in all cases. The NVGRE control plane can certainly know the Provider Address of all NVGRE Endpoints. It can know the MAC address of all guest VMs within the tenant network, because the Hypervisor provides the MAC address as part of the virtual hardware platform. There are notable exceptions where guest VMs use VRRP, or make up locally administered MAC addresses, but I'll ignore those for now.

I don't see how an NVGRE Endpoint can know all Customer IP Addresses. One of two things would have to happen:

  • Require all customer VMs to obtain their IP from the provider. Even backend systems using private, internal addresses would have to get them from the datacenter operator so that NVGRE can know where they are.
  • Implement a distributed learning function where NVGRE Endpoints watch for new IP addresses sent by their VMs and report them to all other Endpoints.

The current draft of NVGRE makes no mention of either such function, so we'll have to watch for future developments.

The earlier VL2 network also did not require multicast and handled ARP via a network-wide directory service. Many VL2 concepts made their way into NVGRE. So far as I understand it, VL2 assigned all IP addresses to VMs and could know where they were in the network.



Load balancing across four links between switches.

An important topic for tunneling protocols is multipathing. When multiple paths are available to a destination, either LACP at L2 or ECMP at L3, the switches have to choose which link to use. It is important that packets on the same flow stay in order, as protocols like TCP use excessive reordering as an indication of congestion. Switches hash packet headers to select a link, so packets with the same headers will always choose the same link.

Tunneling protocols have issues with this type of hashing: all packets in the tunnel have the same header. This limits them to a single link, and congests that one link for other traffic. Some switch chips implement extra support for common tunnels like GRE, to include the Inner header in the hash computation. NVGRE would benefit greatly from this support. Unfortunately, it is not universal amongst modern switches.

Choosing Provider Address by hashing the Inner headers.

The NVGRE draft proposes that each NVGRE Endpoint have multiple Provider Addresses. The Endpoints can choose one of several source and destination IP addresses in the encapsulating IP header, to provide variance to spread load across LACP and ECMP links. The draft says that when the Endpoint has multiple PAs, each Customer Address will be provisioned to use one of them. In practice I suspect it would be better were the NVGRE Endpoint to hash the Inner headers to choose addresses, and distribute the load for each Customer Address across all links.

Using multiple IP addresses for load balancing is clever, but I can't easily predict how well it will work. The number of different flows the switches see will be relatively small. For example if each endpoint has four addresses, the total number of different header combinations between any two endpoints is sixteen. This is sixteen times better than having a single address each, but it is still not a lot. Unbalanced link utilization seems quite possible.
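A quick way to get a feel for this is to hash all sixteen address pairs and count how many land on each link. The sketch below uses CRC32 of the outer source and destination addresses as a stand-in for the switch hash, which is purely my assumption: real switch chips use their own hash functions and usually more fields. The addresses and function names are also invented for the example.

```python
import itertools
import zlib
from collections import Counter

def pick_link(src_ip, dst_ip, num_links):
    """Choose an egress link by hashing the outer IP addresses (stand-in hash)."""
    key = ("%s>%s" % (src_ip, dst_ip)).encode("ascii")
    return zlib.crc32(key) % num_links

def link_usage(src_addrs, dst_addrs, num_links):
    """Count how many source/destination address pairs map to each link."""
    usage = Counter()
    for src, dst in itertools.product(src_addrs, dst_addrs):
        usage[pick_link(src, dst, num_links)] += 1
    return usage

if __name__ == "__main__":
    srcs = ["10.0.0.%d" % i for i in range(1, 5)]
    dsts = ["10.0.1.%d" % i for i in range(1, 5)]
    for link, count in sorted(link_usage(srcs, dsts, 4).items()):
        print("link %d carries %d of 16 address pairs" % (link, count))
```

With only sixteen inputs there is no particular reason to expect four per link, and a link with few or no pairs mapped to it carries correspondingly little of the tunnel traffic.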


Aside: Deliberate Multipathing

One LACP group feeding in to the next.

The relatively limited variance in headers leads to an obvious next step: ensure the traffic will be balanced by predicting what the switch will do, and choose Provider IP addresses to optimize and ensure it is well balanced. In networking today we tend to solve problems by making the edges smarter.

The NVGRE draft says that selection of a Provider Address is provisioned to the Endpoint. Each Customer Address will be associated with exactly one Provider Address to use. I suspect that selection of Provider Addresses is expected to be done via an optimization mechanism like this, but I'm definitely speculating.

I'd caution that this is harder than it sounds. Switches use the ingress port as part of the hash calculation. That is, the same packet arriving on a different ingress port will choose a different egress link within the LACP/ECMP group. To predict behavior one needs a complete wiring diagram of the network. In the rather common case where several LACP/ECMP groups are traversed along the way to a destination, the link selected by each previous switch influences the hash computation of the next.


Misc Notes

  • The NVGRE draft mentions keeping an MTU state per Endpoint, to avoid fragmentation. Details will be described in future drafts. NVGRE certainly benefits from a datacenter network with a larger MTU, but will not require it.
  • VXLAN describes its overlay network as existing within a datacenter. NVGRE explicitly calls for spanning across wide area networks via VPNs, for example to connect a corporate datacenter to additional resources in a cloud provider. I'll have to cover this aspect in another post, as this post is too long already.



It's quite difficult to draw a conclusion about NVGRE, as so much is still unspecified. There are two relatively crucial mapping functions which have yet to be described:

  • When a VM wants to contact a remote Customer IP and sends an ARP Request, in the absence of multicast, how can the matching MAC address be known?
  • When the NVGRE Endpoint is handed a frame destined to a remote Customer MAC, how does it find the Provider Address of the remote Endpoint?

So we'll wait and see.


Monday, September 26, 2011

Peer Review Terminology

Many companies now include peer feedback as part of the annual review process. Each employee nominates several of their colleagues to write a review of their work from the previous year. If you are new to this process, here is some terminology which may be useful.

[peer-ree-heel-yun] (noun)
The point at which 50% of requested peer reviews are complete.
[peer-swey-zhuhn] (noun)
A particularly glowing peer review.
[peer-juh-ree] (noun)
An astonishingly glowing peer review.
[peer-plekst] (adjective)
What exactly did they work on, anyway?
[peer-rash-uh-nl] (adjective)
Why am I reviewing this person?
[peer-guh-tohr-ee] (noun)
The set of peer reviews which may have to be declined due to lack of time.
[peer-fek-shuh-nist] (noun)
I spent a long time obsessing over wording.
[peer-ee-od-ik] (adjective)
Maybe I'll just copy some of what I wrote last year.
[peer-ree-tay-shun] (noun)
Unreasonable hostility felt toward the subject of the last peer review left to be written.
[peer-seh-kyoot] (verb)
How nice, everyone asked for my review.
peersona non grata
[peer-soh-nah nohn grah-tah] (noun)
Nobody asked for my review?
[peer-reg-gyu-ler] (adjective)
An incomplete peer review, submitted anyway, just before the deadline.

Sunday, September 18, 2011

VXLAN Conclusion

This is the third and final article in a series about VXLAN. I recommend reading the first and second articles before this one. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.


Foreign Gateways

Though I've consistently described VXLAN communications as occurring between VMs, many datacenters have a mix of virtual servers with single-instance physical servers. Something has to provide the VTEP function for all nodes on the network, but it doesn't have to be the server itself. A Gateway function can bridge to physical L2 networks, and with representatives of several switch companies as authors of the RFC this seems likely to materialize within the networking gear itself. The Gateway can also be provided by a server sitting within the same L2 domain as the servers it handles.

Gateway to communicate with other physical servers on an L2 segment.

Even if the datacenter consists entirely of VMs, a Gateway function is still needed in the switch. To communicate with the Internet (or anything else outside of their subnet) the VMs will ARP for their next hop router. This router has to have a VTEP.


Transition Strategy

Mixture of VTEP-enabled servers with non-VTEP servers requires a gateway function somewhere.

I'm tempted to say there isn't a transition strategy. That's a bit too harsh, in that the Gateway function just described can serve as a proxy, but it's not far from the mark. As described in the RFC, the VTEP assumes that all destination L2 addresses will be served by a remote VTEP somewhere. If the VTEP doesn't know the L3 address of the remote node to send to, it floods the packet to all VTEPs using multicast. There is no provision for direct L2 communication to nodes which have no VTEP. It is assumed that an existing installation of VMs on a VLAN will be taken out of service, and all nodes reconfigured to use VXLAN. VLANs can be converted individually, but there is no provision for operation with a mixed set of VTEP-enabled and non-VTEP-enabled nodes on an existing VLAN.

For an existing datacenter which desires to avoid scheduling downtime for an entire VLAN, one transition strategy would use a VTEP Gateway as the first step. When the first server is upgraded to use VXLAN and have its own VTEP, all of its packets to other servers will go through this VTEP Gateway. As additional servers are upgraded they will begin communicating directly between VTEPs, and rely on the Gateway to maintain communication with the rest of their subnet.

Where would the Gateway function go? During the transition, which could be lengthy, the Gateway VTEP will be absolutely essential for operation. It shouldn't be a single point of failure, and this should trigger the network engineer's spidey sense about adding a new critical piece of infrastructure. It will need to be monitored, people will need to be trained in what to do if it fails, etc. Therefore it seems far more likely that customers will choose to upgrade their switches to include the VTEP Gateway function, so as not to add a new critical bit of infrastructure.


Controller to the Rescue?

What makes this transition strategy difficult to accept is that VMs have to be configured to be part of a VXLAN. They have to be assigned to a particular VNI, and that VNI has to be given an IP multicast address to use for flooding. Therefore something, somewhere knows the complete list of VMs which should be part of the VXLAN. In Rumsfeldian terms, there are only known unknown addresses and no unknown unknowns. That is, the VTEP can know the complete list of destination MAC addresses it is supposed to be able to reach via VXLAN. The only unknown is the L3 address of the remote VTEP. If the VTEP encounters a destination MAC address which it doesn't know about, it doesn't have to assume it is attached to a VTEP somewhere. It could know that some MAC addresses are reached directly, without VXLAN encapsulation.

The previous article in this series brought up the reliance on multicast for learning as an issue, and suggested that a VXLAN controller would be an important product to offer. That controller could also provide a better transition strategy, allowing VTEPs to know that some L2 addresses should be sent directly to the wire without a VXLAN tunnel. This doesn't make the controller part of the dataplane: it is only involved when stations are added or removed from the VXLAN. During normal forwarding, the controller is not involved.
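
A minimal sketch of the forwarding decision such a controller would enable. Everything named here (`Path`, `choose_path`, the `direct_macs` set pushed down by the controller) is hypothetical and not part of the VXLAN proposal; it only illustrates the idea that a VTEP with a complete list of stations can pick between three outcomes instead of two.

```python
from enum import Enum
from typing import Dict, Optional, Set, Tuple


class Path(Enum):
    TUNNEL = "tunnel"   # encapsulate and unicast to a known remote VTEP
    DIRECT = "direct"   # send on the wire with no VXLAN encapsulation
    FLOOD = "flood"     # encapsulate and send to the VNI's multicast group


def choose_path(
    dst_mac: str,
    vtep_by_mac: Dict[str, str],
    direct_macs: Set[str],
    multicast_group: str,
) -> Tuple[Path, Optional[str]]:
    """Return the path and the outer destination IP (None when sent directly)."""
    # Assumption: the controller has told us which stations have no VTEP.
    if dst_mac in direct_macs:
        return (Path.DIRECT, None)
    remote_vtep = vtep_by_mac.get(dst_mac)
    if remote_vtep is not None:
        return (Path.TUNNEL, remote_vtep)
    # A VXLAN member whose VTEP address is not yet known: the "known unknown".
    return (Path.FLOOD, multicast_group)
```

The controller only has to update `direct_macs` and `vtep_by_mac` when stations join or leave, so it stays out of the per-packet path.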

It is safe to say that the transition strategy for existing, brownfield datacenter networks is the part of the VXLAN proposal which I like the least.


Other miscellaneous notes

VXLAN prepends 50 bytes of headers to the original packet: a 14 byte outer Ethernet header, 20 byte IPv4 header, 8 byte UDP header, and 8 byte VXLAN header. To avoid IP fragmentation the L3 network needs to handle a slightly larger frame size than standard Ethernet. Support for Jumbo frames is almost universal in networking gear at this point, so this should not be an issue.
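
As a rough check on the frame size the L3 network must carry, the encapsulation headers can be tallied. This sketch assumes an untagged outer Ethernet header and an outer IPv4 header with no options; the sizes are the standard ones for each protocol, and the function names are my own.

```python
# Header sizes in bytes. Assumptions: no outer VLAN tag, IPv4 without options.
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14


def prepended_bytes() -> int:
    """Bytes added in front of the original Ethernet frame."""
    return OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


def required_underlay_ip_mtu(inner_ip_mtu: int = 1500) -> int:
    """IP MTU the L3 network needs so the inner packet is never fragmented."""
    # The outer IP packet carries UDP + VXLAN + the whole inner Ethernet frame.
    return OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET + inner_ip_mtu


print(prepended_bytes())            # 50
print(required_underlay_ip_mtu())   # 1550
```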

There is only a single multicast group per VNI. All broadcast and multicast frames in that VXLAN will be sent to that one IP multicast group and delivered to all VTEPs. The VTEP would likely run an IGMP Snooping function locally to determine whether to deliver multicast frames to its VMs. VXLAN as currently defined can't prune the delivery tree: all VTEPs must receive all frames. It would be nice to be able to prune delivery within the network, and not deliver to VTEPs which have no subscribing VMs. This would require multiple IP multicast groups per VNI, which would complicate the proposal.



I like the VXLAN proposal. I view the trend toward having enormous L2 networks in datacenters as disturbing, and see VXLAN as a way to give the VMs the network they want without tying it to the underlying physical infrastructure. It virtualizes the network to meet the needs of the virtual servers.

After beginning to publish these articles on VXLAN I became aware of another proposal, NVGRE. There appear to be some similarities, including the use of IP multicast to fan out L2 broadcast/multicast frames, and the two proposals even share an author. NVGRE uses GRE encapsulation instead of the UDP+VXLAN header, with multiple L2 addresses to provide load balancing across LACP/ECMP links. It will take a while to digest, but I expect to write some thoughts about NVGRE in the future.

Many thanks to Ken Duda, whose patient explanations of VXLAN on Google+ made this writeup possible.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Saturday, September 17, 2011

VXLAN Part Deux

This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

I strongly recommend reading the first article before this one, to provide background.

UDP Encapsulation

In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.

Outer MAC, Outer IP, UDP, VXLAN, Inner MAC, Inner Payload, Outer FCS.
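
For concreteness, here is a sketch of packing the 8 byte VXLAN header itself. The layout follows the VXLAN draft as I read it: 8 flag bits with the "VNI valid" flag set, 24 reserved bits, the 24 bit VNI, and 8 more reserved bits. The function and constant names are my own.

```python
import struct

# Flag bit indicating the VNI field is valid (the "I" flag in the draft).
VXLAN_FLAG_VNI_VALID = 0x08


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8 byte VXLAN header for a given VXLAN Network Identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits left as zero.
    # Word 2: VNI in the top 24 bits, 8 reserved bits left as zero.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
```

The outer MAC, IP, and UDP headers would be added in front of this by the VTEP's normal network stack.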

Four paths between routers, hashing headers chooses one.

The UDP header serves an interesting purpose: it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in how it computes the hash, using only the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.

Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.

VTEP calculates hash of inner packet headers, places it in the UDP source port.
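
A minimal sketch of that calculation. The choice of CRC32 and of hashing only the 14 byte inner Ethernet header are assumptions made for brevity; a real VTEP would likely include the inner L3/L4 headers too, and the proposal does not mandate a particular hash.

```python
import zlib

INNER_ETHERNET_HEADER_LEN = 14  # destination MAC, source MAC, EtherType


def udp_source_port(inner_frame: bytes) -> int:
    """Derive a 16 bit UDP source port from the inner frame's L2 header."""
    if len(inner_frame) < INNER_ETHERNET_HEADER_LEN:
        raise ValueError("frame too short to contain an Ethernet header")
    flow_key = inner_frame[:INNER_ETHERNET_HEADER_LEN]
    # Fold the 32 bit CRC down to the 16 bits available in the port field.
    return zlib.crc32(flow_key) & 0xFFFF
```

Frames from the same flow always produce the same port, so they stay on one path and arrive in order, while different flows tend to spread across the available links.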

New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing; it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well-known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN the firewall question becomes more interesting. I don't think it's a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.


VTEP Learning

VTEP Table of MAC:OuterIP mappings.

The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.
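
The learning step can be sketched in a few lines. The class and method names are hypothetical; the point is only that every received tunnel packet teaches the VTEP one MAC:OuterIP pair, and a later packet from a different VTEP overwrites it.

```python
from typing import Dict, Optional


class VtepTable:
    """MAC:OuterIP mappings for one VNI, populated by source learning."""

    def __init__(self) -> None:
        self._outer_ip_by_mac: Dict[str, str] = {}

    def learn(self, inner_src_mac: str, outer_src_ip: str) -> None:
        # Called for every packet received from the tunnel, unicast or multicast.
        self._outer_ip_by_mac[inner_src_mac] = outer_src_ip

    def lookup(self, inner_dst_mac: str) -> Optional[str]:
        # None means the remote VTEP is unknown and the frame must be flooded.
        return self._outer_ip_by_mac.get(inner_dst_mac)


table = VtepTable()
# A broadcast ARP from VM A arrives via the multicast group from VTEP 10.0.0.1.
table.learn("00:00:00:00:00:0a", "10.0.0.1")
# The ARP reply can now be tunneled as unicast instead of flooded.
print(table.lookup("00:00:00:00:00:0a"))  # 10.0.0.1
```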

When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.

VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. It's possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.
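
To make the failure mode concrete, here is a sketch of a table with entry aging. The 300 second timeout and all names are assumptions for illustration; the proposal does not specify an aging interval. Time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Optional, Tuple


class AgingVtepTable:
    """MAC:OuterIP table whose entries expire if not refreshed."""

    def __init__(self, timeout_seconds: float = 300.0) -> None:
        self.timeout_seconds = timeout_seconds
        # MAC -> (outer IP of the VTEP, time the mapping was last learned)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def learn(self, mac: str, outer_ip: str, now: float) -> None:
        self._entries[mac] = (outer_ip, now)

    def lookup(self, mac: str, now: float) -> Optional[str]:
        entry = self._entries.get(mac)
        if entry is None:
            return None
        outer_ip, last_seen = entry
        if now - last_seen > self.timeout_seconds:
            # Entry aged out: forget it so the caller floods to the multicast group.
            del self._entries[mac]
            return None
        return outer_ip


table = AgingVtepTable()
table.learn("00:00:00:00:00:0a", "10.0.0.1", now=0.0)
# The VM silently moves to a server behind 10.0.0.2 at t=10 and sends nothing
# that this VTEP sees. Lookups keep returning the old, now incorrect, VTEP.
print(table.lookup("00:00:00:00:00:0a", now=100.0))  # 10.0.0.1
# Only after the timeout does the VTEP fall back to flooding.
print(table.lookup("00:00:00:00:00:0a", now=301.0))  # None
```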

Multicast frame delivered to 3 VTEPs but dropped before reaching one.

Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.



On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.

  • If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low-numbered port with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, and it makes VXLAN packets look like they come from the ephemeral ports used by UDP client applications.
  • When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already has to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.
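
Both suggestions are small enough to sketch. These functions are hypothetical illustrations of the two bullets above, not anything defined by the VXLAN specification.

```python
from typing import Set


def ephemeral_source_port(flow_hash: int) -> int:
    """Keep 15 bits of the hash and force the top bit: result is 32768..65535."""
    return 0x8000 | (flow_hash & 0x7FFF)


def should_flood(src_mac: str, known_local_macs: Set[str], dst_vtep_known: bool) -> bool:
    """Flood when the destination VTEP is unknown or the local source MAC is new."""
    is_new_source = src_mac not in known_local_macs
    known_local_macs.add(src_mac)  # local source learning
    return is_new_source or not dst_vtep_known


print(ephemeral_source_port(53))  # 32821, no longer mistaken for DNS
```

Only the first frame from a newly seen local MAC is flooded; later frames to known destinations go back to unicast tunneling.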

Next article: A few final VXLAN topics.
