Coding Relic: Ethernet

Showing posts with label Ethernet. Show all posts

Friday, January 13, 2012

Merchant Silicon

Last week Greg Ferro wrote about the use of merchant silicon in networking products. I'd like to share some thoughts on the topic.

Chip cost

Fistful of dollars We know the story by now: by selling chips into products from multiple networking companies, commodity chips sell in large volume and benefit from larger discounts. This is a compelling factor in the low end of the switching market, where margins are thin and a primary selling point is price.

Yet low price of the switch silicon is not a decisive factor in the midrange and high end of the switching market, where products are more featureful and sell at higher prices. The price of those products is not based on the cost of materials, its based on what the market will bear. The market has traditionally borne a lot: Cisco's profit margins in these segments have been legendary for a decade.

In my experience, chip price was not a decisive factor in the wholesale move to merchant silicon.

Non Recurring Engineering (NRE cost)

Silicon chip Say it costs $10 million to design a chipset for a high end switch, and the resulting set of silicon costs $500 in the expected volumes. If that high end switch sells 10,000 units in its first year then the NRE cost for developing it amounts to $1,000 per unit, double the cost of buying the chip itself. The longer the model remains in production the more its cost can be amortized... but the company has to pay the complete cost to develop the silicon before the first unit is sold.

In the midrange and high end switch markets, the strongest pitch made by merchant semiconductor suppliers wasn't the per-chip cost. A stronger pitch for those segments was elimination of NRE. The networking company didn't have to bear the cost of chip development up front. The company did pay the cost of development, but it would be factored into the unit price and pay-as-you-go rather than upfront.

Yet even NRE savings wasn't usually enough to convince a networking company to give up its own ASIC designs. Most realized that to do so was to give up a substantial portion of their ability to differentiate their products. Several vendors adopted a hybrid approach. They used merchant silicon to provide the fabric and handle simple cases of packet forwarding, and configured flow rules to steer interesting packets out to a separate device for additional handling. Costs were reduced by only having to design that specialized chip for a subset of the total traffic through the box, but they retained an ability to differentiate features.

In my experience, eliminating the burden of NRE was not a decisive factor in the move to merchant silicon.

Schedule

Gantt chart The merchant silicon vendors of the world can dedicate more ASIC engineers to their projects. This isn't as big a win as it sounds: tripling the size of the design team does not result in a chip with 3x the features or in 1/3rd the time. As with software projects (see The Mythical Man Month), the increasing coordination overhead of a larger team results in steeply diminishing returns.

Instead, merchant silicon vendors have the luxury of working on multiple projects in parallel. They can have two teams leapfrogging each other, each working on a multiyear timeline and introducing their products in interleaving years. Alternately, they can target different chips at different market segments. They rely on their SDK to hide gratuitous differences which they happened to introduce, and only make their customers deal with the truly differentiating features of the different chips.

It is difficult to make a case to spend two years to develop custom silicon for a product when merchant silicon with sufficient features is expected to be available a year earlier. Merchant silicon suppliers share details of their roadmap very early, even before the feature set is finalized. This lets them incorporate feedback into the features for the final product, but they also do it to derail in-house silicon efforts.

Yet in my experience at least, though schedule is a decisive factor, this isn't the full story.

Misaligned Incentives

When leading a chip development effort, the biggest fear is not that the chip will have bugs. Many ASIC bugs can be worked around in software.

The biggest fear is not that the chip will be larger and more costly than planned. That is a negotiation issue with the silicon fab, or a business issue in positioning the product.

The biggest fear is that the chip will be late. Missing the market window is the worst kind of failure for an ASIC. The design team produces a chip which meets all requirements, but comes at a time when the market no longer cares. The tardy product will face significant pricing pressure on the day of introduction, more so the longer competitive products have been available.

The technical leadership of an internal ASIC project is therefore incented to plan a schedule which they are sure they can meet. They'll use realistic timelines for the different phases of the product, and include sufficient padding to handle unexpected problems. They will produce best case timelines as well, but those tend to be discounted by the project leadership as unrealistic.

The technical leadership inside merchant semiconductor companies face the same issues, and produce the same sort of schedule which they are confident they can meet. The difference is, that conservative schedule is not handed out to the decision makers at the customer networking vendors. A more optimistic schedule is maintained and presented to customers - not rosy best case, but certainly optimistic. Everybody knows that schedule will slip, even the customers themselves... but nonetheless it works. Customers work from the optimistic schedule because that is all they have. It increases the difference in schedule between in-house and merchant options by several quarters.

The Point of No Return

ASIC design requires some rather specialized skill sets. There is a great deal of similarity between chip design and software design, but not so much that one can switch freely back and forth. If there is not an active chip development effort underway, the ASIC team tends to run out of interesting things to work on.

When a company begins seriously contemplating building their high end products using merchant silicon, even if the management tries to keep it low key, it becomes pretty obvious internally. You have to pull in senior technical folks from the software, hardware, and ASIC teams to help with the evaluation. News spreads. Gossip spreads faster. If the ASIC team becomes convinced that there will be no further chip projects, they start to move on.

It can easily become a self-fulfilling prophecy: serious consideration of a move to merchant silicon leads to loss of the capability to develop custom ASICs.

Why it Matters

We talk a lot about Software Defined Networking. The term, consciously or not, tends to make people think the networking is all in software and the hardware is insignificant. That isn't actually true, as the SDN can only utilize actions which the hardware can actually do, but it illustrates how much less we value we put in hardware now.

In the context of SDN, reducing switch hardware diversity is actually a good thing. It results in a more uniform set of capabilities in networks, and a smaller set of cases for the SDN controller to have to handle. Networking used to be dominated by the hardware designs, but it has moved on now. I think that is a good thing.

Wednesday, December 7, 2011

Requiem for Jumbo Frames

This weekend Greg Ferro published an article about jumbo frames. He points to recent measurements showing no real benefit with large frames. Some years ago I worked on NIC designs, and at the time we talked about Jumbo frames a lot. It was always a tradeoff: improve performance by sacrificing compatibility, or live with the performance until hardware designs could make the 1500 byte MTU be as efficient as jumbo frames. The latter school of thought won out, and they delivered on it. Jumbo frames no longer offer a significant performance advantage.

Roughly speaking, software overhead for a networking protocol stack can be divided into two chunks:

Per-byte which increases with each byte of data sent. Data copies, encryption, checksums, etc make up this kind of overhead.
Per-packet which increases with each packet regardless of how big the packet is. Interrupts, socket buffer manipulation, protocol control block lookups, and context switches are examples of this kind of overhead.

Wayback machine to 1992

I'm going to talk about the evolution of operating systems and NICs starting from the 1990s, but will focus on Unix systems. DOS and MacOS 6.x were far more common back then, but modern operating systems evolved more similarly to Unix than to those environments.

Address spaces in user space, kernel, and NIC hardware Lets consider a typical processing path for sending a packet in a Unix system in the early 1990s:

Application calls write(). System copies a chunk of data into the kernel, to mbufs/mblks/etc.
Kernel buffers handed to TCP/IP stack, which looks up the protocol control block (PCB) for the socket.
Stack calculates a TCP checksum and populates the TCP, IP, and Ethernet headers.
Ethernet driver copies kernel buffers out to the hardware. Programmed I/O using the CPU to copy was quite common in 1992.
Hardware interrupts when the transmission is complete, allowing the driver to send another packet.

Altogether the data was copied two and a half times: from user space to kernel, from kernel to NIC, plus a pass over the data to calculate the TCP checksum. There were additionally per packet overheads in looking up the PCB, populating headers, and handling interrupts.

The receive path was similar, with a NIC interrupt kicking off processing of each packet and two and a half copies up to the receiving application. There was more per-packet overhead for receive: where transmit could look up the PCB once and process a sizable chunk of data from the application in one swoop, RX always gets one packet at a time.

Jumbo frames were a performance advantage in this timeframe, but not a huge one. Larger frames reduced the per-packet overhead, but the per-byte overheads were significant enough to dominate the performance numbers.

Wayback Machine to 1999

An early optimization was elimination of the separate pass over the data for the TCP checksum. It could be folded into one of the data copies, and NICs also quickly added hardware support. [Aside: the separate copy and checksum passes in 4.4BSD allowed years of academic papers to be written, putting whatever cruft they liked into the protocol, yet still portraying it as a performance improvement by incidentally folding the checksum into a copy.] NICs also evolved to be DMA devices; the memory subsystem still had to bear the overhead of the copy to hardware, but the CPU load was alleviated. Finally, operating systems got smarter about leaving gaps for headers when copying data into the kernel, eliminating a bunch of memory allocation overhead to hold the TCP/IP/Ethernet headers.

Packet size vs throughput in 2000, 2.5x for 9180 byte vs 1500 I have data on packet size versus throughput in this timeframe, collected in the last months of 2000. It was gathered for a presentation at LCN 2000. It used an OC-12 ATM interface, where LAN emulation allowed MTUs up to 18 KBytes. I had to find an old system to run these, the modern systems of the time could almost max out the OC-12 link with 1500 byte packets. I recall it being a Sparcstation-20. The ATM NIC supported TCP checksums in hardware and used DMA.

Roughly the year 1999 was the peak of when jumbo frames would have been most beneficial. Considerable work had been done by that point to reduce per-byte overheads, eliminating the separate checksumming pass and offloading data movement from the CPU. Some work had been done to reduce the per-packet overhead, but not as much. After 1999 additional hardware focussed on reducing the per-packet overhead, and jumbo frames gradually became less of a win.

LSO/LRO

Protocol stack handing a chunk of data to NIC Large Segment Offload (LSO), referred to as TCP Segmentation Offload (TSO) in Linux circles, is a technique to copy a large chunk of data from the application process and hand it as-is to the NIC. The protocol stack generates a single set of Ethernet+TCP+IP header to use as a template, and the NIC handles the details of incrementing the sequence number and calculating fresh checksums for a new header prepended to each packet. Chunks of 32K and 64K are common, so the NIC transmits 21 or 42 TCP segments without further intervention from the protocol stack.

The interesting thing about LSO and Jumbo frames is that Jumbo frames no longer make a difference. The CPU only gets involved for every large chunk of data, the overhead is the same whether that chunk turns into 1500 byte or 9000 byte packets on the wire. The main impact of the frame size is the number of ACKs coming back, as most TCP implementations generate an ACK for every other frame. Transmitting jumbo frames would reduce the number of ACKs, but that kind of overhead is below the noise floor. We just don't care.

There is a similar technique for received packets called, imaginatively enough, Large Receive Offload (LRO). For LSO the NIC and protocol software are in control of when data is sent. For LRO, packets just arrive whenever they arrive. The NIC has to gather packets from each flow to hand up in a chunk. Its quite a bit more complex, and doesn't tend to work as well as LSO. As modern web application servers tend to send far more data than they receive, LSO has been of much greater importance than LRO.

Large Segment Offload mostly removed the justification for jumbo frames. Nonetheless support for larger frame sizes is almost universal in modern networking gear, and customers who were already using jumbo frames have generally carried on using them. Moderately larger frame support is also helpful for carriers who want to encapsulate customer traffic within their own headers. I expect hardware designs to continue to accommodate it.

TCP Calcification

There has been a big downside of pervasive use of LSO: it has become the immune response preventing changes in protocols. NIC designs vary widely in their implementation of the technique, and some of them are very rigid. Here "rigid" is a euphemism for "mostly crap." There are NICs which hard-code how to handle protocols as they existed in the early part of this century: Ethernet header, optional VLAN header, IPv4/IPv6, TCP. Add any new option, or any new header, and some portion of existing NICs will not cope with it. Making changes to existing protocols or adding new headers is vastly harder now, as changes are likely to throw the protocol back into the slow zone and render moot any of the benefits it brings.

It used to be that any new TCP extension had to carefully negotiate between sender and receiver in the SYN/SYN+ACK to make sure both sides would support an option. Nowadays due to LSO and to the pervasive use of middleboxes, we basically cannot add options to TCP at all.

I guess the moral is, "be careful what you wish for."

Monday, November 28, 2011

QFabric Followup

In August this site published a series of posts about the Juniper QFabric. Since then Juniper has released hardware documentation for the QFabric components, so its time for a followup.

QF edge Nodes, Interconnects, and Directors QFabric consists of Nodes at the edges wired to large Interconnect switches in the core. The whole collection is monitored and managed by out of band Directors. Juniper emphasizes that the QFabric should be thought of as a single distributed switch, not as a network of individual switches. The entire QFabric is managed as one entity.

Control header prepended to frame The fundamental distinction between QFabric and conventional switches is in the forwarding decision. In a conventional switch topology each layer of switching looks at the L2/L3 headers to figure out what to do. The edge switch sends the packet to the distribution switch, which examines the headers again before sending the packet on towards the core (which examines the headers again). QFabric does not work this way. QFabric functions much more like the collection of switch chips inside a modular chassis: the forwarding decision is made by the ingress switch and is conveyed through the rest of the fabric by prepending control headers. The Interconnect and egress Node forward the packet according to its control header, not via another set of L2/L3 lookups.

Node Groups

The Hardware Documentation describes two kinds of Node Groups, Server and Network, which gather multiple edge Nodes together for common purposes.

Server Node Groups are straightforward: normally the edge Nodes are independent, connecting servers and storage to the fabric. Pairs of edge switches can be configured as Server Node Groups for redundancy, allowing LAG groups to span the two switches.
Network Node Groups configure up to eight edge Nodes to interconnect with remote networks. Routing protocols like BGP or OSPF run on the Director systems, so the entire Group shares a common Routing Information Base and other data.

Why have Groups? Its somewhat easier to understand the purpose of the Network Node Group: routing processes have to be spun up on the Directors, and perhaps those processes have to point to some distinct entity to operate with. Why have Server Node Groups, though? Redundant server connections are certainly beneficial, but why require an additional fabric configuration to allow it?

Ingress fanout to four LAG member ports I don't know the answer, but I suspect it has to do with Link Aggregation (LAG). Server Node Groups allow a LAG to be configured using ports spanning the two Nodes. In a chassis switch, LAG is handled by the ingress chip. It looks up the destination address to find the destination port. Every chip knows the membership of all LAGs in the chassis. The ingress chip computes a hash of the packet to pick which LAG member port to send the packet to. This is how LAG member ports can be on different line cards, the ingress port sends it to the correct card.

Ingress fanout to four LAG member ports The downside of implementing LAG at ingress is that every chip has to know the membership of all LAGs in the system. Whenever a LAG member port goes down, all chips have to be updated to stop using it. With QFabric, where ingress chips are distributed across a network and the largest fabric could have thousands of server LAG connections, updating all of the Nodes whenever a link goes down could take a really long time. LAG failure is supposed to be quick, with minimal packet loss when a link fails. Therefore I wonder if Juniper has implemented LAG a bit differently, perhaps by handling member port selection in the Interconnect, in order to minimize the time to handle a member port failure.

I feel compelled to emphasize again: I'm making this up. I don't know how QFabric is implemented nor why Juniper made the choices they made. Its just fun to speculate.

Virtualized Junos

Regarding the Director software, the Hardware Documentation says, "[Director devices] run the Junos operating system (Junos OS) on top of a CentOS foundation." Now that is an interesting choice. Way, way back in the mists of time, Junos started from NetBSD as its base OS. NetBSD is still a viable project and runs on modern x86 machines, yet Juniper chose to hoist Junos atop a Linux base instead.

I suspect that in the intervening time, the Junos kernel and platform support diverged so far from NetBSD development that it became impractical to integrate recent work from the public project. Juniper would have faced a substantial effort to handle modern x86 hardware, and chose instead to virtualize the Junos kernel in a VM whose hardware was easier to support. I'll bet the CentOS on the Director is the host for a Xen hypervisor.

Update: in the comments, Brandon Bennett and Julien Goodwin both note that Junos used FreeBSD as its base OS, not NetBSD.

Aside: with network OSes developed in the last few years, companies have tended to put effort into keeping the code portable enough to run on a regular x86 server. The development, training, QA, and testing benefits of being able to run on a regular server are substantial. That means implementing a proper hardware abstraction layer to handle running on a platform which doesn't have the fancy switching silicon. In the 1990s when Junos started, running on x86 was not common practice. We tended to do development on Sparcstations, DECstations, or some other fancy RISC+Unix machine and didn't think much about Intel. The RISC systems were so expensive that one would never outfit a rack of them for QA, it was cheaper to build a bunch of switches instead.

Aside, redux: Junosphere also runs Junos as a virtual machine. In a company the size of Juniper these are likely to have been separate efforts, which might not even have known about each other at first. Nonetheless the timing of the two products is close enough that there may have been some cross-group pollination and shared underpinnings.

Misc Notes

The Director communicates with the Interconnects and Nodes via a separate control network, handled by Juniper's previous generation EX4200. This is an example of using a simpler network to bootstrap and control a more complex one.
QFX3500 has four QSFPs for 40 gig Ethernet. These can each be broken out into four 10G Ethernet ports, except the first one which supports only three 10G ports. That is fascinating. I wonder what the fourth one does?

Thats all for now. We may return to QFabric as it becomes more widely deployed or as additional details surface.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, November 23, 2011

Unnatural BGP

Last week Martin Casado published some thoughts about using OpenFlow and Software Defined Networking for simple forwarding. That is, does SDN help in distributing shortest path routes for IP prefixes? BGP/OSPF/IS-IS/etc are pretty good for this, with the added benefit of being fully distributed and thoroughly debugged.

The full article is worth a read. The summary (which Martin himself supplied) is "I find it very difficult to argue that SDN has value when it comes to providing simple connectivity." Existing routing protocols are quite good at distributing shortest path prefix routes, the real value of SDN is in handling more complex behaviors.

To expand on this a bit, there have been various efforts over the years to tailor forwarding behavior using more esoteric cost functions. The monetary cost of using a link is a common one to optimize for, as it provides justification for spending on a development effort and also because the business arrangements driving the pricing tend not to distill down to simple weights on a link. Providers may want to keep their customer traffic off of competing networks who are in a position to steal the customer. Transit fees may kick in if a peer delivers significantly more traffic than it receives, providing an incentive to preferentially send traffic through a peer in order to keep the business arrangement equitable. Many of these examples are covered in slides from a course by Jennifer Rexford, who spent several years working on such topics at AT&T Research.

BGP peering between routers at low weight, from each router to controller at high weight Until quite recently these systems had to be constructed using a standard routing protocol, because that is what the routers would support. BGP is a reasonable choice for this because its interoperability between modern implementations is excellent. The optimization system would peer with the routers, periodically recompute the desired behavior, and export those choices as the best route to destinations. To avoid having the Optimizer be a single point of failure able to bring down the entire network, the routers would retain peering connections with each other at a low weight as a fallback. The fallback routes would never be used so long as the Optimizer routes are present.

This works. It solves real problems. However it is hard to ignore the fact that BGP adds no value in the implementation of the optimization system. Its just an obstacle in the way of getting entries into the forwarding tables of the switch fabric. It also constrains the forwarding behaviors to those which BGP can express, generally some combination of destination address and QoS.

BGP peering between routers, SDN to controller Product support for software defined networking is now appearing in the market. These are generally parallel control paths alongside the existing routing protocols. SDN deposits routes into the same forwarding tables as BGP and OSPF, with some priority or precedence mechanism to control arbitration.

By using an SDN protocol these optimization systems are no longer constrained to what BGP can express, they can operate on any information which the hardware supports. Yet even here there is an awkward interaction with the other protocols. Its useful to keep the peering connections with other routers as a fallback in case of controller failure, but they are not well integrated. We can only set precedences between SDN and BGP and hope for the best.

I do wonder if the existing implementation of routing protocols needs a more significant rethink. There is great value in retaining compatibility with the external interfaces: being able to peer with existing BGP/OSPF/etc nodes is a huge benefit. In contrast, there is little value to retaining the internal implementation choices inside the router. The existing protocols could be made to cooperate more flexibly with other inputs. More speculatively, extensions to the protocol itself could label routes which are expected to be overridden by another source, and only present as a fallback path.

Monday, November 14, 2011

The Computer is the Network

Modern commodity switch fabric chips are amazingly capable, but their functionality is not infinite. In particular their parsing engines are generally fixed function, extracting information from the set of headers they were designed to process. Similarly the ability to modify packets is constrained to specifically designed in protocols, not an infinitely programmable rewrite engine.

Software defined networks are a wonderful thing, but development of an SDN agent to drive an existing ASIC does not suddenly make it capable of packet handling it wasn't already designed to do. At best, it might expose functions of which the hardware was always capable but had not been utilized by the older software. Yet even that is questionable: once a platform goes into production, the expertise necessary to thoroughly test and develop bug workarounds for ASIC functionality rapidly disperses to work on new designs. If part of the functionality isn't ready at introduction it is often removed from the documentation and retargeted as a feature of the next chip.

Decisions at the Edge

MPLS networks have an interesting philosophy: the switching elements at the core are conceptually simple, driven by a label stack prepended to the packet. Decisions are made at the edge of the network wherever possible. The core switches may have complex functionality dealing with fast reroutes or congestion management, but they avoid having to re-parse the payloads and make new forwarding decisions.

Ethernet switches have mostly not followed this philosophy, in fact we've essentially followed the opposite path. We've tended to design in features and capacity at the same time. Larger switch fabrics with more capacity also tend to have more features. Initially this happened because a chip with more ports required a larger silicon die to have room for all of the pins. Thus, there was more room for digital logic. Vendors have accentuated this in their marketing plans, omitting software support for features in "low end" edge switches even if they use the same chipset as the more featureful aggregation products.

This leaves software defined networking in a bit of a quandary. The MPLS model is simpler to reason about for large collections of switches, you don't have a combinatorial explosion of decision-making at each hop in the forwarding. Yet non-MPLS Ethernet switches have mostly not evolved in that way, and the edge switches don't have the capability to make all of the decisions for behaviors we might want.

Software Switches to the Rescue

A number of market segments have gradually moved to a model where the first network element to touch the packet is implemented mostly in software. This allows the hope of substantially increasing their capability. A few examples:

Datacenters: The first hop is a software switch running in the Hypervisor, like the VMware vSwitch or Cisco Nexus 1000v.

Wide Area Networks: WAN optimizers have become quite popular because they save money by reducing the amount of traffic sent over the WAN. These are mostly software products at this point, implementing protocol-specific compression and deduplication. Forthcoming 10 Gig products from Infineta appear to be the first products containing significant amounts of custom hardware.

Wifi AP with CPU, Wifi MAC, and Ethernet MAC

Wifi Access Points: Traditional, thick APs as seen in the consumer and carrier-provided equipment market are a CPU with Ethernet and Wifi, forwarding packets in software.
Thin APs for Enterprise use as deployed by Aruba/Airespace/etc are rather different, the real forwarding happens in hardware back at a central controller.

Carrier Network Access Units: Like Wifi APs, access gear for DSL and DOCSIS networks is usually a CPU with the appropriate peripherals and forwards frames in software.

Enterprise switch with CPU handling all packets, and a big red X through it

Enterprise: Just kidding, the Enterprise is still firmly in the "more hardware == more better" category. Most of the problems to be solved in Enterprise networking today deal with access control, security, and malware containment. Though CPU forwarding at the edge is one solution to that (attempted by ConSentry and Nevis, among others), the industry mostly settled on out of band approaches.

The Computer is the Network

The Sun Microsystems tagline through most of the 1980s was The Network is the Computer. At the time it referred to client-server computing like NFS and RPC, though the modern web has made this a reality for many people who spend most of their computing time with social and communication applications via the web. Its a shame that Sun itself didn't live to see the day.

We're now entering an era where the Computer is the Network. We don't want to depend upon the end-station itself to mark its packets appropriately, mainly due to security and malware considerations, but we want the flexibility of having software touch every packet. Market segments which provide that capability, like datacenters, WAN connections, and even service providers, are going to be a lot more interesting in the next several years.

Friday, October 28, 2011

Tweetflection Point

Last week at the Web 2.0 Summit in San Francisco, Twitter CEO Dick Costolo talked about recent growth in the service and how iOS5 had caused a sudden 3x jump in signups. He also said daily Tweet volume had reached 250 million. There are many, many estimates of the volume of Tweets sent, but I know of only three which are verifiable as directly from Twitter:

50M tweets/day in March, 2010 according to a Twitter blog post.
140M tweets/day in March, 2011 according to that same Twitter blog post.
250M tweets/day in late October, 2011 according to Dick Costolo.

Graphing these on a log scale shows the rate of growth in Tweet volume, ~~roughly tripling in two years~~ almost tripling in one year.

This graph is misleading though, as we have so few data points. It is very likely that, like signups for the service, the rate of growth in tweet volume suddenly increased after iOS5 shipped. Lets assume the rate of growth also tripled for the few days after the iOS5 launch, and zoom in on the tail end of the graph. It is quite similar up until a sharp uptick at the end.

Speculative graph of average daily Tweet volume, knee of curve at iOS5 launch.

The reality is somewhere between those two graphs, but likely still steep enough to be terrifying to the engineers involved. iOS5 will absolutely have an impact on the daily volume of Tweets, it would be ludicrous to think otherwise. It probably isn't so abrupt a knee in the curve as shown here, but it has to be substantial. Tweet growth is on a new and steeper slope now. It used to triple in a bit over a year, now it will triple in way less than one year.

Why this matters

Even five months ago, the traffic to carry the Twitter Firehose was becoming a challenge to handle. At that time the average throughput was 35 Mbps, with spikes up to about 138 Mbps. Scaling those numbers to today would be 56 Mbps sustained with spikes to 223 Mbps, and about one year until the spikes exceed a gigabit.

The indications I've seen are that the feed from Twitter is still sent uncompressed. Compressing using gzip (or Snappy) would gain some breathing room, but not solve the underlying problem. The underlying problem is that the volume of data is increasing way, way faster than the capacity of the network and computing elements tasked with handling it. Compression can reduce the absolute number of bits being sent (at the cost of even more CPU), but not reduce the rate of growth.

Fundamentally, there is a limit to how fast a single HTTP stream can go. As described in the post earlier this year, we've scaled network and CPU capacity by going horizontal and spreading load across more elements. Use of a single very fast TCP flow restricts the handling to a single network link and single CPU in a number of places. The network capacity has some headroom still, particularly by throwing money at it in the form of 10G Ethernet links. The capacity of a single CPU core to process the TCP stream is the more serious bottleneck. At some point relatively soon it will be more cost effective to split the Twitter firehose across multiple TCP streams, for easier scaling. The Tweet ID (or a new sequence number) could put tweets back into an absolute order when needed.

Unbalanced link aggregation with a single high speed HTTP firehose.

Update: My math was off. Even before the iOS5 announcement, the rate of growth was nearly tripling in one year. Corrected post.

Monday, October 24, 2011

Well Trodden Technology Paths

Modern CPUs are astonishingly complex, with huge numbers of caches, lookaside buffers, and other features to optimize performance. Hardware reset is generally insufficient to initialize these features for use: reset leaves them in a known state, but not a useful one. Extensive software initialization is required to get all of the subsystems working.

Its quite difficult to design such a complex CPU to handle its own initialization out of reset. The hardware verification wants to focus on the "normal" mode of operation, with all hardware functions active, but handling the boot case requires that a vast number of partially initialized CPU configurations also be verified. Any glitch in these partially initialized states results in a CPU which cannot boot, and is therefore useless.

Large CPU with many cores, and a small 68k CPU in the corner.

Many, and I'd hazard to guess most, complex CPU designs reduce their verification cost and design risk by relying on a far simpler CPU buried within the system to handle the earliest stages of initialization. For example, the Montalvo x86 CPU design contained a small 68000 core to handle many tasks of getting it running. The 68k was an existing, well proven logic design requiring minimal verification by the ASIC team. That small CPU went through the steps to initialize all the various hardware units of the larger CPUs around it before releasing them from reset. It ran an image fetched from flash which could be updated with new code as needed.

Warning: Sudden Segue Ahead

Networking today is at the cusp of a transition, one which other parts of the computing market have already undergone. We're quickly shifting from fixed-function switch fabrics to software defined networks. This shift bears remarkable similarities to the graphics industry shifting from fixed 3D pipelines to GPUs, and of CPUs shedding their coprocessors to focus on delivering more general purpose computing power.

Networking will also face some of the same issues as modern CPUs, where the optimal design for performance in normal operation is not suitable for handling its own control and maintenance. Last week's ruminations about L2 learning are one example: though we can make a case for software provisioning of MAC addresses, the result is a network which doesn't handle topology changes without software stepping in to reprovision.

The control network cannot, itself, seize up whenever there is a problem. The control network has to be robust in handling link failures and topology changes, to allow software to reconfigure the rest of the data-carrying network. This could mean an out of band control network, hooked to the ubiquitous Management ports on enterprise and datacenter switches. It might also be that a VLAN used for control operates in a very different fashion than one used for data, learning L2 addresses dynamically and using MSTP to handle link failures.

All in all, its an exciting time to be in networking.

Wednesday, October 19, 2011

Layer 2 History

Why use L2 networks in datacenters?
Virtual machines need to move from one physical server to another, to balance load. To avoid disrupting service, their IP address cannot change as a result of this move. That means the servers need to be in the same L3 subnet, leading to enormous L2 networks.

Why are enormous L2 networks a problem?
A switch looks up the destination MAC address of the packet it is forwarding. If the switch knows what port that MAC address is on, it sends the packet to that port. If the switch does not know where the MAC address is, it floods the packet to all ports. The amount of flooding traffic tends to rise as the number of stations attached to the L2 network increases.

Transition from half duplexed Ethernet to L2 switching.

Why do L2 switches flood unknown address packets?
So they can learn where that address is. Flooding the packet to all ports means that if that destination exists, it should see the packet and respond. The source address in the response packet lets the switches learn where that address is.

Why do L2 switches need to learn addresses dynamically?
Because they replaced simpler repeaters (often called hubs). Repeaters required no configuration, they just repeated the packet they saw on one segment to all other segments. Requiring extensive configuration of MAC addresses for switches would have been an enormous drawback.

Why did repeaters send packets to all segments?
Repeaters were developed to scale up Ethernet networks. Ethernet at that time mostly used coaxial cable. Once attached to the cable, the station could see all packets from all other stations. Repeaters kept that same property.

How could all stations see all packets?
There were limits placed on the maximum cable length, propagation delay through a repeater, and the number of repeaters in an Ethernet network. The speed of light in the coaxial cable used for the original Ethernet networks is 0.77c, or 77% of the speed of light in a vacuum. Ethernet has a minimum packet size to allow sufficient time for the first bit of the packet to propagate all the way across the topology and back before the packet ends transmission.

So there you go. We build datacenter networks this way because of the speed of light in coaxial cable.

Wednesday, October 5, 2011

Non Uniform Network Access

Four CPUs in a ring, with RAM attached to each. Non Uniform Memory Access is common in modern x86 servers. RAM is connected to each CPU, which connect to each other. Any CPU can access any location in RAM, but will incur additional latency if there are multiple hops along the way. This is the non-uniform part: some portions of memory take longer to access than others.

Yet the NUMA we use today is NUMA in the small. In the 1990s NUMA aimed to make very, very large systems commonplace. There were many levels of bridging, each adding yet more latency. RAM attached to the local CPU was fast, RAM for other CPUs on the same board was somewhat slower. RAM on boards in the same local grouping took longer still, while RAM on the other side of the chassis took forever. Nonetheless this was considered to be a huge advancement in system design because it allowed the software to access vast amounts of memory in the system with a uniform programming interface... except for performance.

Operating system schedulers which had previously run any task on any available CPU would randomly exhibit extremely bad behavior: a process running on distant combinations of CPU and RAM would run an order of magnitude slower. NUMA meant that all RAM was equal, but some was more equal than others. Operating systems desperately added notions of RAM affinity to go along with CPU and cache affinity, but reliably good performance was difficult to achieve.

As an industry we concluded that NUMA in moderation is good, but too much NUMA is bad. Those enormous NUMA systems have mostly lost out to smaller servers clustered together, where each server uses a bit of NUMA to improve its own scalability. The big jump in latency to get to another server is accompanied by a change in API, to use the network instead of memory pointers.

A Segue to Web Applications

Modern web applications can make tradeoffs between CPU utilization, memory footprint, and network bandwidth. Increase the amount of memory available for caching, and reduce the CPU required to recalculate results. Shard the data across more nodes to reduce the memory footprint on each at the cost of increasing network bandwidth. In many cases these tradeoffs don't need to be baked deep in the application, they can be tweaked via relatively simple changes. They can be adjusted to tune the application for RAM size, or for the availability of network bandwidth.

Further Segue To Overlay Networks

There is a lot of effort being put into overlay networks for virtualized datacenters, to create an L2 network atop an L3 infrastructure. This allows the infrastructure to run as an L3 network, which we are pretty good at scaling and managing, while the service provided to the VMs behaves as an L2 network.

Yet once the packets are carried in IP tunnels they can, through the magic of routing, be carried across a WAN to another facility. The datacenter network can be transparently extended to include resources in several locations. Transparently, except for performance. The round trip time across a WAN will inevitably be longer than the LAN, the speed of light demands it. Even for geographically close facilities the bandwidth available over a WAN will be far less than the bandwidth available within a datacenter, perhaps orders of magnitude less. Application tuning parameters set based on the performance within a single datacenter will be horribly wrong across the WAN.

I've no doubt that people will do it anyway. We will see L2 overlay networks being carried across VPNs to link datacenters together transparently (except for performance). Like the OS schedulers suddenly finding themselves in a NUMA world, software infrastructure within the datacenter will find itself in a network where some links are more equal than others. As an industry, we'll spend a couple years figuring out whether that was a good idea or not.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, October 2, 2011

NVGRE Musings

It is an interesting time to be involved in datacenter networking. There have been announcements recently of two competing proposals for running virtual L2 networks as an overlay atop a underlying IP network, VXLAN and NVGRE. Supporting an L2 service is important for virtualized servers, which need to be able to move from one physical server to another without changing their IP address or interrupting the services they provide. Having written about VXLAN in a series of three posts, now it is time for NVGRE. Ivan Pepelnjak has already posted about it on IOShints, which I recommend reading.

NVGRE encapsulates L2 frames inside tunnels to carry them across an L3 network. As its name implies, it uses GRE tunneling. GRE has been around for a very long time, and is well supported by networking gear and analysis tools. An NVGRE Endpoint uses the Key field in the GRE header to hold the Tenant Network Identifier (TNI), a 24 bit space of virtual LANs.

The encapsulated packet has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC, one is added by a physical NIC when the packet leaves the server. As the NVGRE Endpoint is likely to be a software component within the server, prior to hitting any NIC, the frames have no CRC. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

The NVGRE draft refers to IP addresses in the outer header as Provider Addresses, and the inner header as Customer Addresses. NVGRE can optionally also use an IP multicast group for each TNI to distribute L2 broadcast and multicast packets.

Not Quite Done

As befits its "draft" designation, a number of details in the NVGRE proposal are left to be determined in future iterations. One largish bit left unspecified is mapping of Customer Addresses to Provider. When an NVGRE Endpoint needs to send a packet to a remote VM, it must know the address of the remote NVGRE Endpoint. The mechanism to maintain this mapping is not yet defined, though it will be provisioned by a control function communicating with the Hypervisors and switches.

Optional Multicast?

The NVGRE draft calls out broadcast and multicast support as being optional, only if the network operator chooses to support it. To operate as a virtual Ethernet network a few broadcast protocols are essential, like ARP and IPv6 ND. Presumably if broadcast is not available, the NVGRE Endpoint would respond to these requests to its local VMs.

Yet I don't see how that can work in all cases. The NVGRE control plane can certainly know the Provider Address of all NVGRE Endpoints. It can know the MAC address of all guest VMs within the tenant network, because the Hypervisor provides the MAC address as part of the virtual hardware platform. There are notable exceptions where guest VMs use VRRP, or make up locally administered MAC addresses, but I'll ignore those for now.

I don't see how an NVGRE Endpoint can know all Customer IP Addresses. One of two things would have to happen:

Require all customer VMs to obtain their IP from the provider. Even backend systems using private, internal addresses would have to get them from the datacenter operator so that NVGRE can know where they are.
Implement a distributed learning function where NVGRE Endpoints watch for new IP addresses sent by their VMs and report them to all other Endpoints.

The current draft of NVGRE makes no mention of either such function, so we'll have to watch for future developments.

The earlier VL2 network also did not require multicast and handled ARP via a network-wide directory service. Many VL2 concepts made their way into NVGRE. So far as I understand it, VL2 assigned all IP addresses to VMs and could know where they were in the network.

Multipathing

Load balancing across four links between switches. An important topic for tunneling protocols is multipathing. When multiple paths are available to a destination, either LACP at L2 or ECMP at L3, the switches have to choose which link to use. It is important that packets on the same flow stay in order, as protocols like TCP use excessive reordering as an indication of congestion. Switches hash packet headers to select a link, so packets with the same headers will always choose the same link.

Tunneling protocols have issues with this type of hashing: all packets in the tunnel have the same header. This limits them to a single link, and congests that one link for other traffic. Some switch chips implement extra support for common tunnels like GRE, to include the Inner header in the hash computation. NVGRE would benefit greatly from this support. Unfortunately, it is not universal amongst modern switches.

Choosing Provider Address by hashing the Inner headers. The NVGRE draft proposes that each NVGRE Endpoint have multiple Provider Addresses. The Endpoints can choose one of several source and destination IP addresses in the encapsulating IP header, to provide variance to spread load across LACP and ECMP links. The draft says that when the Endpoint has multiple PAs, each Customer Address will be provisioned to use one of them. In practice I suspect it would be better were the NVGRE Endpoint to hash the Inner headers to choose addresses, and distribute the load for each Customer Address across all links.

Using multiple IP addresses for load balancing is clever, but I can't easily predict how well it will work. The number of different flows the switches see will be relatively small. For example if each endpoint has four addresses, the total number of different header combinations between any two endpoints is sixteen. This is sixteen times better than having a single address each, but it is still not a lot. Unbalanced link utilization seems quite possible.

Aside: Deliberate Multipathing

One LACP group feeding in to the next. The relatively limited variance in headers leads to an obvious next step: ensure the traffic will be balanced by predicting what the switch will do, and choose Provider IP addresses to optimize and ensure it is well balanced. In networking today we tend to solve problems by making the edges smarter.

The NVGRE draft says that selection of a Provider Address is provisioned to the Endpoint. Each Customer Address will be associated with exactly one Provider Address to use. I suspect that selection of Provider Addresses is expected to be done via an optimization mechanism like this, but I'm definitely speculating.

I'd caution that this is harder than it sounds. Switches use the ingress port as part of the hash calculation. That is, the same packet arriving on a different ingress port will choose a different egress link within the LACP/ECMP group. To predict behavior one needs a complete wiring diagram of the network. In the rather common case where several LACP/ECMP groups are traversed along the way to a destination, the link selected by each previous switch influences the hash computation of the next.

Misc Notes

The NVGRE draft mentions keeping an MTU state per Endpoint, to avoid fragmentation. Details will be described in future drafts. NVGRE certainly benefits from a datacenter network with a larger MTU, but will not require it.
VXLAN describes its overlay network as existing within a datacenter. NVGRE explicitly calls for spanning across wide area networks via VPNs, for example to connect a corporate datacenter to additional resources in a cloud provider. I'll have to cover this aspect in another post, this post is too long already.

Conclusion

Its quite difficult to draw a conclusion about NVGRE, as so much is still unspecified. There are two relatively crucial mapping functions which have yet to be described:

When a VM wants to contact a remote Customer IP and sends an ARP Request, in the absence of multicast, how can the matching MAC address be known?
When the NVGRE Endpoint is handed a frame destined to a remote Customer MAC, how does it find the Provider Address of the remote Endpoint?

So we'll wait and see.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, September 18, 2011

VXLAN Conclusion

This is the third and final article in a series about VXLAN. I recommend reading the first and second articles before this one. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

Foreign Gateways

Though I've consistently described VXLAN communications as occurring between VMs, many datacenters have a mix of virtual servers with single-instance physical servers. Something has to provide the VTEP function for all nodes on the network, but it doesn't have to be the server itself. A Gateway function can bridge to physical L2 networks, and with representatives of several switch companies as authors of the RFC this seems likely to materialize within the networking gear itself. The Gateway can also be provided by a server sitting within the same L2 domain as the servers it handles.

Gateway to communicate with other physical servers on an L2 segment.

Even if the datacenter consists entirely of VMs, a Gateway function is still needed in the switch. To communicate with the Internet (or anything else outside of their subnet) the VMs will ARP for their next hop router. This router has to have a VTEP.

Transition Strategy

Mixture of VTEP-enabled servers with non requires a gateway function somewhere I'm tempted to say there isn't a transition strategy. Thats a bit too harsh in that the Gateway function just described can serve as a proxy, but its not far from the mark. As described in the RFC, the VTEP assumes that all destination L2 addresses will be served by a remote VTEP somewhere. If the VTEP doesn't know the L3 address of the remote node to send to, it floods the packet to all VTEPs using multicast. There is no provision for direct L2 communication to nodes which have no VTEP. It is assumed that an existing installation of VMs on a VLAN will be taken out of service, and all nodes reconfigured to use VXLAN. VLANs can be converted individually, but there is no provision operation with a mixed set of VTEP-enabled and non-VTEP-enabled nodes on an existing VLAN.

For an existing datacenter which desires to avoid scheduling downtime for an entire VLAN, one transition strategy would use a VTEP Gateway as the first step. When the first server is upgraded to use VXLAN and have its own VTEP, all of its packets to other servers will go through this VTEP Gateway. As additional servers are upgraded they will begin communicating directly between VTEPs, and rely on the Gateway to maintain communication with the rest of their subnet.

Where would the Gateway function go? During the transition, which could be lengthy, the Gateway VTEP will be absolutely essential for operation. It shouldn't be a single point of failure, and this should trigger the network engineer's spidey sense about adding a new critical piece of infrastructure. It will need to be monitored, people will need to be trained in what to do if it fails, etc. Therefore it seems far more likely that customers will choose to upgrade their switches to include the VTEP Gateway function, so as not to add a new critical bit of infrastructure.

Controller to the Rescue?

Mixture of VTEP-enabled servers with non requires a gateway function somewhere What makes this transition strategy difficult to accept is that VMs have to be configured to be part of a VXLAN. They have to be assigned to a particular VNI, and that VNI has to be given an IP multicast address to use for flooding. Therefore something, somewhere knows the complete list of VMs which should be part of the VXLAN. In Rumsfeldian terms, there are only known unknown addresses and no unknown unknowns. That is, the VTEP can know the complete list of destination MAC addresses it is supposed to be able to reach via VXLAN. The only unknown is the L3 address of the remote VTEP. If the VTEP encounters a destination MAC address which it doesn't know about, it doesn't have to assume it is attached to a VTEP somewhere. It could know that some MAC addresses are reached directly, without VXLAN encapsulation.

The previous article in this series brought up the reliance on multicast for learning as an issue, and suggested that a VXLAN controller would be an important product to offer. That controller could also provide a better transition strategy, allowing VTEPs to know that some L2 addresses should be sent directly to the wire without a VXLAN tunnel. This doesn't make the controller part of the dataplane: it is only involved when stations are added or removed from the VXLAN. During normal forwarding, the controller is not involved.

It is safe to say that the transition strategy for existing, brownfield datacenter networks is the part of the VXLAN proposal which I like the least.

Other miscellaneous notes

VXLAN prepends 42 bytes of headers to the original packet. To avoid IP fragmentation the L3 network needs to handle a slightly larger frame size than standard Ethernet. Support for Jumbo frames is almost universal in networking gear at this point, this should not be an issue.

There is only a single multicast group per VNI. All broadcast and multicast frames in that VXLAN will be sent to that one IP multicast group and delivered to all VTEPs. The VTEP would likely run an IGMP Snooping function locally to determine whether to deliver multicast frames to its VMs. VXLAN as currently defined can't prune the delivery tree, all VTEPs must receive all frames. It would be nice to be able to prune delivery within the network, and not deliver to VTEPs which have no subscribing VMs. This would require multiple IP multicast groups per VNI, which would complicate the proposal.

Conclusion

I like the VXLAN proposal. I view the trend toward having enormous L2 networks in datacenters as disturbing, and see VXLAN as a way to give the VMs the network they want without tying it to the underlying physical infrastructure. It virtualizes the network to meet the needs of the virtual servers.

After beginning to publish these articles on VXLAN I became aware of another proposal, NVGRE. There appear to be some similarities, including the use of IP multicast to fan out L2 broadcast/multicast frames, and the two proposals even share an author in common. NVGRE uses GRE encapsulation instead of the UDP+VXLAN header, with multiple L2 addresses to provide load balancing across LACP/ECMP links. It will take a while to digest, but I expect to write some thoughts about NVGRE in the future.

Many thanks to Ken Duda, whose patient explanations of VXLAN on Google+ made this writeup possible.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Saturday, September 17, 2011

VXLAN Part Deux

This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

I strongly recommend reading the first article before this one, to provide background.

UDP Encapsulation

In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.

Four paths between routers, hashing headers chooses one. The UDP header serves an interesting purpose, it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in how it computes the hash: the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.

Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.

VTEP calculates hash of inner packet headers, places it in the UDP source port.

New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing, it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN the firewall question becomes more interesting. I don't think its a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.

VTEP Learning

VTEP Table of MAC:OuterIP mappings. The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.

When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.

VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. Its possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.

Multicast frame delivered to 3 VTEPs but dropped before reaching one. Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.

Suggestions

On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.

If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low numbered port number with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, as this makes VXLAN packets look like ephemeral ports used by UDP client applications.
When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already had to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.

Next article: A few final VXLAN topics.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Friday, September 16, 2011

The Care and Feeding of VXLAN

This is the first of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

Modern datacenter networks are designed around the requirements for virtualization. There are a number of physical servers connected to the network, each hosting a larger number of virtual servers. The Gentle Reader may be familiar with VMware on the desktop, with NAT to let VMs communicate with the Internet. Datacenter VMs don't work that way. Each VM has its own IP and MAC address, with software beneath the VMs to function as a switch. On the Ethernet we see a huge number of MAC addresses, from all of the VMs throughout the facility.

To rebalance server load it is necessary to support moving a virtual machine from a heavily loaded server to another more lightly loaded one. It is essentially impossible to change the IP address of a VM as part of this move. Though modern server OSes can change IP address without rebooting, doing so interrupts the service as client connections are closed and caches spread far and wide have to time out. To be useful, VM moves have to be mostly invisible and not interrupt services.

Datacenter networkmoving a VM from one physical server to another.

The most straightforward way to accomplish this is to have the servers sit in a single subnet and single broadcast domain, which leads to some truly enormous L2 networks. It is putting pressure on switch designs to support tens of thousands of MAC addresses (thankfully via hash engines, rather than CAMs). Everything we'd learned about networking to this point drove towards Layer3 networks for scalability, but it is all being rewritten for virtualized datacenters.

Enter VXLAN

At VMWorld several weeks ago there was an announcement of VXLAN, Virtual eXtensible LANs. I paid little attention at the time, but should have paid more. VXLAN is an encapsulation scheme to carry L2 frames atop L3 networks. Described like that it doesn't sound very interesting, but the details are well thought out. Additionally the RFC is authored by representatives from major virtualization vendors VMware, Citrix, and Red Hat, and by datacenter switch vendors Cisco and Arista. It will appear in real networks relatively quickly.

The bulk of the VXLAN implementation is handled via a tunneling endpoint. This will generally reside within the virtual server, in software running under the VMs. A new component called the VXLAN Tunnel End Point (VTEP) encapsulates frames inside an L3 tunnel. There can be 224 VXLANs, identified by a 24 bit VXLAN Network Identifier (VNI). The VTEP maintains a table of known destination MAC addresses, and stores the IP address of the tunnel to the remote VTEP to use for each. Unicast frames between VMs are sent directly to the unicast L3 address of the remote VTEP. Multicast and broadcast frames from VMs are sent to a multicast IP group associated with the VNI. The spec is vague on how VNIs map to a multicast IP address, merely saying that a management plane configures it along with VM membership in the VXLAN. Multicast distribution in most networks is something of an afterthought, making the address configurable allows VXLAN to cope with whatever facilities exist.

Overlay network connecting VMs through an L3 switch.

A Maze of Twisty Little Passages

VXLAN encapsulates L2 packets from the VMs within an Outer IP header to send across an IP network. The receiving VTEP decapsulates the packet, and consults the Inner headers to figure out how to deliver it to its destination.

The encapsulated packet retains its Inner MAC header and optional Inner VLAN, but has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC, one is added by a physical NIC when the packet leaves the server. As the VTEP is a software component within the server, prior to hitting any NIC, the frames have no CRC when the VTEP gets them. Therefore there is no integrity protection end to end, from the originating VM to receiving. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

Next article: UDP

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.