Monday, November 28, 2011

QFabric Followup

In August this site published a series of posts about the Juniper QFabric. Since then Juniper has released hardware documentation for the QFabric components, so it's time for a followup.

[Figure: QF edge Nodes, Interconnects, and Directors]

QFabric consists of Nodes at the edges wired to large Interconnect switches in the core. The whole collection is monitored and managed by out of band Directors. Juniper emphasizes that the QFabric should be thought of as a single distributed switch, not as a network of individual switches. The entire QFabric is managed as one entity.

[Figure: Control header prepended to frame]

The fundamental distinction between QFabric and conventional switches is in the forwarding decision. In a conventional switch topology each layer of switching looks at the L2/L3 headers to figure out what to do. The edge switch sends the packet to the distribution switch, which examines the headers again before sending the packet on towards the core (which examines the headers again). QFabric does not work this way. QFabric functions much more like the collection of switch chips inside a modular chassis: the forwarding decision is made by the ingress switch and is conveyed through the rest of the fabric by prepending control headers. The Interconnect and egress Node forward the packet according to its control header, not via another set of L2/L3 lookups.
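To make the ingress-decides model concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the control header fields (`egress_node`, `egress_port`), the table contents, and the function names are invented, and Juniper has not published the real header format.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    dst_mac: str
    payload: bytes


@dataclass
class FabricFrame:
    """A frame with a hypothetical control header prepended by the ingress Node."""
    egress_node: str  # assumed header field, not a documented one
    egress_port: int  # assumed header field, not a documented one
    frame: Frame


# Hypothetical ingress table: destination MAC -> (egress Node, egress port).
INGRESS_TABLE = {"00:11:22:33:44:55": ("node-7", 12)}

# Hypothetical Interconnect table: egress Node -> Interconnect output link.
INTERCONNECT_LINKS = {"node-7": 3}


def ingress_forward(frame):
    """The only step that examines the L2 header."""
    node, port = INGRESS_TABLE[frame.dst_mac]
    return FabricFrame(egress_node=node, egress_port=port, frame=frame)


def interconnect_forward(fabric_frame):
    """Picks an output link from the control header alone."""
    return INTERCONNECT_LINKS[fabric_frame.egress_node]


def egress_forward(fabric_frame):
    """Strips the control header and returns (port, original frame)."""
    return fabric_frame.egress_port, fabric_frame.frame
```

Note that `interconnect_forward` and `egress_forward` never touch `dst_mac`. That is the whole point: only the ingress Node needs the full L2/L3 lookup tables.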


Node Groups

The Hardware Documentation describes two kinds of Node Groups, Server and Network, which gather multiple edge Nodes together for common purposes.

  • Server Node Groups are straightforward: normally the edge Nodes are independent, connecting servers and storage to the fabric. Pairs of edge switches can be configured as Server Node Groups for redundancy, allowing LAGs to span the two switches.
  • Network Node Groups configure up to eight edge Nodes to interconnect with remote networks. Routing protocols like BGP or OSPF run on the Director systems, so the entire Group shares a common Routing Information Base and other data.

Why have Groups? It's somewhat easier to understand the purpose of the Network Node Group: routing processes have to be spun up on the Directors, and perhaps those processes have to point to some distinct entity to operate with. Why have Server Node Groups, though? Redundant server connections are certainly beneficial, but why require an additional fabric configuration to allow it?

[Figure: Ingress fanout to four LAG member ports]

I don't know the answer, but I suspect it has to do with Link Aggregation (LAG). Server Node Groups allow a LAG to be configured using ports spanning the two Nodes. In a chassis switch, LAG is handled by the ingress chip. It looks up the destination address to find the destination port. Every chip knows the membership of all LAGs in the chassis. The ingress chip computes a hash of the packet to pick which LAG member port to send the packet to. This is how LAG member ports can be on different line cards: the ingress chip sends the packet to the correct card.
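Ingress-side member selection in a chassis switch can be sketched roughly as follows. This is an illustration only: the membership table, the choice of header fields in `flow_key`, and the use of CRC32 as the hash are all assumptions, and real switch chips use their own hash functions and field selections.

```python
import zlib

# Hypothetical LAG membership, replicated to every ingress chip.
# Each member is a (line card, port) pair.
LAG_MEMBERS = {
    "lag1": [("card1", 4), ("card1", 5), ("card2", 4), ("card2", 5)],
}


def flow_key(src_ip, dst_ip, src_port, dst_port):
    """Builds a per-flow key so all packets of one flow hash the same way."""
    return f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()


def pick_member(lag, key):
    """Hashes the flow key to choose one member port of the LAG."""
    members = LAG_MEMBERS[lag]
    return members[zlib.crc32(key) % len(members)]
```

Because the hash covers only per-flow fields, every packet of a flow lands on the same member port and stays in order. The cost is that `LAG_MEMBERS` must be kept identical on every ingress chip.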

[Figure: Ingress fanout to four LAG member ports]

The downside of implementing LAG at ingress is that every chip has to know the membership of all LAGs in the system. Whenever a LAG member port goes down, all chips have to be updated to stop using it. With QFabric, where ingress chips are distributed across a network and the largest fabric could have thousands of server LAG connections, updating all of the Nodes whenever a link goes down could take a really long time. LAG failure is supposed to be quick, with minimal packet loss when a link fails. Therefore I wonder if Juniper has implemented LAG a bit differently, perhaps by handling member port selection in the Interconnect, in order to minimize the time to handle a member port failure.

I feel compelled to emphasize again: I'm making this up. I don't know how QFabric is implemented nor why Juniper made the choices they made. It's just fun to speculate.


Virtualized Junos

Regarding the Director software, the Hardware Documentation says, "[Director devices] run the Junos operating system (Junos OS) on top of a CentOS foundation." Now that is an interesting choice. Way, way back in the mists of time, Junos started from NetBSD as its base OS. NetBSD is still a viable project and runs on modern x86 machines, yet Juniper chose to hoist Junos atop a Linux base instead.

I suspect that in the intervening time, the Junos kernel and platform support diverged so far from NetBSD development that it became impractical to integrate recent work from the public project. Juniper would have faced a substantial effort to handle modern x86 hardware, and chose instead to virtualize the Junos kernel in a VM whose hardware was easier to support. I'll bet the CentOS on the Director is the host for a Xen hypervisor.

Update: in the comments, Brandon Bennett and Julien Goodwin both note that Junos used FreeBSD as its base OS, not NetBSD.

Aside: with network OSes developed in the last few years, companies have tended to put effort into keeping the code portable enough to run on a regular x86 server. The development, training, QA, and testing benefits of being able to run on a regular server are substantial. That means implementing a proper hardware abstraction layer to handle running on a platform which doesn't have the fancy switching silicon. In the 1990s when Junos started, running on x86 was not common practice. We tended to do development on Sparcstations, DECstations, or some other fancy RISC+Unix machine and didn't think much about Intel. The RISC systems were so expensive that one would never outfit a rack of them for QA; it was cheaper to build a bunch of switches instead.

Aside, redux: Junosphere also runs Junos as a virtual machine. In a company the size of Juniper these are likely to have been separate efforts, which might not even have known about each other at first. Nonetheless the timing of the two products is close enough that there may have been some cross-group pollination and shared underpinnings.


Misc Notes

  • The Director communicates with the Interconnects and Nodes via a separate control network, handled by Juniper's previous generation EX4200. This is an example of using a simpler network to bootstrap and control a more complex one.
  • QFX3500 has four QSFPs for 40 gig Ethernet. These can each be broken out into four 10G Ethernet ports, except the first one which supports only three 10G ports. That is fascinating. I wonder what the fourth one does?

That's all for now. We may return to QFabric as it becomes more widely deployed or as additional details surface.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, November 23, 2011

Unnatural BGP

Last week Martin Casado published some thoughts about using OpenFlow and Software Defined Networking for simple forwarding. That is, does SDN help in distributing shortest path routes for IP prefixes? BGP/OSPF/IS-IS/etc are pretty good for this, with the added benefit of being fully distributed and thoroughly debugged.

The full article is worth a read. The summary (which Martin himself supplied) is "I find it very difficult to argue that SDN has value when it comes to providing simple connectivity." Existing routing protocols are quite good at distributing shortest path prefix routes; the real value of SDN is in handling more complex behaviors.

To expand on this a bit, there have been various efforts over the years to tailor forwarding behavior using more esoteric cost functions. The monetary cost of using a link is a common one to optimize for, as it provides justification for spending on a development effort and also because the business arrangements driving the pricing tend not to distill down to simple weights on a link. Providers may want to keep their customer traffic off of competing networks who are in a position to steal the customer. Transit fees may kick in if a peer delivers significantly more traffic than it receives, providing an incentive to preferentially send traffic through a peer in order to keep the business arrangement equitable. Many of these examples are covered in slides from a course by Jennifer Rexford, who spent several years working on such topics at AT&T Research.

[Figure: BGP peering between routers at low weight, from each router to controller at high weight]

Until quite recently these systems had to be constructed using a standard routing protocol, because that is what the routers would support. BGP is a reasonable choice for this because its interoperability between modern implementations is excellent. The optimization system would peer with the routers, periodically recompute the desired behavior, and export those choices as the best route to destinations. To avoid having the Optimizer be a single point of failure able to bring down the entire network, the routers would retain peering connections with each other at a low weight as a fallback. The fallback routes would never be used so long as the Optimizer routes are present.
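The fallback arrangement amounts to a best-route choice by weight. The sketch below is a deliberate simplification under stated assumptions: the `Route` fields, the weight values, and the peer names are invented, and a real BGP best-path selection compares many more attributes than a single weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    learned_from: str  # hypothetical peer name
    weight: int        # higher wins in this sketch


def best_route(routes, prefix):
    """Returns the highest-weight route for an exact prefix, or None."""
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.weight, default=None)


# Hypothetical routes: one from the Optimizer, one from a neighboring router.
optimizer_route = Route("203.0.113.0/24", "192.0.2.1", "optimizer", 1000)
fallback_route = Route("203.0.113.0/24", "192.0.2.9", "router-b", 10)
```

With both routes present, `best_route([optimizer_route, fallback_route], "203.0.113.0/24")` picks the Optimizer's route. If the Optimizer session drops and its route is withdrawn, the same call over the remaining routes falls back to `router-b` without any other configuration change.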

This works. It solves real problems. However, it is hard to ignore the fact that BGP adds no value in the implementation of the optimization system. It's just an obstacle in the way of getting entries into the forwarding tables of the switch fabric. It also constrains the forwarding behaviors to those which BGP can express, generally some combination of destination address and QoS.

[Figure: BGP peering between routers, SDN to controller]

Product support for software defined networking is now appearing in the market. These are generally parallel control paths alongside the existing routing protocols. SDN deposits routes into the same forwarding tables as BGP and OSPF, with some priority or precedence mechanism to control arbitration.
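A precedence mechanism of this kind might look something like the sketch below. It is an assumption-laden illustration, not any vendor's implementation: the `ForwardingTable` class, the precedence numbers, and the source names are made up, and lookups are by exact prefix rather than longest-prefix match to keep it short.

```python
# Hypothetical precedence values for illustration only; lower wins.
PRECEDENCE = {"sdn": 5, "bgp": 20, "ospf": 110}


class ForwardingTable:
    """Holds candidate next hops per prefix from several control sources."""

    def __init__(self):
        self._candidates = {}  # prefix -> {source: next_hop}

    def install(self, prefix, source, next_hop):
        self._candidates.setdefault(prefix, {})[source] = next_hop

    def withdraw(self, prefix, source):
        self._candidates.get(prefix, {}).pop(source, None)

    def lookup(self, prefix):
        """Returns (winning source, next hop) for an exact prefix, or None."""
        by_source = self._candidates.get(prefix)
        if not by_source:
            return None
        winner = min(by_source, key=PRECEDENCE.__getitem__)
        return winner, by_source[winner]
```

The arbitration is all-or-nothing per prefix: whichever source has the better precedence wins outright, and the sources know nothing about each other.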

By using an SDN protocol, these optimization systems are no longer constrained to what BGP can express; they can operate on any information which the hardware supports. Yet even here there is an awkward interaction with the other protocols. It's useful to keep the peering connections with other routers as a fallback in case of controller failure, but they are not well integrated. We can only set precedences between SDN and BGP and hope for the best.

I do wonder if the existing implementation of routing protocols needs a more significant rethink. There is great value in retaining compatibility with the external interfaces: being able to peer with existing BGP/OSPF/etc nodes is a huge benefit. In contrast, there is little value to retaining the internal implementation choices inside the router. The existing protocols could be made to cooperate more flexibly with other inputs. More speculatively, extensions to the protocol itself could label routes which are expected to be overridden by another source, and only present as a fallback path.

Monday, November 14, 2011

The Computer is the Network

Modern commodity switch fabric chips are amazingly capable, but their functionality is not infinite. In particular their parsing engines are generally fixed function, extracting information from the set of headers they were designed to process. Similarly, the ability to modify packets is constrained to the protocols specifically designed in; it is not an infinitely programmable rewrite engine.

Software defined networks are a wonderful thing, but development of an SDN agent to drive an existing ASIC does not suddenly make it capable of packet handling it wasn't already designed to do. At best, it might expose functions of which the hardware was always capable but had not been utilized by the older software. Yet even that is questionable: once a platform goes into production, the expertise necessary to thoroughly test and develop bug workarounds for ASIC functionality rapidly disperses to work on new designs. If part of the functionality isn't ready at introduction it is often removed from the documentation and retargeted as a feature of the next chip.


Decisions at the Edge

MPLS networks have an interesting philosophy: the switching elements at the core are conceptually simple, driven by a label stack prepended to the packet. Decisions are made at the edge of the network wherever possible. The core switches may have complex functionality dealing with fast reroutes or congestion management, but they avoid having to re-parse the payloads and make new forwarding decisions.
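The division of labor between edge and core can be sketched in a few lines. This is a toy model with assumed names and table layouts (`edge_push`, `core_forward`, the label numbers, the router names); it ignores TTL, traffic class bits, and everything else in a real MPLS header.

```python
def edge_push(payload, labels):
    """Edge router: the one place a forwarding decision is made from the payload."""
    return {"labels": list(labels), "payload": payload}


def core_forward(label_table, labelled):
    """Core router: acts on the top label only and never parses the payload.

    label_table maps incoming label -> (action, outgoing label, next hop).
    """
    top, rest = labelled["labels"][0], labelled["labels"][1:]
    action, out_label, next_hop = label_table[top]
    if action == "swap":
        new_labels = [out_label] + rest
    elif action == "pop":
        new_labels = rest
    else:
        raise ValueError(f"unknown action: {action}")
    return next_hop, {"labels": new_labels, "payload": labelled["payload"]}


# Hypothetical label tables for two core routers on one path.
CORE1 = {100: ("swap", 200, "core2")}
CORE2 = {200: ("pop", None, "egress-edge")}
```

A packet pushed with label 100 at the edge is swapped to 200 at `CORE1` and popped at `CORE2`, and the payload is carried through untouched at each hop.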

Ethernet switches have mostly not followed this philosophy. In fact, we've essentially followed the opposite path. We've tended to design in features and capacity at the same time. Larger switch fabrics with more capacity also tend to have more features. Initially this happened because a chip with more ports required a larger silicon die to have room for all of the pins. Thus, there was more room for digital logic. Vendors have accentuated this in their marketing plans, omitting software support for features in "low end" edge switches even if they use the same chipset as the more featureful aggregation products.

This leaves software defined networking in a bit of a quandary. The MPLS model is simpler to reason about for large collections of switches: you don't have a combinatorial explosion of decision-making at each hop in the forwarding. Yet non-MPLS Ethernet switches have mostly not evolved in that way, and the edge switches don't have the capability to make all of the decisions for behaviors we might want.


Software Switches to the Rescue

A number of market segments have gradually moved to a model where the first network element to touch the packet is implemented mostly in software. This allows the hope of substantially increasing their capability. A few examples:

[Figure: vswitch running in the Hypervisor]

Datacenters: The first hop is a software switch running in the Hypervisor, like the VMware vSwitch or Cisco Nexus 1000v.

[Figure: WAN Optimizer with 4 CPUs]

Wide Area Networks: WAN optimizers have become quite popular because they save money by reducing the amount of traffic sent over the WAN. These are mostly software products at this point, implementing protocol-specific compression and deduplication. Forthcoming 10 Gig products from Infineta appear to be the first products containing significant amounts of custom hardware.

[Figure: Wifi AP with CPU, Wifi MAC, and Ethernet MAC]

Wifi Access Points: Traditional, thick APs as seen in the consumer and carrier-provided equipment market are a CPU with Ethernet and Wifi, forwarding packets in software.
Thin APs for Enterprise use, as deployed by Aruba/Airespace/etc, are rather different: the real forwarding happens in hardware back at a central controller.

[Figure: Cable modem with DOCSIS and Ethernet]

Carrier Network Access Units: Like Wifi APs, access gear for DSL and DOCSIS networks is usually a CPU with the appropriate peripherals and forwards frames in software.

[Figure: Enterprise switch with CPU handling all packets, and a big red X through it]

Enterprise: Just kidding, the Enterprise is still firmly in the "more hardware == more better" category. Most of the problems to be solved in Enterprise networking today deal with access control, security, and malware containment. Though CPU forwarding at the edge is one solution to that (attempted by ConSentry and Nevis, among others), the industry mostly settled on out of band approaches.


The Computer is the Network

The Sun Microsystems tagline through most of the 1980s was "The Network is the Computer." At the time it referred to client-server computing like NFS and RPC, though the modern web has made this a reality for many people who spend most of their computing time with social and communication applications via the web. It's a shame that Sun itself didn't live to see the day.

We're now entering an era where the Computer is the Network. We don't want to depend upon the end-station itself to mark its packets appropriately, mainly due to security and malware considerations, but we want the flexibility of having software touch every packet. Market segments which provide that capability, like datacenters, WAN connections, and even service providers, are going to be a lot more interesting in the next several years.