Monday, November 28, 2011

QFabric Followup

In August this site published a series of posts about the Juniper QFabric. Since then Juniper has released hardware documentation for the QFabric components, so it's time for a follow-up.

[Figure: QF edge Nodes, Interconnects, and Directors]

QFabric consists of Nodes at the edges wired to large Interconnect switches in the core. The whole collection is monitored and managed by out-of-band Directors. Juniper emphasizes that QFabric should be thought of as a single distributed switch, not as a network of individual switches. The entire QFabric is managed as one entity.

[Figure: control header prepended to frame]

The fundamental distinction between QFabric and conventional switches is where the forwarding decision is made. In a conventional switch topology each layer of switching looks at the L2/L3 headers to figure out what to do: the edge switch sends the packet to the distribution switch, which examines the headers again before sending the packet on towards the core, which examines the headers yet again. QFabric does not work this way. QFabric functions much more like the collection of switch chips inside a modular chassis: the forwarding decision is made by the ingress switch and conveyed through the rest of the fabric by prepending control headers. The Interconnect and the egress Node forward the packet according to its control header, not via another set of L2/L3 lookups.
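
To make the idea concrete, here is a minimal Python sketch of "decide once at ingress, then forward on the control header." The FabricHeader fields, table layout, and class names are all invented for illustration; they say nothing about QFabric's actual header format or lookup structures.

```python
# Toy model of "decide once at ingress, then forward by control header."
# Purely illustrative; the header fields and tables are invented and bear
# no relation to QFabric's real internal format.

from dataclasses import dataclass

@dataclass
class FabricHeader:
    egress_node: str   # which edge Node should emit the frame
    egress_port: int   # which port on that Node

class IngressNode:
    def __init__(self, l2_table):
        # l2_table maps destination MAC -> (egress_node, egress_port)
        self.l2_table = l2_table

    def forward(self, frame):
        # The only full L2/L3 lookup in the fabric happens here.
        node, port = self.l2_table[frame["dst_mac"]]
        frame["fabric_header"] = FabricHeader(node, port)
        return frame

class Interconnect:
    def forward(self, frame):
        # No address lookup: just read the prepended control header.
        return frame["fabric_header"].egress_node

class EgressNode:
    def forward(self, frame):
        # Strip the control header and emit on the indicated port.
        hdr = frame.pop("fabric_header")
        return hdr.egress_port

# One lookup at ingress, header-driven hops afterwards.
ingress = IngressNode({"aa:bb:cc:dd:ee:ff": ("node-7", 3)})
frame = ingress.forward({"dst_mac": "aa:bb:cc:dd:ee:ff", "payload": b"..."})
assert Interconnect().forward(frame) == "node-7"
assert EgressNode().forward(frame) == 3
```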


 

Node Groups

The Hardware Documentation describes two kinds of Node Groups, Server and Network, which gather multiple edge Nodes together for common purposes.

  • Server Node Groups are straightforward: normally the edge Nodes are independent, connecting servers and storage to the fabric. Pairs of edge switches can be configured as a Server Node Group for redundancy, allowing LAG groups to span the two switches.
  • Network Node Groups gather up to eight edge Nodes to interconnect with remote networks. Routing protocols like BGP and OSPF run on the Director systems, so the entire Group shares a common Routing Information Base and other data (sketched below).
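
As a rough illustration of the constraints described above, here is a small Python data model. It uses only facts from the hardware documentation (pairs for Server Node Groups, up to eight Nodes and a shared routing table for Network Node Groups); the class and method names are made up.

```python
# Toy data model of the two Node Group types. The names are invented;
# only the size limits and the shared RIB come from the documentation.

class ServerNodeGroup:
    MAX_NODES = 2  # a redundant pair; LAGs may span both members

    def __init__(self, nodes):
        if not 1 <= len(nodes) <= self.MAX_NODES:
            raise ValueError("a Server Node Group is at most a pair of Nodes")
        self.nodes = list(nodes)

class NetworkNodeGroup:
    MAX_NODES = 8  # edge Nodes facing external routers

    def __init__(self, nodes):
        if not 1 <= len(nodes) <= self.MAX_NODES:
            raise ValueError("a Network Node Group holds up to eight Nodes")
        self.nodes = list(nodes)
        # Routing protocols run on the Directors, so the RIB is a single
        # shared structure for the whole Group rather than per-Node state.
        self.shared_rib = {}

    def learn_route(self, prefix, next_hop):
        self.shared_rib[prefix] = next_hop  # one update covers every member

ngg = NetworkNodeGroup(["node-1", "node-2", "node-3"])
ngg.learn_route("10.0.0.0/8", "peer-router-a")
```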

Why have Groups? It's somewhat easier to understand the purpose of the Network Node Group: routing processes have to be spun up on the Directors, and perhaps those processes need some distinct entity to operate against. Why have Server Node Groups, though? Redundant server connections are certainly beneficial, but why require an additional fabric configuration to allow them?

[Figure: ingress fanout to four LAG member ports]

I don't know the answer, but I suspect it has to do with Link Aggregation (LAG). Server Node Groups allow a LAG to be configured using ports that span the two Nodes. In a chassis switch, LAG is handled by the ingress chip: it looks up the destination address to find the destination LAG, and every chip knows the membership of all LAGs in the chassis. The ingress chip computes a hash over the packet to pick which LAG member port to send it to. This is how LAG member ports can sit on different line cards: the ingress chip sends the packet to the correct card.
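
Here is a hedged Python sketch of that ingress-side member selection. The hash inputs, table layout, and names are illustrative only; real switching silicon does this in hardware with its own hash function.

```python
# Sketch of ingress-side LAG member selection as described for a chassis
# switch: every chip knows the full membership of every LAG, and a hash
# over the packet headers picks the member port.

import zlib

# Hypothetical LAG membership table, replicated on every ingress chip.
lag_members = {
    "lag-1": [("card-1", 4), ("card-1", 5), ("card-2", 4), ("card-2", 5)],
}

def pick_member(lag_id, packet):
    # Hash the flow identifiers so one flow always takes the same member
    # port (avoiding reordering) while different flows spread across members.
    flow = "{}{}".format(packet["src_mac"], packet["dst_mac"])
    members = lag_members[lag_id]
    index = zlib.crc32(flow.encode()) % len(members)
    return members[index]  # (line card, port) -- may be on a different card

card, port = pick_member("lag-1", {"src_mac": "00:11:22:33:44:55",
                                   "dst_mac": "aa:bb:cc:dd:ee:ff"})
```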

The downside of implementing LAG at ingress is that every chip has to know the membership of every LAG in the system. Whenever a LAG member port goes down, all chips have to be updated to stop using it. With QFabric, where the ingress chips are distributed across a network and the largest fabric could have thousands of server LAG connections, updating all of the Nodes whenever a link goes down could take a really long time. LAG failover is supposed to be quick, with minimal packet loss when a link fails. Therefore I wonder whether Juniper has implemented LAG a bit differently, perhaps by handling member port selection in the Interconnect, in order to minimize the time to handle a member port failure.
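
To show why the update cost motivates that guess, here is a trivial Python comparison of how many tables would need reprogramming when a member port fails under each scheme. The numbers are hypothetical, and this is only my speculation, not a description of how QFabric behaves.

```python
# Bookkeeping cost of a LAG member failure under two hypothetical schemes.
# Illustrates the reasoning only; not a claim about QFabric internals.

NUM_EDGE_NODES = 128     # hypothetical fabric size
NUM_INTERCONNECTS = 4    # hypothetical core size

def updates_if_resolved_at_ingress():
    # Every edge Node holds the LAG membership table, so a member port
    # failure means touching every Node before traffic stops using it.
    return NUM_EDGE_NODES

def updates_if_resolved_at_interconnect():
    # Ingress Nodes would only address "the LAG"; the Interconnects pick
    # the member port, so only they need the membership change.
    return NUM_INTERCONNECTS

print(updates_if_resolved_at_ingress())        # 128 tables to update
print(updates_if_resolved_at_interconnect())   # 4 tables to update
```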

I feel compelled to emphasize again: I'm making this up. I don't know how QFabric is implemented, nor why Juniper made the choices they made. It's just fun to speculate.


 

Virtualized Junos

Regarding the Director software, the Hardware Documentation says, "[Director devices] run the Junos operating system (Junos OS) on top of a CentOS foundation." Now that is an interesting choice. Way, way back in the mists of time, Junos started from NetBSD as its base OS. NetBSD is still a viable project and runs on modern x86 machines, yet Juniper chose to hoist Junos atop a Linux base instead.

I suspect that in the intervening time, the Junos kernel and platform support diverged so far from NetBSD development that it became impractical to integrate recent work from the public project. Juniper would have faced a substantial effort to handle modern x86 hardware, and chose instead to virtualize the Junos kernel in a VM whose hardware was easier to support. I'll bet the CentOS on the Director is the host for a Xen hypervisor.

Update: in the comments, Brandon Bennett and Julien Goodwin both note that Junos used FreeBSD as its base OS, not NetBSD.

Aside: with network OSes developed in the last few years, companies have tended to put effort into keeping the code portable enough to run on a regular x86 server. The development, training, QA, and testing benefits of being able to run on a regular server are substantial. That means implementing a proper hardware abstraction layer to handle running on a platform which doesn't have the fancy switching silicon. In the 1990s when Junos started, running on x86 was not common practice. We tended to do development on Sparcstations, DECstations, or some other fancy RISC+Unix machine and didn't think much about Intel. The RISC systems were so expensive that one would never outfit a rack of them for QA, it was cheaper to build a bunch of switches instead.

Aside, redux: Junosphere also runs Junos as a virtual machine. In a company the size of Juniper these are likely to have been separate efforts, which might not even have known about each other at first. Nonetheless the timing of the two products is close enough that there may have been some cross-group pollination and shared underpinnings.


 

Misc Notes

  • The Director communicates with the Interconnects and Nodes via a separate control network, handled by Juniper's previous generation EX4200. This is an example of using a simpler network to bootstrap and control a more complex one.
  • QFX3500 has four QSFPs for 40 gig Ethernet. These can each be broken out into four 10G Ethernet ports, except the first one, which supports only three 10G ports. That is fascinating. I wonder what the fourth one does?

That's all for now. We may return to QFabric as it becomes more widely deployed or as additional details surface.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.