Tuesday, August 23, 2011

Making Stuff Up About QFabric

This is the first of several articles about the Juniper QFabric. I have not been briefed by Juniper, nor received any NDA information. These articles are written based on Juniper's public statements and materials available on the web, supplemented with details from a special Packet Pushers podcast, and topped off with a healthy amount of speculation and guessing about how it works.

[Figure: Juniper QFabric with Nodes, Interconnect, and Director]

QFabric consists of edge nodes wired to two or four extremely large QF/Interconnect chassis, all managed via out of band links to QF/Directors. Juniper emphasizes that the collection of nodes, interconnects, and directors should be thought of as a single distributed switch rather than as a network. Packet handling within the QFabric is intended to be opaque, and the individual nodes are not separately configured. It is supposed to behave like one enormous, geographically distributed switch.

Therefore, to brainstorm about how the distributed QFabric works, we should think in terms of how a modular switch works and how its functions might be distributed.

Exactly One Forwarding Decision

[Figure: Ingress line card with switch fabric, connected to central supervisor with fabric, connected to egress line card with fabric]

Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, the most troublesome of which are the consistency issues. There is a delay between when a packet is processed by the ingress line card and the fabric, and between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and it's very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.

[Figure: Control header prepended to frame]

Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the Ethernet preamble.
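To make the "exactly one forwarding decision" idea concrete, here is a minimal sketch in Python. The 8-byte header layout, the field names, and the ingress/egress helpers are all made up for illustration; real switch silicon uses proprietary formats, and it overwrites part of the preamble rather than simply growing the frame as this sketch does. The point is only that the egress side obeys the header and never consults the L2 table again, so a table change while the frame is in flight cannot cause an inconsistent result.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout (an assumption, not any vendor's format):
# egress port, VLAN, rewrite index, traffic class, one pad byte.
HEADER_FORMAT = "!HHHBx"
HEADER_LEN = struct.calcsize(HEADER_FORMAT)  # 8 bytes


@dataclass(frozen=True)
class ControlHeader:
    egress_port: int
    vlan: int
    rewrite_index: int
    traffic_class: int

    def pack(self) -> bytes:
        return struct.pack(HEADER_FORMAT, self.egress_port, self.vlan,
                           self.rewrite_index, self.traffic_class)

    @classmethod
    def unpack(cls, data: bytes) -> "ControlHeader":
        return cls(*struct.unpack(HEADER_FORMAT, data[:HEADER_LEN]))


def ingress(frame: bytes, l2_table: dict) -> bytes:
    """Make the single forwarding decision and prepend the result."""
    dst_mac = frame[:6]
    header = ControlHeader(egress_port=l2_table[dst_mac], vlan=1,
                           rewrite_index=0, traffic_class=0)
    return header.pack() + frame


def egress(tagged: bytes) -> tuple:
    """Obey the header, strip it, and return (port, original frame)."""
    header = ControlHeader.unpack(tagged)
    return header.egress_port, tagged[HEADER_LEN:]


dst = bytes.fromhex("0200000000aa")
src = bytes.fromhex("0200000000bb")
frame = dst + src + b"\x08\x00" + b"payload"

table = {dst: 7}
tagged = ingress(frame, table)
table[dst] = 9                # the L2 table changes while the frame is in flight
port, out = egress(tagged)    # egress still uses the decision made at ingress
```

Here the frame leaves on port 7, the port chosen at ingress, even though the table now says 9. Only frames arriving after the change see the new entry.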

[Figure: Control header prepended to frame]

Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the port extender, a set of additional ports to be controlled remotely by a chassis switch. The link to the port extender will carry the control headers which would otherwise be restricted to the backplane. Similarly, several vendors sell stackable switches. Each unit in the stack can function as an independent switch, but can be connected via stack ports on the back to function together like a larger switch. The stack ports carry the prepended control headers from one stack member to the next, so the entire collection can function like a single forwarding plane.


In the Packet Pushers podcast and in an article on EtherealMind, the Interconnect is described as a Clos network with stages of the Clos implemented in cards in the front and back of the chassis. It is implemented using merchant silicon, not Juniper ASICs. The technology in the edge Node was not specified; it is my assumption that Juniper uses its own silicon there.

Forwarding decisions are made in the Nodes and sent to the Interconnect, which is a pure fabric with no decision making. This would be implemented by having the Nodes send control headers on their uplinks, in a format compatible with whatever merchant silicon is used in the Interconnect plus additional information needed to support the QFabric features. Juniper would not allow themselves to be locked in to a particular chip supplier, so I'm sure the QF/Node implementation would be very flexible in how it creates those headers. A new QF/Interconnect with a different chipset would be supportable via a firmware upgrade to the edge nodes.

The QF/Interconnect would in turn forward the packet to its destination with the control header intact. The destination switch would perform whatever handling was indicated in the control information, discard the extra header, and forward the packet out the egress port.
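The guessed-at flow above can be sketched as a toy model in Python. Everything here is hypothetical: the class names, the dict used as a control header, and the fabric-wide table mapping a MAC to a (node, port) pair are stand-ins for whatever Juniper actually does. What the sketch tries to show is the division of labor: the ingress Node does the only lookup, the Interconnect reads nothing but the header, and the egress Node follows the header and discards it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    dst_mac: str
    payload: str
    header: Optional[dict] = None  # hypothetical control header


class Node:
    def __init__(self, name: str):
        self.name = name

    def ingress(self, pkt: Packet, fabric_table: dict) -> Packet:
        # The only forwarding decision: pick the destination Node and port.
        dst_node, dst_port = fabric_table[pkt.dst_mac]
        pkt.header = {"dst_node": dst_node, "egress_port": dst_port}
        return pkt

    def egress(self, pkt: Packet) -> tuple:
        # Follow the header, then discard it before the packet leaves.
        port = pkt.header["egress_port"]
        pkt.header = None
        return port, pkt


class Interconnect:
    def __init__(self, nodes: dict):
        self.nodes = nodes  # node name -> Node

    def forward(self, pkt: Packet) -> Node:
        # Pure fabric: no address lookup, only the control header is read.
        return self.nodes[pkt.header["dst_node"]]


node1, node2 = Node("node1"), Node("node2")
interconnect = Interconnect({"node1": node1, "node2": node2})
fabric_table = {"00:aa": ("node2", 12)}  # assumed fabric-wide MAC table

pkt = node1.ingress(Packet("00:aa", "hello"), fabric_table)
dest = interconnect.forward(pkt)
port, delivered = dest.egress(pkt)
```

Note that the Interconnect class never sees fabric_table. Swapping in a different Interconnect chipset would only change what the Nodes have to put in the header, which fits the firmware-upgrade guess above.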

Oversubscribed QF/Nodes

One interesting aspect of the first generation QF/Node is that it is oversubscribed. The QFX3500 has 480 Gbps of downlink capacity, in the form of 48 x 10G ports. It has 160 Gbps of uplink, via 4 x 40Gbps ports. Oversubscribed line cards are not unheard of in modular chassis architectures, though it is generally the result of a follow-on generation of cards outstripping the capacity of the backplane. There have been chassis designs where the line cards were deliberately oversubscribed, but they are somewhat less common.
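The arithmetic is simple enough to check in a few lines of Python. The helper name is mine; the port counts and speeds are the QFX3500 figures quoted above.

```python
from fractions import Fraction


def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Return (downlink Gbps, uplink Gbps, ratio of downlink to uplink)."""
    down = down_ports * down_gbps
    up = up_ports * up_gbps
    return down, up, Fraction(down, up)


down, up, ratio = oversubscription(48, 10, 4, 40)
print(f"{down} Gbps down, {up} Gbps up, "
      f"{ratio.numerator}:{ratio.denominator} oversubscribed")
# prints "480 Gbps down, 160 Gbps up, 3:1 oversubscribed"
```

So if every edge port sent line-rate traffic toward the Interconnect at once, only a third of it could fit on the uplinks.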

QFabric has a very impressive system to handle congestion and flow control, which will be the topic of a future article. The oversubscribed uplink is a slightly different variation, but in the end is really another form of congestion for the fabric to deal with. It would buffer what it can, and assert Ethernet flow control and/or some other means of backpressure to the edge ports if necessary.
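As a generic illustration of "buffer what it can, then push back," here is a small Python sketch of a queue with high and low watermarks. This is not how QFabric does it (I don't know how QFabric does it); the class, the thresholds, and the paused flag are assumptions chosen to show the general pattern of asserting flow control before the buffer is actually exhausted.

```python
from collections import deque


class UplinkQueue:
    """Toy uplink buffer with high/low watermark backpressure."""

    def __init__(self, capacity, pause_at, resume_at):
        self.buf = deque()
        self.capacity = capacity    # hard limit; beyond this, packets drop
        self.pause_at = pause_at    # high watermark: assert backpressure
        self.resume_at = resume_at  # low watermark: release backpressure
        self.paused = False         # True while edge ports are told to stop

    def enqueue(self, pkt) -> bool:
        if len(self.buf) >= self.capacity:
            return False            # buffer exhausted, packet is dropped
        self.buf.append(pkt)
        if len(self.buf) >= self.pause_at:
            self.paused = True
        return True

    def dequeue(self):
        pkt = self.buf.popleft() if self.buf else None
        if self.paused and len(self.buf) <= self.resume_at:
            self.paused = False
        return pkt


q = UplinkQueue(capacity=8, pause_at=6, resume_at=2)
for i in range(6):
    q.enqueue(i)
paused_after_burst = q.paused    # True: the high watermark was reached

for _ in range(4):
    q.dequeue()
paused_after_drain = q.paused    # False: drained down to the low watermark
```

The gap between the two watermarks keeps the pause signal from flapping on every packet, and the headroom above the high watermark absorbs traffic already on the wire when the pause is sent.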

Next article: link speeds.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.