Friday, August 26, 2011

QFabric Conclusion

This is the fourth and final article in a series exploring the Juniper QFabric. Earlier articles provided an overview, a discussion of link speed, and musings on flow control. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article attempts to cover a few loose ends, and wraps up the series.

As with previous days, the flow control post sparked an interesting discussion on Google+.


Whither Protocols?

[Figure: Director connected to edge node and peered with another switch]

For a number of years switch and router manufacturers competed on protocol support, implementing various extensions to OSPF/BGP/Spanning Tree/etc. in their software. QFabric is almost completely silent about protocols. In part this is a marketing philosophy: positioning the QFabric as a distributed switch instead of a network means that the protocols running within the fabric are an implementation detail, not something to talk about. I don't know what protocols are run between the nodes of the QFabric, but I'm sure it's not Spanning Tree and OSPF.

Yet QFabric will need to connect to other network elements at its edge, where the datacenter connects to the outside world. Presumably the routing protocols it needs are implemented in the QF/Director and piped over to whichever switch ports connect to the rest of the network. If there are multiple peering points, they need to be handled by the same routing entity and share a common routing information base.


Flooding Frowned Upon

The edge Nodes have an L2 table holding 96K MAC addresses. This reinforces the notion that switching decisions are made at the ingress edge: every Node can know how to reach the destination MAC addresses behind every port. There are a few options for distributing MAC address information to all of the nodes, but I suspect that flooding unknown addresses to all ports is not the preferred mechanism. If flooding is allowed at all, it would be carefully controlled.

Much of modern datacenter design revolves around virtualization. The VMWare vCenter (or equivalent) is a single, authoritative source of topology information for virtual servers. By hooking to the VM management system, the QFabric Director could know the expected port and VLAN for each server MAC address. The Node L2 tables could be pre-populated accordingly.

By hooking to the VM management console QFabric could also coordinate VLANs, flow control settings, and other network settings with the virtual switches running in software.
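As a purely illustrative sketch of what pre-population might amount to (the inventory format, field names, and function below are my own invention, not a real vCenter or QFabric API), the Director would essentially just turn the VM manager's inventory into static L2 entries:

# Hypothetical sketch: build static L2 entries from a VM manager's inventory.
# The data format here is invented for illustration; a real integration would
# use the VM management system's actual API.

vm_inventory = [
    # mac, edge node, node port, vlan, as reported by the VM manager
    {"mac": "00:50:56:aa:bb:01", "node": "node-07", "port": 12, "vlan": 100},
    {"mac": "00:50:56:aa:bb:02", "node": "node-19", "port": 3,  "vlan": 100},
]

def build_l2_entries(inventory):
    """Map each known MAC to its (node, port, vlan) so every ingress Node
    can forward to it without ever flooding for an unknown address."""
    return {vm["mac"]: (vm["node"], vm["port"], vm["vlan"]) for vm in inventory}

l2_table = build_l2_entries(vm_inventory)
print(l2_table["00:50:56:aa:bb:01"])   # ('node-07', 12, 100)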


NetOps Force Multiplier

Where previously network engineers would configure dozens of switches, QFabric proposes to manage a single distributed switch. Done well, this should be a substantial time saver. There will of course be cases where the abstraction leaks and the individual Nodes have to be dealt with. The failure modes in a distributed switch are simply different: it's unlikely that a single line card within a chassis will unexpectedly lose power, but it's almost certain that Nodes occasionally will. Nonetheless, the cost to operate QFabric seems promising.


Conclusion

QFabric is an impressive piece of work, clearly the result of several years' effort. Though the Interconnects use merchant silicon, Juniper almost certainly started working with the manufacturer at the start of the project to ensure the chip would meet their needs.

The most interesting part of QFabric is its flow control mechanism, for which Juniper has made some pretty stunning claims. A flow control mechanism with fairness, no packet loss, and quick reaction to changes over such a large topology is an impressive feat.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Thursday, August 25, 2011

QFabric: Flow Control Considered Harmful

This is the third in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on flow control.

Yesterday's post sparked an excellent discussion on Google+, pushing back on the idea of overclocked backplane links and suggesting LACP hashing as a reason for the 40 Gbps uplink requirement. Ken Duda is a smart guy.


XON/XOFF

Ethernet includes (optional) link level flow control. An Ethernet device can signal its neighbor to stop sending traffic, requiring the neighbor to buffer packets until told it can resume sending. This flow control mechanism is universally implemented in the MAC hardware, not requiring software intervention on every flow control event.

[Figure: Switch chip with one congested port flow controlling the entire link]

The flow control is for the whole link. For a switch with multiple downstream ports, there is no way to signal back to the sender that only some of the ports are congested. In this diagram, a single congested port requires the upstream to be flow controlled, even though the other port could accept more packets. Ethernet flow control suffers from head of line blocking: a single congested port will choke off traffic to other, uncongested ports.

IEEE has tried to address this in two ways. Priority-based flow control, defined in IEEE 802.1Qbb, lets each of the eight classes of service be flow controlled independently. Bulk data traffic would no longer block VoIP, for example. IEEE 802.1Qau is defining a congestion notification capability to send an explicit notification to compatible endstations, asking them to slow down. Per-priority pause is already appearing in some products. Congestion notification involves NICs and client operating systems, and is slower going.


QFabric

Juniper's whitepaper on QFabric has this to say about congestion:

Finally, the only apparent congestion is due to the limited bandwidth of ingress and egress interfaces and any congestion of egress interfaces does not affect ingress interfaces sending to non-congested interfaces; this noninterference is referred to as “non-blocking”

QFabric very clearly does not rely on Ethernet flow control for the links between edge nodes and interconnect, not even per-priority. It does something else, something which can reflect the queue state of an egress port on the far side of the fabric back to control the ingress. However, Juniper has said nothing that I can find about how this works. They rightly consider it part of their competitive advantage.

So... let's make stuff up. How might this be done?

Juniper has said that the Interconnect uses merchant switch silicon, but hasn't said anything about the edge nodes. As all the interesting stuff is done at the edge and Juniper has a substantial ASIC development capability, I'd assume they are using their own silicon there. Whatever mechanism they use for flow control would be implemented in the Nodes.

Most of the mechanisms I can think of require substantial buffering. QFX3500, the first QFabric edge node, has 4x40 Gbps ports connecting to the Interconnect. That is 20 gigabytes per second. A substantial bank of SRAM could buffer that traffic for a fraction of a second. For example, 256 Megabytes could absorb a bit over 13 milliseconds worth of traffic. 13 milliseconds provides a lot of time for real-time software to swing into action.
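Spelling out that arithmetic (the 256 MB figure is my own assumption, not a published spec):

# Back-of-the-envelope buffer math for the QFX3500 uplinks described above.
uplink_gbps = 4 * 40                            # 4 x 40 Gbps toward the Interconnect
uplink_bytes_per_sec = uplink_gbps * 1e9 / 8    # = 20 GB/s

buffer_bytes = 256 * 2**20                      # a hypothetical 256 MB SRAM bank
absorb_ms = buffer_bytes / uplink_bytes_per_sec * 1000
print(round(absorb_ms, 1))                      # ~13.4 ms of line-rate traffic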

A few possibilities for using that buffering:

  • The memory could be used for egress buffering at the edge ports. Each Node could report its buffer status to all other nodes in the QFabric, several thousand times per second. Adding a few thousand packets per second from each switch is an insignificant load on the fabric. From there we can imagine a distributed flow control mechanism, which several thousand times per second would re-evaluate how much traffic it is allowed to send to each remote node in the fabric. Ethernet flow control frames could be sent to the sending edge port to slow it down.

    Or rather, smarter people than me can imagine how to construct a distributed flow control mechanism. My brain hurts just thinking about it.
  • Hardware counters could track the rate each Node is receiving traffic from each other Node, and software could report this several times per second. Part of the memory could be used as uplink buffering, with traffic shapers controlling the rate sent to each remote Node. Software could adjust the traffic shaping several thousand times per second to achieve fairness.

    Substantial uplink buffering also helps with oversubscribed uplinks. QFX3500 has 3:1 oversubscription.

I'll reiterate that the bullet points above are not how QFabric actually works; I made them up. Juniper isn't revealing how QFabric flow control works, and if it achieves the results claimed it is a massive competitive advantage. I'm listing these to illustrate a point: a company willing to design and support its own switch silicon can take a wholly different approach from the rest of the industry. In the best case, they end up with an advantage for several years. In the worst case, they don't keep up with merchant silicon. I've been down both of those paths; the second is not fun.
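To make the second bullet above slightly more concrete (and again, this is my invention for illustration, not Juniper's mechanism), a software loop adjusting per-destination shapers a few thousand times per second might look roughly like this:

# Toy sketch of the made-up shaper idea from the second bullet above.
# Each ingress Node keeps a shaper rate per remote Node and nudges it up or
# down based on the buffer occupancy that remote Node last reported.

UPLINK_GBPS = 160.0          # total uplink capacity of this Node (QFX3500)
MIN_RATE, STEP = 1.0, 5.0    # Gbps; arbitrary illustrative constants

def adjust_shapers(shaper_gbps, reported_occupancy):
    """shaper_gbps: {remote_node: current rate in Gbps}
       reported_occupancy: {remote_node: egress buffer fill, 0.0..1.0}"""
    for node, fill in reported_occupancy.items():
        if fill > 0.8:                      # remote egress is filling up: back off
            shaper_gbps[node] = max(MIN_RATE, shaper_gbps[node] / 2)
        elif fill < 0.2:                    # remote egress is nearly idle: probe upward
            shaper_gbps[node] = min(UPLINK_GBPS, shaper_gbps[node] + STEP)
    return shaper_gbps

# Called several thousand times per second by (hypothetical) real-time software.
rates = adjust_shapers({"node-19": 40.0, "node-23": 40.0},
                       {"node-19": 0.95, "node-23": 0.05})
print(rates)   # {'node-19': 20.0, 'node-23': 45.0}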

There are precedents for the kind of systems described above: InfiniBand reports queue information to neighboring IB nodes, and from this works out a flow control policy. InfiniBand was designed in an earlier age when software wouldn't have been able to handle the strict timing requirements, so it is defined in terms of hardware token buckets.

One thing I am sure of is that the QF/Director does not get involved in flow control. That would be a scalability and reliability problem. Juniper stated that the QFabric will continue to forward traffic even if the Director nodes have failed, so the flow control mechanism cannot rely on the Directors.

Next article: wrap-up.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Wednesday, August 24, 2011

The Backplane Goes To Eleven

This is the second in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on link speeds.


How Fast is Fast?

Ethernet link speeds are rigidly specified by the IEEE 802.3 standards, for example at 1/10/40/100/etc Gbps. It is important that Ethernet implementations from different vendors be able to interoperate.

Backplane links face fewer constraints, as there is little requirement for interoperability between implementations. Even if one wanted to, it isn't possible to plug a card from one vendor into another vendor's chassis; they simply don't fit. Therefore backplane links within a chassis have been free to tweak their link speeds for better performance. In the 10 Gbps generation of products, backplane links have typically run at 12.5 Gbps. In the 40 Gbps Ethernet generation I'd expect 45 or 50 Gbps backplane links (I don't really know, I no longer work in that space). A well-designed SERDES for a particular speed will have a bit of headroom, enabling faster operation over high quality links like a backplane. Running them faster tends to be possible without heroic effort.

[Figure: Switch chip with 48 x 1 Gbps downlinks and 4 x 10/12.5 Gbps uplinks]

A common spec for switch silicon in the 1 Gbps generation is 48 x 1 Gbps ports plus 4 x 10 Gbps. Depending on the product requirements, the 10 Gbps ports can be used for server attachment or as uplinks to build into a larger switch. At first glance the chassis application appears to be somewhat oversubscribed, with 48 Gbps of downlink but only 40 Gbps of uplink. In reality, when used in a chassis the uplink ports will run at 12.5 Gbps to get 50 Gbps of uplink bandwidth.

Though improved throughput and capacity is a big reason for running the backplane links faster, there is a more subtle benefit as well. Ethernet links are specified to run at some nominal clock frequency, modulo a tolerance measured in parts per million. The crystals driving the links can be slightly on the high or low side of the desired frequency yet still be within spec. If the backplane links happen to run slightly below nominal while the front panel links are slightly above, the result would be occasional packet loss simply because the backplane cannot keep up. When a customer notices packet loss during stress tests in their lab, it is very, very difficult to convince them it is expected and acceptable. Running the backplane links at a faster rate avoids the problem entirely: the backplane can always absorb line rate traffic from a front panel port.
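To put rough numbers on it (assuming a plus or minus 100 ppm clock tolerance on each link, a common Ethernet figure, though the exact value depends on the PHY spec):

# Worst-case clock mismatch between a front panel port and a backplane link,
# assuming +/-100 ppm tolerance on each.
front_panel_gbps = 10 * (1 + 100e-6)    # front panel running 100 ppm fast
backplane_gbps   = 10 * (1 - 100e-6)    # backplane running 100 ppm slow

excess_bits_per_sec = (front_panel_gbps - backplane_gbps) * 1e9
print(excess_bits_per_sec)              # ~2,000,000 bits/s the backplane can't absorb
# At 12.5 Gbps the backplane has 25% headroom, so the mismatch never matters.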


QFabric

This is one area where QFabric takes a different tack from a modular chassis: I doubt the links between QF/Node and Interconnect are overclocked even slightly. In QFabric those links are not board traces but lasers, and lasers are more picky about their data rate. In the Packet Pushers podcast the Juniper folks stated that the uplinks are required to be faster than the downlinks. That is, if the edge connections are 10 Gbps then the uplinks from Node to Interconnect have to be 40 Gbps. One reason for this is to avoid overruns due to clock frequency variance, and ensure the Interconnect can always absorb the link rate from an edge port without resorting to flow control.

Next article: flow control.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Tuesday, August 23, 2011

Making Stuff Up About QFabric

This is the first of several articles about the Juniper QFabric. I have not been briefed by Juniper, nor received any NDA information. These articles are written based on Juniper's public statements and materials available on the web, supplemented with details from a special Packet Pushers podcast, and topped off with a healthy amount of speculation and guessing about how it works.

[Figure: Juniper QFabric with Nodes, Interconnect, and Director]

QFabric consists of edge nodes wired to two or four extremely large QF/Interconnect chassis, all managed via out of band links to QF/Directors. Juniper emphasizes that the collection of nodes, interconnects, and directors should be thought of as a single distributed switch rather than as a network. Packet handling within the QFabric is intended to be opaque, and the individual nodes are not separately configured. It is supposed to behave like one enormous, geographically distributed switch.

Therefore, to brainstorm about how the distributed QFabric works, we should think in terms of how a modular switch works and how its functions might be distributed.


Exactly One Forwarding Decision

[Figure: Ingress line card with switch fabric, connected to central supervisor with fabric, connected to egress line card with fabric]

Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, the most troublesome of which are the consistency issues. There is a delay between when a packet is processed by the ingress line card and by the fabric, and again between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and it's very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.

[Figure: Control header prepended to frame]

Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the Ethernet preamble.

Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the port extender, a set of additional ports controlled remotely by a chassis switch. The link to the port extender will carry the control headers which would otherwise be restricted to the backplane. Similarly, several vendors sell stackable switches. Each unit in the stack can function as an independent switch, but can be connected via stack ports on the back to function together like a larger switch. The stack ports carry the prepended control headers from one stack member to the next, so the entire collection can function like a single forwarding plane.
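Actual control header formats are proprietary and vary from vendor to vendor, but just to make the idea concrete, here is a completely made-up layout and a sketch of prepending and stripping it:

import struct

# A made-up control header layout, purely for illustration; real formats are
# proprietary and differ in detail.
#   dest_node (2 bytes), dest_port (1), cos (1), flags (2, e.g. bit 0 = "already routed")
def prepend_control_header(frame, dest_node, dest_port, cos, flags=0):
    header = struct.pack("!HBBH", dest_node, dest_port, cos, flags)
    return header + frame

def strip_control_header(data):
    dest_node, dest_port, cos, flags = struct.unpack("!HBBH", data[:6])
    return (dest_node, dest_port, cos, flags), data[6:]

tagged = prepend_control_header(b"...original ethernet frame...", 7, 12, 3)
meta, frame = strip_control_header(tagged)
print(meta)   # (7, 12, 3, 0)

The ingress chip makes the one forwarding decision and fills in the header; every downstream chip just obeys it and the final egress chip strips it before the packet leaves the box.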


QFabric

In the Packet Pushers podcast and in an article on EtherealMind, the Interconnect is described as a Clos network with stages of the Clos implemented in cards in the front and back of the chassis. It is implemented using merchant silicon, not Juniper ASICs. The technology in the edge Node was not specified; it is my assumption that Juniper uses its own silicon there.

Forwarding decisions are made in the Nodes and sent to the Interconnect, which is a pure fabric with no decision making. This would be implemented by having the Nodes send control headers on their uplinks, in a format compatible with whatever merchant silicon is used in the Interconnect, plus additional information needed to support the QFabric features. Juniper would not allow themselves to be locked in to a particular chip supplier, so I'm sure the QF/Node implementation would be very flexible in how it creates those headers. A new QF/Interconnect with a different chipset would be supportable via a firmware upgrade to the edge nodes.

The QF/Interconnect would in turn forward the packet to its destination with the control header intact. The destination switch would perform whatever handling was indicated in the control information, discard the extra header, and forward the packet out the egress port.


Oversubscribed QF/Nodes

One interesting aspect of the first generation QF/Node is that it is oversubscribed. The QFX3500 has 480 Gbps of downlink capacity, in the form of 48 x 10G ports. It has 160 Gbps of uplink, via 4 x 40 Gbps ports. Oversubscribed line cards are not unheard of in modular chassis architectures, though it is generally the result of a follow-on generation of cards outstripping the capacity of the backplane. There have been chassis designs where the line cards were deliberately oversubscribed, but they are somewhat less common.

QFabric has a very impressive system to handle congestion and flow control, which will be the topic of a future article. The oversubscribed uplink is a slightly different variation, but in the end is really another form of congestion for the fabric to deal with. It would buffer what it can, and assert Ethernet flow control and/or some other means of backpressure to the edge ports if necessary.

Next article: link speeds.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Sunday, August 21, 2011

Consistency At All Levels

Twitter now wraps all links passing through the service with the t.co link shortener, which I wondered about a little while ago. Twitter engineers sweat the details, even the HTTP response headers are concise:

$ curl --head http://t.co/WXZtRHC
HTTP/1.1 301 Moved Permanently
Date: Sun, 21 Aug 2011 12:17:43 GMT
Server: hi
Location: http://codingrelic.geekhold.com/2011/07/tweetalytics.html
Cache-Control: private,max-age=300
Expires: Sun, 21 Aug 2011 12:22:43 GMT
Connection: close
Content-Type: text/html; charset=UTF-8

Oh, hi. Um, how are you?


Monday, August 15, 2011

An Awkward Segue to CPU Caching

Last week Andy Firth published The Demise of the Low Level Programmer, expressing dismay over the lack of low level systems knowledge displayed by younger engineers in the console game programming field. Andy's particular concerns deal with proper use of floating versus fixed point numbers, CPU cache behavior and branch prediction, bit manipulation, etc.

I have to admit a certain sympathy for this position. I've focussed on low level issues for much of my career. As I'm not in the games space, the specific topics I would offer differ somewhat: cache coherency with I/O, and page coloring, for example. Nonetheless, I feel a certain solidarity.

Yet I don't recall those topics being taught in school. I had classes which covered operating systems and virtual memory, but I distinctly remember being shocked at the complications the first time I encountered a system which mandated page coloring. Similarly, though I had a class on assembly programming, by the time I actually needed to work at that level I had to learn new instruction sets and many new techniques.

In my experience at least, schools never did teach such topics. This stuff is learned by doing, as part of a project or on the job. The difference now is that fewer programmers are learning it. It's not because programmers are getting worse: I interview a lot of young engineers, and their caliber is as high as I have ever experienced. It is simply that computing has grown a great deal in 20 years, there are a lot more topics available to learn, and frankly the cutting edge stuff has moved on. Even in the gaming space which spurred Andy's original article, big chunks of the market have been completely transformed. Twenty years ago casual gaming meant the Game Boy, an environment so constrained that heroic optimization efforts were required. Now casual gaming means web based games on social networks. The relevant skill set has changed.

I'm sure Andy Firth is aware of the changes in the industry. It's simply that we have a tendency to assume that markets where there is a lot of money being made will inevitably attract new engineers, and so there should be a steady supply of new low level programmers for consoles. Unfortunately I don't believe that is true. Markets which are making lots of money don't attract young engineers. Markets which are perceived to be growing do, and other parts of the gaming market are perceived to be growing faster.


 

Page Coloring

Because I brought it up earlier, we'll conclude with a discussion of page coloring. I am not satisfied with the Wikipedia page, which seems to have paraphrased a FreeBSD writeup describing page coloring as a performance issue. In some CPUs, albeit not current mainstream CPUs, coloring isn't just a performance issue. It is essential for correct operation.


Cache Index

[Figure: Least significant bits as cache line offset, next few bits as cache index]

Before fetching a value from memory the CPU consults its cache. The least significant bits of the desired address are an offset into the cache line, generally 4, 5, or 6 bits for a 16/32/64 byte cache line.

The next few bits of the address are an index to select the cache line. If the cache has 1024 entries, then ten bits would be used as the index. Things get a bit more complicated here due to set associativity, which lets entries occupy several different locations to improve utilization. A two way set associative cache of 1024 entries would take 9 bits from the address and then check two possible locations. A four way set associative cache would use 8 bits. Etc.
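In code, the carve-up of an address into offset and index looks like this, using the example geometries from the text:

# Split an address into cache line offset and cache index.
def offset_and_index(addr, line_bytes, num_entries, ways):
    offset_bits = line_bytes.bit_length() - 1       # 64 byte line -> 6 bits
    sets = num_entries // ways                      # associativity shrinks the index
    index_bits = sets.bit_length() - 1
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    return offset_bits, index_bits, offset, index

# 1024-entry cache, 64-byte lines: direct mapped vs. 2-way vs. 4-way
for ways in (1, 2, 4):
    print(offset_and_index(0x12345678, 64, 1024, ways)[:2])
# (6, 10), (6, 9), (6, 8) bits of offset and index respectively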


Page tag

[Figure: Least significant bits as page offset, upper bits as page tag]

Separately, the CPU defines a page size for the virtual memory system. 4 and 8 Kilobytes are common. The least significant bits of the address are the offset within the page, 12 or 13 bits for 4 or 8 K respectively. The most significant bits are a page number, used by the CPU cache as a tag. The hardware fetches the tag of the selected cache lines to check against the upper bits of the desired address. If they match, it is a cache hit and no access to DRAM is needed.

To reiterate: the tag is not the remaining bits of the address above the index and offset. The bits to be used for the tag are determined by the page size, and not directly tied to the details of the CPU cache indexing.


Virtual versus Physical

In the initial few stages of processing a load instruction the CPU has only the virtual address of the desired memory location. It will look up the virtual address in its TLB to get the physical address, but using the virtual address to access the cache is a performance win: the cache lookup can start earlier in the CPU pipeline. It's especially advantageous to use the virtual address for the cache index, as that processing happens earlier.

The tag is almost always taken from the physical address. Virtual tagging complicates shared memory across processes: the same physical page would have to be mapped at the same virtual address in all processes. That is an essentially impossible requirement to put on a VM system. Tag comparison happens later in the CPU pipeline, when the physical address will likely be available anyway, so it is (almost) universally taken from the physical address.

This is where page coloring comes into the picture.


 

Virtually Indexed, Physically Tagged

From everything described above, the size of the page tag is independent of the size of the cache index and offset. They are separate decisions, and frankly the page size is generally mandated. It is kept the same for all CPUs in a given architectural family even as they vary their cache implementations.

Consider then, the impact of a series of design choices:

  • 32 bit CPU architecture
  • 64 byte cache line: 6 bits of cache line offset
  • 8K page size: 19 bits of page tag, 13 bits of page offset
  • 512 entries in the L1 cache, direct mapped. 9 bits of cache index.
  • virtual indexing, for a shorter CPU pipeline. Physical tagging.
  • write back
[Figure: Virtually indexed, physically tagged, with 2 bits of page color]

What does this mean? It means the lower 15 bits of the virtual address and the upper 19 bits of the physical address are referenced while looking up items in the cache. Two of the bits overlap between the virtual and physical addresses. Those two bits are the page color. For proper operation, this CPU requires that all processes which map in a particular page do so at the same color. Though in theory the page could be any color so long as all mappings are the same, in practice the virtual color bits are set the same as the underlying physical page.
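Working through those numbers (this just restates the arithmetic in the paragraph above):

# The design choices above, worked through to find the color bits.
line_offset_bits = 6            # 64 byte cache line
index_bits       = 9            # 512 entries, direct mapped
page_offset_bits = 13           # 8K pages

virt_bits_used = line_offset_bits + index_bits      # 15: virtual index + offset
color_bits = virt_bits_used - page_offset_bits      # 2 bits of overlap = page color
print(color_bits)                                   # 2

def page_color(addr):
    """The two address bits just above the page offset."""
    return (addr >> page_offset_bits) & ((1 << color_bits) - 1)

# Two mappings of the same physical page must agree in these bits, or a write
# through one mapping and a read through the other will hit different cache lines.
print(page_color(0x0000A000), page_color(0x0000C000))   # 1 2: different colors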

The impact of not enforcing page coloring is dire: a write in one process will be stored in one cache line, while a read from another process will access a different cache line.

Page coloring like this places quite a burden on the VM system, and one which would be difficult to retrofit into an existing VM implementation. OS developers would push back against new CPUs which proposed to require coloring, and you used to see CPU designs making fairly odd tradeoffs in their L1 cache because of it. HP PA-RISC used a very small (but extremely fast) L1 cache. I think they did this to use direct mapped virtual indexing without needing page coloring. There were CPUs with really insane levels of set associativity in the L1 cache, 8 way or even 16 way. This reduced the number of index bits to the point where a virtual index wouldn't require coloring.