Coding Relic

Saturday, September 17, 2011

VXLAN Part Deux

This is the second of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

I strongly recommend reading the first article before this one, to provide background.

UDP Encapsulation

In addition to the IP tunnel header and a VXLAN header, there is also an Outer UDP header. One might reasonably ask why it is there, as VXLAN could have been directly encapsulated within IP.

Four paths between routers, hashing headers chooses one.

The UDP header serves an interesting purpose: it isn't there to perform the multiplexing role UDP normally serves. When switches have multiple paths available to a destination, whether an L2 trunk or L3 multipathing, the specific link is chosen by hashing packet headers. Most switch hardware is quite limited in how it computes the hash, using only the outermost L2/L3/L4 headers. Some chips can examine the inner headers of long-established tunneling protocols like GRE/MAC-in-MAC/IP-in-IP. For a new protocol like VXLAN, it would take years for silicon support for the inner headers to become common.

Therefore the VTEP calculates a hash of the inner packet headers, and places it in the source UDP port where it feeds into LACP/ECMP hash calculation. Existing switch chips get proper load balancing using only the Outer L2/L3/L4 headers, at the cost of 8 bytes of overhead.
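As a rough sketch of the idea, assuming a VTEP that hashes only the inner MAC addresses: the function names, the choice of CRC32, and the modulo link selection are all illustrative assumptions, not anything taken from the VXLAN specification or from real switch silicon.

```python
import zlib

def udp_source_port(inner_dst_mac: bytes, inner_src_mac: bytes,
                    extra_fields: bytes = b"") -> int:
    """Fold a hash of inner header fields into a 16-bit UDP source port.

    Which fields to hash and which hash to use are assumptions here;
    CRC32 is only a convenient, stable stand-in.
    """
    digest = zlib.crc32(inner_dst_mac + inner_src_mac + extra_fields)
    # XOR the upper and lower halves so all 32 bits influence the port.
    return (digest ^ (digest >> 16)) & 0xFFFF

def pick_link(outer_src_port: int, link_count: int) -> int:
    """Stand-in for a switch's ECMP/LACP hash over the outer headers."""
    return outer_src_port % link_count

vm_a = bytes.fromhex("020000000001")
vm_b = bytes.fromhex("020000000002")
vm_c = bytes.fromhex("020000000003")

# Two inner flows from the same VTEP get their own outer source ports,
# so a switch hashing only outer headers can still spread them over links.
for dst in (vm_b, vm_c):
    port = udp_source_port(dst, vm_a)
    print(f"flow to {dst.hex()}: source port {port}, link {pick_link(port, 4)}")
```

The point of the sketch is only that the switch never has to look past the outer UDP header: all of the per-flow entropy has already been folded into the source port.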

VTEP calculates hash of inner packet headers, places it in the UDP source port.

New protocols sometimes encapsulate themselves inside UDP headers to more easily traverse firewalls and NAT devices. That isn't what VXLAN is doing; it would be somewhat ludicrous to put firewalls between subnets within a datacenter. In fact, the way VXLAN uses its UDP header can make firewall traversal a bit more challenging. The inner packet headers can hash to a well known UDP port number like 53, making it look like a DNS response, but a firewall attempting to inspect the contents of the frame will not find a valid DNS packet. It would be important to disable any deep packet inspection for packets traveling between VTEP endpoints. If VXLAN is used to extend an L2 network all the way across a WAN the firewall question becomes more interesting. I don't think it's a good idea to have a VXLAN cross a WAN, but that will have to be a topic for another day.

VTEP Learning

VTEP Table of MAC:OuterIP mappings.

The VTEP examines the destination MAC address of frames it handles, looking up the IP address of the VTEP for that destination. This MAC:OuterIP mapping table is populated by learning, very much like an L2 switch discovers the port mappings for MAC addresses. When a VM wishes to communicate with another VM it generally first sends a broadcast ARP, which its VTEP will send to the multicast group for its VNI. All of the other VTEPs will learn the Inner MAC address of the sending VM and Outer IP address of its VTEP from this packet. The destination VM will respond to the ARP via a unicast message back to the sender, which allows the original VTEP to learn the destination mapping as well.
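The learning behavior described above can be sketched in a few lines. This is a toy model under stated assumptions: the class name, method names, MAC and IP addresses, and the multicast group are all hypothetical, and a real VTEP would also age entries and handle multicast MACs.

```python
from typing import Dict, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"

class VtepTable:
    """MAC:OuterIP table for one VTEP, populated by learning."""

    def __init__(self, group_for_vni: Dict[int, str]):
        self.group_for_vni = group_for_vni             # VNI -> multicast group
        self.remote: Dict[Tuple[int, str], str] = {}   # (VNI, MAC) -> outer IP

    def learn(self, vni: int, inner_src_mac: str, outer_src_ip: str) -> None:
        """Record which VTEP a received frame's inner source MAC sits behind."""
        self.remote[(vni, inner_src_mac)] = outer_src_ip

    def outer_destination(self, vni: int, inner_dst_mac: str) -> str:
        """Known unicast goes to one VTEP; everything else goes to the group."""
        if inner_dst_mac == BROADCAST:
            return self.group_for_vni[vni]
        return self.remote.get((vni, inner_dst_mac), self.group_for_vni[vni])

groups = {5000: "239.1.1.1"}
vtep_a = VtepTable(groups)   # imagined to sit at outer IP 10.0.0.1
vtep_b = VtepTable(groups)   # imagined to sit at outer IP 10.0.0.2
mac_a, mac_b = "02:00:00:00:00:0a", "02:00:00:00:00:0b"

# VM A broadcasts an ARP: VTEP A sends it to the group, VTEP B learns A.
print(vtep_a.outer_destination(5000, BROADCAST))
vtep_b.learn(5000, mac_a, "10.0.0.1")

# VM B replies with unicast: VTEP B now knows where A is, and VTEP A learns B.
print(vtep_b.outer_destination(5000, mac_a))
vtep_a.learn(5000, mac_b, "10.0.0.2")
print(vtep_a.outer_destination(5000, mac_b))
```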

When a MAC address moves, the other VTEPs find its new location by the same learning process, using the first packet they see from its new VTEP. Why might a MAC address move? Consider a protocol like VRRP, which fails over a MAC address between two redundant servers. When ownership of a VRRP MAC address switches from one VM to another, all of the other VTEPs on the network need to learn the new MAC:OuterIP association. VRRP typically sends a gratuitous ARP when it fails over, and as a broadcast packet that ARP will be sent to all VTEPs. They learn the new MAC:OuterIP association from that packet.

VRRP nicely sends a gratuitous ARP when the MAC address moves, but not all MAC moves will do so. Consider the case where a running VM is frozen and moved to another server. The VM will resume where it left off, its ARP table fully populated for nodes it is communicating with. It won't send a gratuitous ARP because the VM has no idea that it has moved to a new vserver, and it won't send ARPs for addresses already in its table either. It's possible I've missed some subtlety, but I don't see how remote VTEPs would quickly learn the new location of the MAC address. I think they continue sending to the incorrect VTEP until their entries time out, and then they start flooding to the VXLAN multicast address.
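To make that failure mode concrete, here is a small sketch of a learned table with aging. It assumes no frame from the moved VM happens to reach the remote VTEP, and the 300 second timeout, the class name, and the addresses are hypothetical values chosen for illustration.

```python
from typing import Dict, Optional, Tuple

class AgingVtepTable:
    """Learned MAC -> OuterIP entries that expire after max_age_s seconds."""

    def __init__(self, max_age_s: float = 300.0):   # timeout is an assumption
        self.max_age_s = max_age_s
        self.entries: Dict[str, Tuple[str, float]] = {}

    def learn(self, mac: str, outer_ip: str, now: float) -> None:
        self.entries[mac] = (outer_ip, now)

    def lookup(self, mac: str, now: float) -> Optional[str]:
        """Return the learned VTEP, or None (meaning: flood to the group)."""
        entry = self.entries.get(mac)
        if entry is None:
            return None
        outer_ip, last_seen = entry
        if now - last_seen > self.max_age_s:
            del self.entries[mac]       # entry timed out
            return None
        return outer_ip

table = AgingVtepTable()
table.learn("02:00:00:00:00:0b", "10.0.0.2", now=0.0)

# The VM silently moves to the VTEP at 10.0.0.3 at t=10, with no gratuitous ARP.
print(table.lookup("02:00:00:00:00:0b", now=100.0))   # still the old VTEP
print(table.lookup("02:00:00:00:00:0b", now=301.0))   # timed out, so flood
```

Between the move and the timeout, frames keep going to the old VTEP; only after the entry ages out does flooding give the new VTEP a chance to be learned.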

Multicast frame delivered to 3 VTEPs but dropped before reaching one.

Though it is appealing to let VTEPs track each other automatically using multicast and learning, I suspect beyond a certain scale of network that isn't going to work very well. Multicast frames are not reliably delivered, and because they fan out to all nodes they tend to become ever less reliable as the number of nodes increases. The RFC mentions the possibility of other mechanisms to populate the VTEP tables, including centralized controllers. I suspect a controller will be an important product to offer. Troubleshooting why subsets of VMs transiently lose the ability to communicate after a move or failover would be really annoying. Small networks could rely on multicast, while larger networks could fall back to it if the controller fails.

Suggestions

On the off chance that people read this far, I'll offer a couple suggestions for modifications to the VXLAN specification based on discussion earlier in the article.

If VXLAN is used to connect remote facilities, it is likely to traverse firewalls. When the VTEP calculates a hash of the Inner headers to place in the UDP source port field, I'd recommend it always set the most significant bit. This restricts the hash to 15 bits, values 32768 - 65535, but avoids any low numbered port number with a defined meaning like DNS. This should still result in good LACP/ECMP hashing, as this makes VXLAN packets look like ephemeral ports used by UDP client applications.
When a VTEP sees a new source MAC address from a local VM, flood the packet even if the OuterIP of the destination is already known. This gives remote VTEPs a better chance of noticing a MAC move. The VTEP already had to keep track of local MAC addresses to properly deliver received frames, so I suspect there is already a local source learning function.
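Both suggestions are small enough to sketch. The function names are made up for illustration, and the second function assumes the VTEP keeps a simple set of local MAC addresses it has already seen.

```python
from typing import Set

def firewall_friendly_source_port(inner_hash: int) -> int:
    """First suggestion: force the top bit so the port lands in 32768..65535."""
    return 0x8000 | (inner_hash & 0x7FFF)

def should_flood(local_macs: Set[str], src_mac: str, dst_known: bool) -> bool:
    """Second suggestion: flood when a local source MAC is new,
    even if the destination's OuterIP is already known."""
    is_new_source = src_mac not in local_macs
    local_macs.add(src_mac)
    return is_new_source or not dst_known

# A hash that would have looked like DNS (53) is pushed into the ephemeral range.
print(firewall_friendly_source_port(53))

seen: Set[str] = set()
print(should_flood(seen, "02:00:00:00:00:0b", dst_known=True))   # first frame
print(should_flood(seen, "02:00:00:00:00:0b", dst_known=True))   # later frames
```

Setting the top bit keeps 15 bits of the hash, so there are still 32768 distinct source ports feeding the LACP/ECMP calculation.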

Next article: A few final VXLAN topics.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Friday, September 16, 2011

The Care and Feeding of VXLAN

This is the first of several articles about VXLAN. I have not been briefed by any of the companies involved, nor received any NDA information. These articles are written based on public statements and discussions available on the web.

Modern datacenter networks are designed around the requirements for virtualization. There are a number of physical servers connected to the network, each hosting a larger number of virtual servers. The Gentle Reader may be familiar with VMware on the desktop, with NAT to let VMs communicate with the Internet. Datacenter VMs don't work that way. Each VM has its own IP and MAC address, with software beneath the VMs to function as a switch. On the Ethernet we see a huge number of MAC addresses, from all of the VMs throughout the facility.

To rebalance server load it is necessary to support moving a virtual machine from a heavily loaded server to another more lightly loaded one. It is essentially impossible to change the IP address of a VM as part of this move. Though modern server OSes can change IP address without rebooting, doing so interrupts the service as client connections are closed and caches spread far and wide have to time out. To be useful, VM moves have to be mostly invisible and not interrupt services.

Datacenter network, moving a VM from one physical server to another.

The most straightforward way to accomplish this is to have the servers sit in a single subnet and single broadcast domain, which leads to some truly enormous L2 networks. It is putting pressure on switch designs to support tens of thousands of MAC addresses (thankfully via hash engines, rather than CAMs). Everything we'd learned about networking to this point drove towards Layer3 networks for scalability, but it is all being rewritten for virtualized datacenters.

Enter VXLAN

At VMWorld several weeks ago there was an announcement of VXLAN, Virtual eXtensible LANs. I paid little attention at the time, but should have paid more. VXLAN is an encapsulation scheme to carry L2 frames atop L3 networks. Described like that it doesn't sound very interesting, but the details are well thought out. Additionally the RFC is authored by representatives from major virtualization vendors VMware, Citrix, and Red Hat, and by datacenter switch vendors Cisco and Arista. It will appear in real networks relatively quickly.

The bulk of the VXLAN implementation is handled via a tunneling endpoint. This will generally reside within the virtual server, in software running under the VMs. A new component called the VXLAN Tunnel End Point (VTEP) encapsulates frames inside an L3 tunnel. There can be 2^24 (roughly 16.7 million) VXLANs, identified by a 24 bit VXLAN Network Identifier (VNI). The VTEP maintains a table of known destination MAC addresses, and stores the IP address of the tunnel to the remote VTEP to use for each. Unicast frames between VMs are sent directly to the unicast L3 address of the remote VTEP. Multicast and broadcast frames from VMs are sent to a multicast IP group associated with the VNI. The spec is vague on how VNIs map to a multicast IP address, merely saying that a management plane configures it along with VM membership in the VXLAN. Multicast distribution in most networks is something of an afterthought; making the address configurable allows VXLAN to cope with whatever facilities exist.
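For a feel of what a 24 bit VNI looks like on the wire, here is a sketch of packing and unpacking the VXLAN header. The layout (one flags byte with the VNI-valid bit, three reserved bytes, a three byte VNI, one reserved byte) is my reading of the published VXLAN specification; the function names are my own, and the sketch ignores the outer Ethernet/IP/UDP headers entirely.

```python
import struct

VNI_VALID_FLAG = 0x08          # "I" flag: the VNI field is valid
MAX_VNI = 2**24 - 1            # 24 bits of VXLAN Network Identifier

def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError("VNI must fit in 24 bits")
    # First word: flags byte then 24 reserved bits.
    # Second word: 24-bit VNI then 8 reserved bits.
    return struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)

def unpack_vni(header: bytes) -> int:
    """Recover the VNI from the first 8 bytes of a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8

header = pack_vxlan_header(5000)
print(header.hex(), unpack_vni(header), MAX_VNI + 1)
```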

Overlay network connecting VMs through an L3 switch.

A Maze of Twisty Little Passages

VXLAN encapsulates L2 packets from the VMs within an Outer IP header to send across an IP network. The receiving VTEP decapsulates the packet, and consults the Inner headers to figure out how to deliver it to its destination.

The encapsulated packet retains its Inner MAC header and optional Inner VLAN, but has no Inner CRC. When VMs send packets to other VMs within a server they do not calculate a CRC; one is added by a physical NIC when the packet leaves the server. As the VTEP is a software component within the server, prior to hitting any NIC, the frames have no CRC when the VTEP gets them. Therefore there is no integrity protection end to end, from the originating VM to the receiving VM. This is another case where even on L2 networks, the Ethernet CRC does not work the way our intuition would suggest.

Next article: UDP

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, September 4, 2011

Telling Strangers Where You Are

foursquare Ten One Hundred badge, for a thousand checkins.

Not quite two years ago in this space I wrote about how I use foursquare. I've continued using the service since then, passing 1,000 checkins several months ago.

The first generation of location based services like foursquare have paid a lot of attention to privacy concerns. Explicit connection to other users is required in order to allow them to see your checkins. To do otherwise would have been perceived as creepy, the go-to label for vague privacy concerns. For those who do want to make their checkins public, Foursquare has an option to publish checkins to Twitter.

Yet social norms evolve, even in the span of just two years. Facebook Places and Google+ both offer checkins as a feature of their respective services. I've been periodically checking in on Google+ for several months. For routine trips I check in to a very limited circle of people, not so much out of concern about privacy as to not be spammy. For well-known venues I've been checking in publicly, and something fascinating happens: well-known venues are really well-known. Lots of people have been there, and they chime in with commentary and suggestions of things to see and do. Our trip to the Monterey Bay Aquarium was much improved by real-time suggestions from Google+ users, and pictures from the trip in turn made a couple other people think about going back.

Jeff Jarvis has long made the argument about the benefits of publicness, and that overemphasizing concerns about privacy undermines the benefits we could get by being connected. We use nebulous terms in justifying privacy like creepy, and stifle discussion of the value of openness. Our brains are really good at concocting (unlikely) scenarios of the bad things which could happen from sharing information, and not so good at seeing the good which can come of it. I'm definitely seeing that effect with public checkins: it seems scary, yet there is tremendous value in sharing them widely.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the social label.

Friday, August 26, 2011

QFabric Conclusion

This is the fourth and final article in a series exploring the Juniper QFabric. Earlier articles provided an overview, a discussion of link speed, and musings on flow control. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article attempts to cover a few loose ends, and wraps up the series.

As with previous days, the flow control post sparked an interesting discussion on Google+.

Whither Protocols?

Director connected to edge node and peered with another switch.

For a number of years switch and router manufacturers competed on protocol support, implementing various extensions to OSPF/BGP/SpanningTree/etc in their software. QFabric is almost completely silent about protocols. In part this is a marketing philosophy: positioning the QFabric as a distributed switch instead of a network means that the protocols running within the fabric are an implementation detail, not something to talk about. I don't know what protocols are run between the nodes of the QFabric, but I'm sure it's not Spanning Tree and OSPF.

Yet QFabric will need to connect to other network elements at its edge, where the datacenter connects to the outside world. Presumably the routing protocols it needs are implemented in the QF/Director and piped over to whichever switch ports connect to the rest of the network. If there are multiple peering points, they need to communicate with the same entity and a common routing information base.

Flooding Frowned Upon

The edge Nodes have an L2 table holding 96K MAC addresses. This reinforces the notion that switching decisions are made at the ingress edge, every Node can know how to reach destination MAC addresses at every port. There are a few options for distributing MAC address information to all of the nodes, but I suspect that flooding unknown addresses to all ports is not the preferred mechanism. If flooding is allowed at all, it would be carefully controlled.

Much of modern datacenter design revolves around virtualization. The VMWare vCenter (or equivalent) is a single, authoritative source of topology information for virtual servers. By hooking to the VM management system, the QFabric Director could know the expected port and VLAN for each server MAC address. The Node L2 tables could be pre-populated accordingly.

By hooking to the VM management console QFabric could also coordinate VLANs, flow control settings, and other network settings with the virtual switches running in software.

NetOps Force Multiplier

Where previously network engineers would be configuring dozens of switches, QFabric now proposes to manage a single distributed switch. Done well, this should be a substantial time saver. There will of course be cases where the abstraction leaks and the individual Nodes have to be dealt with. The failure modes in a distributed switch are simply different. It's unlikely that a single line card within a chassis will unexpectedly lose power, but it's almost certain that Nodes occasionally will. Nonetheless, the cost to operate QFabric seems promising.

Conclusion

QFabric is an impressive piece of work, clearly the result of several years' effort. Though the Interconnects use merchant silicon, Juniper almost certainly started working with the manufacturer at the start of the project to ensure the chip would meet their needs.

The most interesting part of QFabric is its flow control mechanism, for which Juniper has made some pretty stunning claims. A flow control mechanism with fairness, no packet loss, and quick reaction to changes over such a large topology is an impressive feat.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Thursday, August 25, 2011

QFabric: Flow Control Considered Harmful

This is the third in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on flow control.

Yesterday's post sparked an excellent discussion on Google+, disagreeing about overclocking of backplane links and suggesting LACP hashing as a reason for the requirement for 40 Gbps links. Ken Duda is a smart guy.

XON/XOFF

Ethernet includes (optional) link level flow control. An Ethernet device can signal its neighbor to stop sending traffic, requiring the neighbor to buffer packets until told it can resume sending. This flow control mechanism is universally implemented in the MAC hardware, not requiring software intervention on every flow control event.

Switch chip with one congested port flow controlling the entire link.

The flow control is for the whole link. For a switch with multiple downstream ports, there is no way to signal back to the sender that only some of the ports are congested. In this diagram, a single congested port requires the upstream to be flow controlled, even though the other port could accept more packets. Ethernet flow control suffers from head of line blocking: a single congested port will choke off traffic to other uncongested ports.

IEEE has tried to address this in two ways. Priority-based flow control, defined in IEEE 802.1Qbb, makes each of the eight classes of service flow control independently. Bulk data traffic would no longer block VoIP, for example. IEEE 802.1au is defining a congestion notification capability to send an explicit notification to compatible endstations, asking them to slow down. Per-priority pause is already appearing in some products. Congestion notification involves NICs and client operating systems, and is slower going.

QFabric

Juniper's whitepaper on QFabric has this to say about congestion:

Finally, the only apparent congestion is due to the limited bandwidth of ingress and egress interfaces and any congestion of egress interfaces does not affect ingress interfaces sending to non-congested interfaces; this noninterference is referred to as “non-blocking”

QFabric very clearly does not rely on Ethernet flow control for the links between edge nodes and interconnect, not even per-priority. It does something else, something which can reflect the queue state of an egress port on the far side of the fabric back to control the ingress. However, Juniper has said nothing that I can find about how this works. They rightly consider it part of their competitive advantage.

So... let's make stuff up. How might this be done?

Juniper has said that the Interconnect uses merchant switch silicon, but hasn't said anything about the edge nodes. As all the interesting stuff is done at the edge and Juniper has a substantial ASIC development capability, I'd assume they are using their own silicon there. Whatever mechanism they use for flow control would be implemented in the Nodes.

Most of the mechanisms I can think of require substantial buffering. QFX3500, the first QFabric edge node, has 4x40 Gbps ports connecting to the Interconnect. That is 20 gigabytes per second. A substantial bank of SRAM could buffer that traffic for a fraction of a second. For example, 256 Megabytes could absorb a bit over 13 milliseconds worth of traffic. 13 milliseconds provides a lot of time for real-time software to swing into action.
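The arithmetic behind those numbers, as a quick check. It assumes "256 Megabytes" means 256 x 2^20 bytes, which is what makes the result come out a bit over 13 milliseconds; with 256 x 10^6 bytes it would be 12.8 ms.

```python
# 4 x 40 Gbps uplinks on the QFX3500, per the figures quoted above.
uplink_bits_per_s = 4 * 40e9
uplink_bytes_per_s = uplink_bits_per_s / 8          # 20e9 bytes/s

# Assumption: the buffer is 256 binary megabytes (256 * 2**20 bytes).
buffer_bytes = 256 * 2**20

absorb_ms = buffer_bytes / uplink_bytes_per_s * 1000
print(f"{uplink_bytes_per_s / 1e9:.0f} GB/s, buffer absorbs {absorb_ms:.2f} ms")
```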

For example:

The memory could be used for egress buffering at the edge ports. Each Node could report its buffer status to all other nodes in the QFabric, several thousand times per second. Adding a few thousand packets per second from each switch is an insignificant load on the fabric. From there we can imagine a distributed flow control mechanism, which several thousand times per second would re-evaluate how much traffic it is allowed to send to each remote node in the fabric. Ethernet flow control frames could be sent to the sending edge port to slow it down.

Or rather, smarter people than me can imagine how to construct a distributed flow control mechanism. My brain hurts just thinking about it.
Hardware counters could track the rate each Node is receiving traffic from each other Node, and software could report this several times per second. Part of the memory could be used as uplink buffering, with traffic shapers controlling the rate sent to each remote Node. Software could adjust the traffic shaping several thousand times per second to achieve fairness.
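To give "adjust the traffic shaping to achieve fairness" some shape, here is a sketch of one textbook definition of fairness, max-min fair sharing, applied to a single congested egress port. This is a technique I am swapping in for illustration; nothing here is claimed to be how QFabric or any Juniper product works, and the node names and rates are invented.

```python
from typing import Dict

def max_min_fair_share(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split capacity among senders: small demands are met in full,
    and the remainder is divided equally among the rest."""
    allocation: Dict[str, float] = {}
    remaining = capacity
    pending = dict(demands)
    while pending:
        share = remaining / len(pending)
        satisfied = {n: d for n, d in pending.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than an equal share: cap them all.
            for name in pending:
                allocation[name] = share
            break
        for name, demand in satisfied.items():
            allocation[name] = demand
            remaining -= demand
            del pending[name]
    return allocation

# Three ingress Nodes (hypothetical) offering traffic to one 10 Gbps egress port.
shaper_rates = max_min_fair_share(10.0, {"node_a": 2.0, "node_b": 8.0, "node_c": 8.0})
print(shaper_rates)
```

In the imagined scheme, each Node would recompute something like this from the reported counters and reprogram its shapers accordingly.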

Substantial uplink buffering also helps with oversubscribed uplinks. QFX3500 has 3:1 oversubscription.

I'll reiterate that the bullet points above are not how QFabric actually works, I made them up. Juniper isn't revealing how QFabric flow control works. If it achieves the results claimed it is a massive competitive advantage. I'm listing these to illustrate a point: a company willing to design and support its own switch silicon can take a wholly different approach from the rest of the industry. In the best case, they end up with an advantage for several years. In the worst case, they don't keep up with merchant silicon. I've been down both of those paths, the second is not fun.

There are precedents for the kind of systems described above: Infiniband reports queue information to neighbor IB nodes, and from this works out a flow control policy. Infiniband was designed in an earlier age when software wouldn't be able to handle the strict timing requirements, so it is defined in terms of hardware token buckets.

One thing I am sure of is that the QF/Director does not get involved in flow control. That would be a scalability and reliability problem. Juniper stated that the QFabric will continue to forward traffic even if the Director nodes have failed, the flow control mechanism cannot rely on the Directors.

Next article: wrap-up.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, August 24, 2011

The Backplane Goes To Eleven

This is the second in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on link speeds.

How Fast is Fast?

Ethernet link speeds are rigidly specified by the IEEE 802.3 standards, for example at 1/10/40/100/etc Gbps. It is important that Ethernet implementations from different vendors be able to interoperate.

Backplane links face fewer constraints, as there is little requirement for interoperability between implementations. Even if one wanted to, it isn't possible to plug a card from one vendor into another vendor's chassis; they simply don't fit. Therefore backplane links within a chassis have been free to tweak their link speeds for better performance. In the 10 Gbps generation of products backplane links have typically run at 12.5 Gbps. In the 40 Gbps Ethernet generation I'd expect 45 or 50 Gbps backplane links (I don't really know, I no longer work in that space). A well-designed SERDES for a particular speed will have a bit of headroom, enabling faster operation over high quality links like a backplane. Running them faster tends to be possible without heroic effort.

Switch chip with 48 x 1Gbps downlinks and 4 x 10/12.5 Gbps uplinks.

A common spec for switch silicon in the 1Gbps generation is 48 x 1 Gbps ports plus 4 x 10 Gbps. Depending on the product requirements, 10 Gbps ports can be used for server attachment or as uplinks to build into a larger switch. At first glance the chassis application appears to be somewhat oversubscribed, with 48 Gbps of downlink but only 40 Gbps of uplink. In reality, when used in a chassis the uplink ports will run at 12.5 Gbps to get 50 Gbps of uplink bandwidth.
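The ratios work out as follows; the helper function name is just for illustration.

```python
def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio above 1.0 means the downlinks can offer more than the uplinks carry."""
    return downlink_gbps / uplink_gbps

downlink = 48 * 1.0            # 48 x 1 Gbps ports
nominal_uplink = 4 * 10.0      # 4 x 10 Gbps at the standard rate
fast_uplink = 4 * 12.5         # 4 x 12.5 Gbps on a backplane

print(oversubscription(downlink, nominal_uplink))   # 1.2, slightly oversubscribed
print(oversubscription(downlink, fast_uplink))      # 0.96, uplinks have headroom
```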

Though improved throughput and capacity is a big reason for running the backplane links faster, there is a more subtle benefit as well. Ethernet links are specified to run at some nominal clock frequency, modulo a tolerance measured in parts per million. The crystals driving the links can be slightly on the high or low side of the desired frequency yet still be within spec. If the backplane links happen to run slightly below nominal while the front panel links are slightly higher, the result would be occasional packet loss simply because the backplane cannot keep up. When a customer notices packet loss during stress tests in their lab it is very, very difficult to convince them it is expected and acceptable. Running the backplane links at a faster rate avoids the problem entirely, the backplane can always absorb line rate traffic from a front panel port.
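A worked example of the clock tolerance problem, assuming a tolerance of +/-100 ppm. That figure is my assumption for illustration (it is the tolerance I recall for common Ethernet PHYs); the argument holds for any nonzero tolerance.

```python
PPM = 100                      # assumed clock tolerance, parts per million
NOMINAL_BPS = 10e9             # 10 Gbps front panel port

front_panel_fast = NOMINAL_BPS * (1 + PPM / 1e6)     # crystal on the high side
backplane_slow = NOMINAL_BPS * (1 - PPM / 1e6)       # same nominal rate, low side
shortfall_bps = front_panel_fast - backplane_slow

# The same backplane link overclocked to 12.5 Gbps, still on the low side.
overclocked_slow = 12.5e9 * (1 - PPM / 1e6)

print(f"shortfall at equal nominal rates: {shortfall_bps / 1e6:.1f} Mbps")
print(f"overclocked backplane keeps up: {overclocked_slow > front_panel_fast}")
```

A couple of megabits per second is tiny, but under sustained line rate load it has nowhere to go except a buffer that eventually overflows.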

QFabric

This is one area where QFabric takes a different tack from a modular chassis: I doubt the links between QF/Node and Interconnect are overclocked even slightly. In QFabric those links are not board traces, they are lasers, and lasers are more picky about their data rate. In the Packet Pushers podcast the Juniper folks stated that the uplinks are required to be faster than the downlinks. That is, if the edge connections are 10 Gbps then the uplinks from Node to Interconnect have to be 40 Gbps. One reason for this is to avoid overruns due to clock frequency variance, and ensure the Interconnect can always absorb the link rate from an edge port without resorting to flow control.

Next article: flow control.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Tuesday, August 23, 2011

Making Stuff Up About QFabric

This is the first of several articles about the Juniper QFabric. I have not been briefed by Juniper, nor received any NDA information. These articles are written based on Juniper's public statements and materials available on the web, supplemented with details from a special Packet Pushers podcast, and topped off with a healthy amount of speculation and guessing about how it works.

Juniper QFabric with Nodes, Interconnect, and Director.

QFabric consists of edge nodes wired to two or four extremely large QF/Interconnect chassis, all managed via out of band links to QF/Directors. Juniper emphasizes that the collection of nodes, interconnects, and directors should be thought of as a single distributed switch rather than as a network. Packet handling within the QFabric is intended to be opaque, and the individual nodes are not separately configured. It is supposed to behave like one enormous, geographically distributed switch.

Therefore to try to brainstorm about how the distributed QFabric works we should think in terms of how a modular switch works, and how its functions might be distributed.

Exactly One Forwarding Decision

Ingress line card with switch fabric, connected to central supervisor with fabric, connected to egress line card with fabric.

Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, most troublesome of which are the consistency issues. There is a delay between when a packet is processed by the ingress line card and the fabric, and between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and it's very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.

Control header prepended to frame.

Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the Ethernet preamble.
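A toy model of the idea, with an entirely hypothetical header layout. Real switch chips use proprietary formats and, as noted above, tuck the header into the preamble rather than growing the frame; the sketch simply prepends bytes to keep things simple.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Hypothetical layout: egress card (1 byte), egress port (1 byte),
# rewrite index (2 bytes). Invented purely for illustration.
_HEADER = struct.Struct("!BBH")

@dataclass(frozen=True)
class ControlHeader:
    egress_card: int
    egress_port: int
    rewrite_index: int     # which rewrite/encapsulation to apply at egress

def prepend(header: ControlHeader, frame: bytes) -> bytes:
    """Ingress: make the forwarding decision once and attach it to the frame."""
    return _HEADER.pack(header.egress_card, header.egress_port,
                        header.rewrite_index) + frame

def strip(tagged: bytes) -> Tuple[ControlHeader, bytes]:
    """Egress: read the decision, then drop the header before transmitting."""
    fields = _HEADER.unpack_from(tagged)
    return ControlHeader(*fields), tagged[_HEADER.size:]

frame = bytes.fromhex("020000000002020000000001") + b"payload"
decision = ControlHeader(egress_card=3, egress_port=17, rewrite_index=0)

on_backplane = prepend(decision, frame)
recovered, out_frame = strip(on_backplane)
print(recovered, out_frame == frame)
```

The fabric and egress chips in between never consult an L2 or L3 table; they only follow the header, which is what removes the consistency races described above.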

Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the port extender, a set of additional ports to be controlled remotely by a chassis switch. The link to the port extender will carry the control headers which would otherwise be restricted to the backplane. Similarly, several vendors sell stackable switches. Each unit in the stack can function as an independent switch, but can be connected via stack ports on the back to function together like a larger switch. The stack ports carry the prepended control headers from one stack member to the next, so the entire collection can function like a single forwarding plane.

QFabric

In the Packet Pushers podcast and in an article on EtherealMind, the Interconnect is described as a Clos network with stages of the Clos implemented in cards in the front and back of the chassis. It is implemented using merchant silicon, not Juniper ASICs. The technology in the edge Node was not specified, it is my assumption that Juniper uses its own silicon there.

Forwarding decisions are made in the Nodes and sent to the Interconnect, which is a pure fabric with no decision making. This would be implemented by having the Nodes send control headers on their uplinks, in a format compatible with whatever merchant silicon is used in the Interconnect plus additional information needed to support the QFabric features. Juniper would not allow themselves to be locked in to a particular chip supplier; I'm sure the QF/Node implementation would be very flexible in how it creates those headers. A new QF/Interconnect with a different chipset would be supportable via a firmware upgrade to the edge nodes.

The QF/Interconnect would in turn forward the packet to its destination with the control header intact. The destination switch would perform whatever handling was indicated in the control information, discard the extra header, and forward the packet out the egress port.

Oversubscribed QF/Nodes

One interesting aspect of the first generation QF/Node is that it is oversubscribed. The QFX3500 has 480 Gbps of downlink capacity, in the form of 48 x 10G ports. It has 160 Gbps of uplink, via 4 x 40Gbps ports. Oversubscribed line cards are not unheard of in modular chassis architectures, though it is generally the result of a follow-on generation of cards outstripping the capacity of the backplane. There have been chassis designs where the line cards were deliberately oversubscribed, but they are somewhat less common.

QFabric has a very impressive system to handle congestion and flow control, which will be the topic of a future article. The oversubscribed uplink is a slightly different variation, but in the end is really another form of congestion for the fabric to deal with. It would buffer what it can, and assert Ethernet flow control and/or some other means of backpressure to the edge ports if necessary.

Next article: link speeds.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Sunday, August 21, 2011

Consistency At All Levels

Twitter now wraps all links passing through the service with the t.co link shortener, which I wondered about a little while ago. Twitter engineers sweat the details; even the HTTP response headers are concise:

$ curl --head http://t.co/WXZtRHC
HTTP/1.1 301 Moved Permanently
Date: Sun, 21 Aug 2011 12:17:43 GMT
Server: hi
Location: http://codingrelic.geekhold.com/2011/07/tweetalytics.html
Cache-Control: private,max-age=300
Expires: Sun, 21 Aug 2011 12:22:43 GMT
Connection: close
Content-Type: text/html; charset=UTF-8

Oh, hi. Um, how are you?

Monday, August 15, 2011

An Awkward Segue to CPU Caching

Last week Andy Firth published The Demise of the Low Level Programmer, expressing dismay over the lack of low level systems knowledge displayed by younger engineers in the console game programming field. Andy's particular concerns deal with proper use of floating versus fixed point numbers, CPU cache behavior and branch prediction, bit manipulation, etc.

I have to admit a certain sympathy for this position. I've focussed on low level issues for much of my career. As I'm not in the games space, the specific topics I would offer differ somewhat: cache coherency with I/O, and page coloring, for example. Nonetheless, I feel a certain solidarity.

Yet I don't recall those topics being taught in school. I had classes which covered operating systems and virtual memory, but distinctly remember being shocked at the complications the first time I encountered a system which mandated page coloring. Similarly though I had a class on assembly programming, by the time I actually needed to work at that level I had to learn new instruction sets and many techniques.

In my experience at least, schools never did teach such topics. This stuff is learned by doing, as part of a project or on the job. The difference now is that fewer programmers are learning it. It's not because programmers are getting worse. I interview a lot of young engineers, and their caliber is as high as I have ever experienced. It is simply that computing has grown a great deal in 20 years: there are a lot more topics available to learn, and frankly the cutting edge stuff has moved on. Even in the gaming space which spurred Andy's original article, big chunks of the market have been completely transformed. Twenty years ago casual gaming meant Game Boy, an environment so constrained that heroic optimization efforts were required. Now casual gaming means web based games on social networks. The relevant skill set has changed.

I'm sure Andy Firth is aware of the changes in the industry. It's simply that we have a tendency to assume that markets where there is a lot of money being made will inevitably attract new engineers, and so there should be a steady supply of new low level programmers for consoles. Unfortunately I don't believe that is true. Markets which are making lots of money don't attract young engineers. Markets which are perceived to be growing do, and other parts of the gaming market are perceived to be growing faster.

Page Coloring

Because I brought it up earlier, we'll conclude with a discussion of page coloring. I am not satisfied with the Wikipedia page, which seems to have paraphrased a FreeBSD writeup describing page coloring as a performance issue. In some CPUs, albeit not current mainstream CPUs, coloring isn't just a performance issue. It is essential for correct operation.

Cache Index

Least significant bits as cache line offset, next few bits as cache index

Before fetching a value from memory the CPU consults its cache. The least significant bits of the desired address are an offset into the cache line, generally 4, 5, or 6 bits for a 16/32/64 byte cache line.

The next few bits of the address are an index to select the cache line. If the cache has 1024 entries, then ten bits would be used as the index. Things get a bit more complicated here due to set associativity, which lets entries occupy several different locations to improve utilization. A two way set associative cache of 1024 entries would take 9 bits from the address and then check two possible locations. A four way set associative cache would use 8 bits. Etc.
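The bit counts above fall out of a couple of logarithms. Here is a minimal sketch, assuming power-of-two sizes throughout; the function names are mine, not taken from any CPU manual.

```python
def log2_exact(n):
    """log2 of a power of two, using integer math only."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1

def line_offset_bits(line_bytes):
    """Bits of the address that select a byte within the cache line."""
    return log2_exact(line_bytes)

def cache_index_bits(entries, ways=1):
    """Bits of the address that select a set. Each set holds `ways` entries."""
    return log2_exact(entries // ways)

for line_bytes in (16, 32, 64):
    print(f"{line_bytes} byte line: {line_offset_bits(line_bytes)} offset bits")

for ways in (1, 2, 4):
    print(f"1024 entries, {ways} way: {cache_index_bits(1024, ways)} index bits")
```

Doubling the associativity halves the number of sets, which removes one bit from the index.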

Page tag

Least significant bits as page offset, upper bits as page tag

Separately, the CPU defines a page size for the virtual memory system. 4 and 8 Kilobytes are common. The least significant bits of the address are the offset within the page, 12 or 13 bits for 4 or 8 K respectively. The most significant bits are a page number, used by the CPU cache as a tag. The hardware fetches the tag of the selected cache lines to check against the upper bits of the desired address. If they match, it is a cache hit and no access to DRAM is needed.

To reiterate: the tag is not the remaining bits of the address above the index and offset. The bits to be used for the tag are determined by the page size, and not directly tied to the details of the CPU cache indexing.

Virtual versus Physical

In the initial few stages of processing the load instruction the CPU has only the virtual address of the desired memory location. It will look up the virtual address in its TLB to get the physical address, but using the virtual address to access the cache is a performance win: the cache lookup can start earlier in the CPU pipeline. It's especially advantageous to use the virtual address for the cache index, as that processing happens earlier.

The tag is almost always taken from the physical address. Virtual tagging complicates shared memory across processes: the same physical page would have to be mapped at the same virtual address in all processes. That is an essentially impossible requirement to put on a VM system. Tag comparison happens later in the CPU pipeline, when the physical address will likely be available anyway, so it is (almost) universally taken from the physical address.

This is where page coloring comes into the picture.

Virtually Indexed, Physically Tagged

From everything described above, the size of the page tag is independent of the size of the cache index and offset. They are separate decisions, and frankly the page size is generally mandated. It is kept the same for all CPUs in a given architectural family even as they vary their cache implementations.

Consider then, the impact of a series of design choices:

32 bit CPU architecture
64 byte cache line: 6 bits of cache line offset
8K page size: 19 bits of page tag, 13 bits of page offset
512 entries in the L1 cache, direct mapped. 9 bits of cache index.
virtual indexing, for a shorter CPU pipeline. Physical tagging.
write back

Virtually indexed, physically tagged, with 2 bits of page color

What does this mean? It means the lower 15 bits of the virtual address and the upper 19 bits of the physical address are referenced while looking up items in the cache. Two of the bits overlap between the virtual and physical addresses. Those two bits are the page color. For proper operation, this CPU requires that all processes which map in a particular page do so at the same color. Though in theory the page could be any color so long as all mappings are the same, in practice the virtual color bits are set the same as the underlying physical page.
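To check the arithmetic, here is a small sketch that derives the bit counts from the design choices listed above. The constant and function names are my own.

```python
ADDRESS_BITS = 32
LINE_BYTES = 64          # cache line size
PAGE_BYTES = 8 * 1024    # 8K pages
CACHE_ENTRIES = 512      # direct mapped L1

def log2_exact(n):
    return n.bit_length() - 1   # valid for powers of two

line_bits = log2_exact(LINE_BYTES)         # 6
index_bits = log2_exact(CACHE_ENTRIES)     # 9
page_offset_bits = log2_exact(PAGE_BYTES)  # 13
tag_bits = ADDRESS_BITS - page_offset_bits # 19

# Index bits which reach above the page offset are the page color.
color_bits = max(0, line_bits + index_bits - page_offset_bits)

def page_color(addr):
    """Color of the page containing addr, virtual or physical."""
    return (addr >> page_offset_bits) & ((1 << color_bits) - 1)

print(f"virtual bits used: {line_bits + index_bits}, tag bits: {tag_bits}, "
      f"color bits: {color_bits}")
print([page_color(page * PAGE_BYTES) for page in range(6)])
```

With two color bits there are four colors, and consecutive pages cycle through them.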

The impact of not enforcing page coloring is dire. A write in one process will be stored in one cache line, a read from another process will access a different cache line.
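The failure is easy to reproduce in a toy model. This is a deliberately simplified sketch of the cache described above: one value per line, no evictions, no line fills. The addresses are made up for illustration.

```python
LINE_BITS, INDEX_BITS, PAGE_BITS = 6, 9, 13   # 64 byte lines, 512 lines, 8K pages

def cache_index(vaddr):
    """The index comes from the virtual address."""
    return (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

def color(addr):
    """The index bits which sit above the page offset."""
    return (addr >> PAGE_BITS) & ((1 << (LINE_BITS + INDEX_BITS - PAGE_BITS)) - 1)

class ToyCache:
    def __init__(self, dram):
        self.dram = dram     # physical address -> value
        self.lines = {}      # cache index -> (physical tag, value)

    def write(self, vaddr, paddr, value):
        # Write back: the new value lives only in the cache for now.
        self.lines[cache_index(vaddr)] = (paddr >> PAGE_BITS, value)

    def read(self, vaddr, paddr):
        line = self.lines.get(cache_index(vaddr))
        if line is not None and line[0] == paddr >> PAGE_BITS:
            return line[1]               # hit
        return self.dram.get(paddr, 0)   # miss: DRAM still holds the old value

PADDR = 0x2000            # one shared physical page, color 1
WRITER_VA = 0x12000       # mapped at color 1
READER_BAD_VA = 0x24000   # mapped at color 2, so it indexes a different line
READER_GOOD_VA = 0x32000  # mapped at color 1, so it indexes the same line

cache = ToyCache({PADDR: 0})
cache.write(WRITER_VA, PADDR, 42)
stale = cache.read(READER_BAD_VA, PADDR)
fresh = cache.read(READER_GOOD_VA, PADDR)
print(color(WRITER_VA), color(READER_BAD_VA), color(READER_GOOD_VA))
print(stale, fresh)
```

The mismatched mapping misses in the cache and reads the old value from DRAM, while the mapping with the matching color sees the write.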

Page coloring like this places quite a burden on the VM system, and one which would be difficult to retrofit into an existing VM implementation. OS developers would push back against new CPUs which proposed to require coloring, and you used to see CPU designs making fairly odd tradeoffs in their L1 cache because of it. HP PA-RISC used a very small (but extremely fast) L1 cache. I think they did this to use direct mapped virtual indexing without needing page coloring. There were CPUs with really insane levels of set associativity in the L1 cache, 8 way or even 16 way. This reduced the number of index bits to the point where a virtual index wouldn't require coloring.

Thursday, July 28, 2011

ARP By Proxy

It started, as things often do nowadays, with a tweet. As part of a discussion of networking-fu I mentioned ProxyARP, and that it was no longer used. Ivan Pepelnjak corrected that it did still have a use. He wrote about it last year. I've tried to understand it, and wrote this post to be able to come back to later to remind myself.

Wayback Machine to 1985

ifconfig eth0 10.0.0.1 netmask 255.255.255.0

That's it, right? You always configure an IP address plus subnet mask. The host will ARP for addresses on its subnet, and send to a router for addresses outside its subnet.

Yet it wasn't always that way. Subnet masks were retrofitted into IPv4 in the early 1980s. Before that there were no subnets. The host would AND the destination address with a class A/B/C mask, and send to the ARPANet for anything outside of its own network. Yes, this means a class A network would expect to have all 16 million hosts on a single Ethernet segment. This seems ludicrous now, but until the early 1980s it wasn't a real-world problem. There just weren't that many hosts at a site. The IPv4 address was widely perceived as being so large as to be infinite, only a small number of addresses would actually be used.
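For those who never had to think in classes, here is a rough sketch of the classful decision a pre-subnet host would make. It is an illustration in modern Python, not a reconstruction of any historical stack; the function names are mine.

```python
import ipaddress

def classful_prefix(addr):
    """Network prefix length implied by the address class."""
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    if first_octet < 128:
        return 8      # class A
    if first_octet < 192:
        return 16     # class B
    if first_octet < 224:
        return 24     # class C
    raise ValueError("class D/E addresses have no classful network")

def same_classful_network(mine, destination):
    """True if a pre-subnet host would ARP directly for destination."""
    prefix = classful_prefix(mine)
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    mine_int = int(ipaddress.IPv4Address(mine))
    dest_int = int(ipaddress.IPv4Address(destination))
    return (mine_int & mask) == (dest_int & mask)

print(same_classful_network("10.0.0.1", "10.200.3.4"))   # same class A network
print(same_classful_network("10.0.0.1", "128.32.0.1"))   # different network
```

Note that the host at 10.0.0.1 considers 10.200.3.4 to be on its own segment, no matter how the site has actually been carved up.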

Aside: in the 1980s the 10.0.0.1 address had a different use than it does now. Back then it was the ARPAnet. It was the way you would send packets around the world. When ARPAnet was decommissioned, the 10.x.x.x address was made available for its modern use by non-globally routed hosts.

Old host does not implement subnets, needs proxy ARP by router

There was a period of several years where subnet masks were gradually implemented by the operating systems of the day. My recollection is that BSD 4.0 did not implement subnets while 4.1 did, but this is probably wrong. In any case, once an organization decided to start using subnets it would need a way to deal with stragglers. The solution was Proxy ARP.

It's easy to detect a host which isn't using subnets: it will ARP for addresses which it shouldn't. The router examines incoming ARPs and, if off-segment, responds with its own MAC address. In effect the router will impersonate the remote system, so that hosts which don't implement subnet masking could still function in a subnetted world. The load on the router was unfortunate, but worthwhile.
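The router's decision can be sketched in a few lines. This is only the core test, assuming the router knows the subnet of the segment the ARP arrived on. The MAC address and function name are hypothetical, and a real implementation would also check that it has a route to the target.

```python
import ipaddress

ROUTER_MAC = "02:00:00:00:00:01"   # made-up, locally administered address

def proxy_arp_reply(target_ip, segment, router_mac=ROUTER_MAC):
    """Return the MAC to answer an ARP request with, or None to stay silent."""
    if ipaddress.ip_address(target_ip) in ipaddress.ip_network(segment):
        return None        # on-segment: the real owner will answer for itself
    return router_mac      # off-segment: impersonate the remote system

print(proxy_arp_reply("10.0.0.7", "10.0.0.0/24"))   # None
print(proxy_arp_reply("10.0.5.9", "10.0.0.0/24"))   # the router's MAC
```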

Proxy ARP Today

That was decades ago. Yet Proxy ARP is still implemented in modern network equipment, and has some modern uses. One such case is in Ethernet access networks.

Subscriber network where each user gets a /30 block

Consider a network using traditional L3 routing: you give each subscriber an IP address on their own IP subnet. You need to have a router address on the same subnet, and you need a broadcast address. Needing 3 IPs per subscriber means a /30. That's 4 IP addresses allocated per customer.
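Python's ipaddress module makes the cost easy to see. The prefixes below come from the documentation range and are purely illustrative.

```python
import ipaddress

subnet = ipaddress.ip_network("192.0.2.8/30")
all_addresses = list(subnet)      # everything the /30 consumes
usable = list(subnet.hosts())     # what is left for the router and subscriber

print("network:  ", subnet.network_address)
print("usable:   ", [str(a) for a in usable])
print("broadcast:", subnet.broadcast_address)

# A /24 carved into /30 blocks serves far fewer subscribers than it has addresses.
pool = ipaddress.ip_network("192.0.2.0/24")
subscribers = len(list(pool.subnets(new_prefix=30)))
print(f"{pool.num_addresses} addresses serve {subscribers} subscribers")
```

Of the two usable addresses in each /30, one goes to the router, leaving a single address for the subscriber.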

There are some real advantages to giving each subscriber a separate subnet and requiring that all communication go through a router. Security is one, not allowing malware to spread from one subscriber to another without the service provider seeing it. Yet burning 4 IP addresses for each customer is painful.

Subscriber network using a /24 for all subscribers on the switch

To improve the utilization of IP addresses, we might configure the access gear to switch at L2 between subscribers on the same box. Now we only allocate one IP address per subscriber instead of four, but we expose all other subscribers in that L2 domain to potentially malicious traffic which the service provider cannot police.

We also end up with an inflexible network topology: it becomes arduous to change subnet allocations, because subscriber machines know how big the subnets are. As DHCP leases expire the customer systems should eventually learn of a new mask, but people sometimes do weird things with their configuration.

A final option relies on proxy ARP to decouple the subscriber's notion of the netmask from the real network topology. I'm basing this diagram on a comment by troyand on ioshints.com. Each subscriber is allocated a vlan by the distribution switch. The vlans themselves are unnumbered: no IP address. The subscriber is handed an IP address and netmask by DHCP, but the subscriber's netmask doesn't correspond to the actual network topology. They might be given a /16, but that doesn't mean sixty four thousand other subscribers are on the segment with them. The router uses Proxy ARP to catch attempts by the subscriber to communicate with nearby addresses.

This lets service providers get the best of both worlds: communication between subscribers goes through the service provider's equipment so it can enforce security policies, but only one IPv4 address per subscriber.

Saturday, July 23, 2011

Tweetalytics

Until this week I thought Twitter would focus on datamining the tweetstream rather than adding features for individual users. I based this in part on mentions by Fred Wilson of work by Twitter on analytics. I've been watching for evidence of changes I expected to be made in the service, intending to write about it if they appeared.

Earlier this week came news of a shakeup in product management at Twitter. Jack Dorsey seems much more focussed on user-visible aspects of the service, and I'm less convinced that backend analytics will be a priority now. Therefore I'm just going to write about the things I'd been watching for.

To reiterate: these are not things Twitter currently does, nor do I know they're looking at it. These are things which seemed logical, and would be visible outside the service.

Wrap all links: URLs passing through the firehose can be identified, but knowing what gets clicked is valuable. The twitter.com web client already wraps all URLs using t.co, regardless of their length. Taking the next step to shorten every non-t.co link passing through the system would be a way to get click data on everything. There is a downside in added latency to contact the shortener, but that is a product tradeoff to be made.

Unique t.co per retweet: There is already good visibility into how tweets spread through the system, by tracking new-style retweets and URL search for manual RTs. What is not currently visible is the point of egress from the service: which retweet actually gets clicked on. This can be useful if trying to measure a user's influence. An approximation can be made by looking at the number of followers, but that breaks down when retweeters have a similar number of followers. Instead, each retweet could generate a new t.co entry. The specific egress point would be known because each would have a unique URL.

Tracking beyond tweets: t.co tracks the first click. Once the link is expanded, there is no visibility into what happens. Tracking its spread once it leaves the service would require work with the individual sites, likely only practical for the top sites passing through the tweetstream. Tracking information could be automatically added to URLs before shortening, in a format suitable for the site's analytics. For example a utm_medium=tweet parameter could be added to the original URL. There might be some user displeasure at having the URL modified, which would have to be taken into account.
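Adding a tracking parameter before shortening is a small transformation. Here is a sketch using the standard library. The function name is mine, and this is not anything Twitter is known to do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_tracking(url, medium="tweet"):
    """Append utm_medium to a URL unless the publisher already set one."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "utm_medium" for key, _ in query):
        query.append(("utm_medium", medium))
    return urlunsplit(parts._replace(query=urlencode(query)))

print(add_tracking("http://example.com/post.html"))
print(add_tracking("http://example.com/a?x=1#top"))
```

Leaving an existing utm_medium alone is one way to respect a publisher's own campaign tagging. Overwriting it would be the other defensible choice.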

Each of these adds more information to be datamined by publishers. They don't result in user-visible features, and I suspect that as of a couple days ago user-visible features became a far higher priority.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Social label.

Monday, July 18, 2011

Python and XML Schemas

My current project relies on a large number of XML Schema definition files. There are 1,600 types defined in various schemas, with actions for each type to be implemented as part of the project. A previous article examined CodeSynthesis XSD for C++ code generation from an XML Schema. This time we'll examine two packages for Python, GenerateDS and PyXB. Both were chosen based on their ability to feature prominently in search results.

In this article we'll work with the following schema and input data, the same used in the previous C++ discussion. It is my HR database of minions, for use when I become the Evil Overlord.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>


<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="schema.xsd" loyalty="0.2">
  <name>Agent Smith</name>
  <rank>Member of Minion Staff</rank>
  <serial>2</serial>
</minion>

The Python ElementTree can handle XML documents, so why generate code at all? One reason is simple readability.

Generated Code	ElementTree
m.name	m.find("name").text

A more subtle reason is to catch errors earlier. Because working with the underlying XML relies on passing in the node name as a string, a typo or misunderstanding of the XML schema will result in not finding the desired element and/or an exception. This is what unit tests are supposed to catch, but as the same developer implements the code and the unit test it is unlikely to catch a misinterpretation of the schema. With generated code, we can use static analysis tools like pylint to catch errors.
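To make the failure mode concrete, here is a small ElementTree sketch using the minion document from above, with the schema location attributes dropped for brevity. The misspelled node name is deliberate.

```python
import xml.etree.ElementTree as ET

MINION_XML = """<minion loyalty="0.2">
  <name>Agent Smith</name>
  <rank>Member of Minion Staff</rank>
  <serial>2</serial>
</minion>"""

m = ET.fromstring(MINION_XML)
name = m.find("name").text      # works
typo = m.find("nmae")           # misspelled: find() quietly returns None

try:
    typo.text
    failure = None
except AttributeError as e:
    failure = str(e)            # only discovered when this line actually runs

print(name)
print(typo, failure)
```

Nothing about the string "nmae" is wrong as far as Python or a static checker is concerned, so the mistake only surfaces at runtime.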

GenerateDS

The generateDS python script processes the XML schema:

python generateDS.py -o minion.py -s minionsubs.py minion.xsd

The generated code is in minion.py, while minionsubs.py contains an empty class definition for a subclass of minion. The generated class uses ElementTree for XML support, which is in the standard library in recent versions of Python. The minion class has properties for each node and attribute defined in the XSD. In our example this includes name, rank, serial, and loyalty.

# Assumes the generateDS output (minion.py above) was saved as minion_generateds.py
import minion_generateds as minion

if __name__ == '__main__':
  m = minion.parse("minion.xml")
  print('%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty))

PyXB

The pyxbgen utility processes the XML schema:

pyxbgen -u minion.xsd -m minion

The generated code is in minion.py. The PyXB file is only 106 lines long, compared with 548 lines for GenerateDS. This doesn't tell the whole story, as the PyXB generated code imports the pyxb module where the generateDS code only depends on system modules. The pyxb package has to be pushed to production.

Very much like generateDS, the PyXB class has properties for each node and attribute defined in the XSD.

# Assumes the pyxbgen output (minion.py above) was saved as minion_pyxb.py
import minion_pyxb as minion

if __name__ == '__main__':
  with open('minion.xml') as f:
    xml = f.read()
  m = minion.CreateFromDocument(xml)
  print('%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty))

Pylint results

A primary reason for this exercise is to catch XML-related errors at build time, rather than exceptions in production. I don't believe unit tests are an effective way to verify that a developer has understood the XML schema.

To test this, a bogus 'm.fooberry' property reference was added to both test programs. pylint properly flagged a warning for the generateDS code.

E: 15: Instance of 'minion' has no 'fooberry' member (but some types could not be inferred)

pylint did not flag the error in the PyXB test code. I believe this is because PyXB doesn't name the generated class minion; instead it is named CTD_ANON, with a runtime binding within its framework to "minion." pylint is doing a purely static analysis, and this kind of arrangement is beyond its ken.

class CTD_ANON (pyxb.binding.basis.complexTypeDefinition):
  ...

minion = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace,
           u'minion'), CTD_ANON)

Conclusion

As a primary goal of this effort is error detection via static analysis, we'll go with generateDS.

Saturday, July 16, 2011

Billions and Billions

In March, 2010 there were 50 million tweets per day.

In March, 2011 there were 140 million tweets per day.

In May, 2011 there were 155 million tweets per day.

Yesterday, apparently, there were 350 billion tweets per day.

350 million tweets/day would have been an astonishing 2.25x growth in just two months, where previously tweet volume has been increasing by 3x per year. 350 billion tweets/day is an unbelievable 2258x growth in just two months.

Quite unbelievable. In fact, I don't believe it.

350 billion tweets per day means about 4 million tweets per second. With metadata, each tweet is about 2500 bytes uncompressed. In May 2011 the Tweet firehose was still sent uncompressed, as not all consumers were ready for compression. 4 million tweets per second at 2500 bytes each works out to 80 Gigabits per second. Though it's possible to build networks that fast, I'll assert without proof that it is not possible to build them in two months. Even assuming good compression is now used to get it down to ~200 bytes/tweet, that still works out to an average of 6.4 Gigabits per second. Peak tweet volumes are about 4x average, which means the peak would be 25 Gigabits per second. 25 Gigabits per second is a lot for modern servers to handle.
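For anyone who wants to check the back-of-envelope math, here it is as a short script. It uses the figures quoted above without rounding the rate to 4 million, so the results come out slightly higher than the numbers in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60

growth_if_million = 350e6 / 155e6   # versus May 2011
growth_if_billion = 350e9 / 155e6

tweets_per_second = 350e9 / SECONDS_PER_DAY

def gigabits_per_second(tweets_per_sec, bytes_per_tweet):
    return tweets_per_sec * bytes_per_tweet * 8 / 1e9

uncompressed = gigabits_per_second(tweets_per_second, 2500)
compressed = gigabits_per_second(tweets_per_second, 200)
peak = 4 * compressed               # peak volume is about 4x the average

print(f"growth: {growth_if_million:.2f}x or {growth_if_billion:.0f}x")
print(f"{tweets_per_second / 1e6:.2f} million tweets/sec")
print(f"{uncompressed:.0f} Gbps uncompressed, {compressed:.1f} Gbps compressed, "
      f"{peak:.0f} Gbps at peak")
```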

I think TwitterEng meant to say 350 million tweets per day. That's still a breathtaking growth in the volume of data in just two months, and Twitter should be congratulated for operating the service so smoothly in the face of that growth.

Update: Daniel White and Atul Arora both noted that yesterday's tweet claimed 350 billion tweets delivered per day, where previous announcements have only discussed tweets per day. That probably means 350 billion recipients per day, or the number of tweets times the average fanout.

Update 2: In an interview on July 19, 2011 Twitter CEO Dick Costolo said 1 billion tweets are sent every 5 days, or 200 million tweets per day. This is more in line with previous growth rates.

Wednesday, July 13, 2011

Essayists and Orators

Recently Kevin Rose redirected his eponymous domain to his Google+ profile, reflecting that "G+ gives me more (real-time) feedback and engagement than my blog ever did." Earlier this year Steve Rubel deleted thousands of blog posts from older TypePad and Posterous sites, and started afresh on Tumblr.

Moving the center of one's online presence to "where the action is" is not a new phenomenon. In 2008 Robert Scoble essentially abandoned his own sites in order to spend time on Friendfeed, the hot new social networking site at that time. Techcrunch even attempted an intervention over the move. After the Facebook acquisition of FriendFeed the site gradually decayed through benign neglect. Scobleizer moved on long ago.

Why do this? Surely it's better to own your own domain and control your destiny? Or is it?

Essayists And Orators

In this discussion we'll focus on people who are online for more than just casual interaction or journaling, who have specific goals they are trying to accomplish with their online presence.

Essayists publish thoughtful prose, focussed on a particular topic. Presentation and style is important, but generally secondary to the density of ideas within. The product of their labor comes slowly, and is intended to stand for considerable time.

Orators can also deliver thoughtful ideas and spend considerable time preparing for it, but the dynamics are very different. The pace is faster, the interaction more frequent with less time to consider. The delivery and ideas can be adjusted over time, with each new presentation.

Translated to their online equivalents, I think we can still recognize the Essayist and Orator archetypes based on what they want people to find when they search. The world is a larger place now, when we want to know something outside of our knowledge we search for it.

For an Essayist, the desired result is a post with thoughts on the topic, linked to their name. For an Orator, the desired result is a conclusion that the orator is knowledgeable about the topic.

For an Essayist it's important to keep material available for people to find, and in a form which links back to the author. Considerable effort has been spent to provide value up front. If someone needs more they can contact the author, who can provide additional help freely or with suitable compensation. Hosting on one's own site allows the linking of authorship to original material, and provides a stable contact point.

For an Orator, it's more important that people find the author's name as someone knowledgeable about the topic. An Orator seeks contact much earlier in the process than an Essayist. They want a followup search to be for their name, to find out how to contact them. This desire for contact earlier in the process implies that the Orator will interact freely on many topics. At some point, if the searcher becomes convinced they can benefit from the Orator's expertise, they may discuss terms for further help.

For an Orator, it's less important to have a stable presence online. The desired result is for someone to seek them out personally, and even if they move from one site to another search engines can be depended on to find their most recent incarnation.

I suspect this categorization paints with too broad a brush, as no one corresponds exactly to either archetype, but I'm finding it useful to consider.

Pictures of Abraham Lincoln and Frederick Douglass.

Lincoln and Douglass pictures courtesy Wikimedia Commons. Both are in the public domain in the United States.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Social label.

Tuesday, July 12, 2011

CodeSynthesis XSD Data Binding

Nowadays I make a habit of writing up how to use particular tools or techniques for anything which might be useful to reference later. Many techniques I worked on before starting this practice are now lost to me, locked away in proprietary source code at some previous employer.

This post concerns data binding from XML schemas in C++, generating classes rather than manipulating the underlying XML. As it's written for Future Me, it might not be so interesting to those who are not Future Me.

Consider the simple XML schema shown below. I aspire to be the Evil Overlord, and am working on the HR system to keep track of my innumerable minions.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>

It would be possible to parse documents created from this schema manually, using something like libexpat or Xerces. Unfortunately as the schema becomes large, the likelihood of mistakes in this manual process becomes overwhelming.

I chose instead to work with CodeSynthesis XSD to generate classes from the schema, based mainly on the Free/Libre Open Source Software Exception in their license. This project will eventually be released under an Apache-style license, and all other data binding solutions I found for C++ were either GPL or a commercial license.

Parsing from XML

The generated code provides a number of function prototypes to parse XML from various sources, including iostreams.

std::istringstream agent_smith(
  "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>"
  "<minion xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
  "xsi:noNamespaceSchemaLocation=\"schema.xsd\" loyalty=\"0.2\">"
  "<name>Agent Smith</name>"
  "<rank>Member of Minion Staff</rank>"
  "<serial>2</serial>"
  "</minion>");
std::auto_ptr<minion> m(NULL);

try {
  m = minion_(agent_smith);
} catch (const xml_schema::exception& e) {
  std::cerr << e << std::endl;
  return;
}

The minion object now contains data members with proper C++ types for each XML node and attribute.

std::cout << "Name: " << m->name() << std::endl
          << "Loyalty: " << m->loyalty() << std::endl
          << "Rank: " << m->rank() << std::endl
          << "Serial number: " << m->serial() << std::endl;

Serialization to XML

Methods to serialize an object to XML are not generated by default; the --generate-serialization flag has to be passed to xsdcxx. This emits another series of minion_ methods, which take output arguments.

int main() {
  minion m("Salacious Crumb", "Senior Lackey", 1, 0.9);
  minion_(std::cout, m);
}

This sends the XML to stdout.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion loyalty="0.9">
  <name>Salacious Crumb</name>
  <rank>Senior Lackey</rank>
  <serial>1</serial>
</minion>

CodeSynthesis relies on Xerces-C++ to provide the lower layer XML handling, so all of the functionality of that library is also available to the application.

That's enough for now. See you later, Future Me.