Friday, August 26, 2011

QFabric Conclusion

This is the fourth and final article in a series exploring the Juniper QFabric. Earlier articles provided an overview, a discussion of link speed, and musings on flow control. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article attempts to cover a few loose ends, and wraps up the series.

As with previous days, the flow control post sparked an interesting discussion on Google+.


Whither Protocols?

Director connected to edge node and peered with another switch

For a number of years switch and router manufacturers competed on protocol support, implementing various extensions to OSPF/BGP/Spanning Tree/etc. in their software. QFabric is almost completely silent about protocols. In part this is a marketing philosophy: positioning the QFabric as a distributed switch instead of a network means that the protocols running within the fabric are an implementation detail, not something to talk about. I don't know what protocols are run between the nodes of the QFabric, but I'm sure it's not Spanning Tree and OSPF.

Yet QFabric will need to connect to other network elements at its edge, where the datacenter connects to the outside world. Presumably the routing protocols it needs are implemented in the QF/Director and piped over to whichever switch ports connect to the rest of the network. If there are multiple peering points, they all need to be handled by the same entity, sharing a common routing information base.


Flooding Frowned Upon

The edge Nodes have an L2 table holding 96K MAC addresses. This reinforces the notion that switching decisions are made at the ingress edge: every Node can know how to reach the destination MAC addresses at every port. There are a few options for distributing MAC address information to all of the nodes, but I suspect that flooding unknown addresses to all ports is not the preferred mechanism. If flooding is allowed at all, it would be carefully controlled.

Much of modern datacenter design revolves around virtualization. The VMWare vCenter (or equivalent) is a single, authoritative source of topology information for virtual servers. By hooking to the VM management system, the QFabric Director could know the expected port and VLAN for each server MAC address. The Node L2 tables could be pre-populated accordingly.

By hooking to the VM management console QFabric could also coordinate VLANs, flow control settings, and other network settings with the virtual switches running in software.
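The details of such an integration are anyone's guess, but the flavor of the idea is easy to sketch. This is purely hypothetical: the inventory format and field names below are made up for illustration.

def build_l2_entries(vm_inventory):
    """vm_inventory: iterable of dicts like
       {"mac": "52:54:00:ab:cd:01", "vlan": 10, "node": "node-3", "port": 17}"""
    entries = {}  # (vlan, mac) -> (egress node, egress port)
    for vm in vm_inventory:
        entries[(vm["vlan"], vm["mac"].lower())] = (vm["node"], vm["port"])
    return entries

# The Director could push these entries to every edge Node, letting the ingress
# Node pick the egress Node and port directly instead of flooding unknowns.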


NetOps Force Multiplier

Where previously network engineers would be configuring dozens of switches, QFabric now proposes to manage a single distributed switch. Done well, this should be a substantial time saver. There will of course be cases where the abstraction leaks and the individual Nodes have to be dealt with. The failure modes in a distributed switch are simply different: it's unlikely that a single line card within a chassis will unexpectedly lose power, but it's almost certain that Nodes occasionally will. Nonetheless, the cost to operate QFabric seems promising.


Conclusion

QFabric is an impressive piece of work, clearly the result of several years' effort. Though the Interconnects use merchant silicon, Juniper almost certainly started working with the manufacturer at the start of the project to ensure the chip would meet their needs.

The most interesting part of QFabric is its flow control mechanism, for which Juniper has made some pretty stunning claims. A flow control mechanism with fairness, no packet loss, and quick reaction to changes over such a large topology is an impressive feat.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Thursday, August 25, 2011

QFabric: Flow Control Considered Harmful

This is the third in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on flow control.

Yesterday's post sparked an excellent discussion on Google+, with disagreement about overclocking of backplane links and a suggestion that LACP hashing is the reason the uplinks must run at 40 Gbps. Ken Duda is a smart guy.


XON/XOFF

Ethernet includes (optional) link level flow control. An Ethernet device can signal its neighbor to stop sending traffic, requiring the neighbor to buffer packets until told it can resume sending. This flow control mechanism is universally implemented in the MAC hardware, not requiring software intervention on every flow control event.
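For reference, the PAUSE frame itself is tiny. Here is a sketch of one built by hand; the source MAC is a made-up example value, and in practice the MAC hardware generates these frames, not software.

import struct

MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address

def pause_frame(src_mac_hex, quanta):
    # quanta is the pause time in units of 512 bit times; 0 means "resume sending"
    header = PAUSE_DST + bytes.fromhex(src_mac_hex) + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    body = struct.pack("!HH", PAUSE_OPCODE, quanta)
    return (header + body).ljust(60, b"\x00")  # pad to minimum frame size; the MAC appends the FCS

xoff = pause_frame("001122334455", 0xFFFF)  # stop sending for the maximum time
xon  = pause_frame("001122334455", 0x0000)  # resume immediately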

Switch chip with one congested port flow controlling the entire link

The flow control is for the whole link. For a switch with multiple downstream ports, there is no way to signal back to the sender that only some of the ports are congested. In this diagram, a single congested port requires the upstream to be flow controlled, even though the other port could accept more packets. Ethernet flow control suffers from head of line blocking: a single congested port will choke off traffic to other, uncongested ports.

IEEE has tried to address this in two ways. Priority-based flow control, defined in IEEE 802.1Qbb, allows each of the eight classes of service to be flow controlled independently. Bulk data traffic would no longer block VoIP, for example. IEEE 802.1Qau is defining a congestion notification capability to send an explicit notification to compatible end stations, asking them to slow down. Per-priority pause is already appearing in some products. Congestion notification involves NICs and client operating systems, and is slower going.


QFabric

Juniper's whitepaper on QFabric has this to say about congestion:

Finally, the only apparent congestion is due to the limited bandwidth of ingress and egress interfaces and any congestion of egress interfaces does not affect ingress interfaces sending to non-congested interfaces; this noninterference is referred to as “non-blocking”

QFabric very clearly does not rely on Ethernet flow control for the links between edge nodes and interconnect, not even per-priority. It does something else, something which can reflect the queue state of an egress port on the far side of the fabric back to control the ingress. However, Juniper has said nothing that I can find about how this works. They rightly consider it part of their competitive advantage.

So... let's make stuff up. How might this be done?

Juniper has said that the Interconnect uses merchant switch silicon, but hasn't said anything about the edge nodes. As all the interesting stuff is done at the edge and Juniper has a substantial ASIC development capability, I'd assume they are using their own silicon there. Whatever mechanism they use for flow control would be implemented in the Nodes.

Most of the mechanisms I can think of require substantial buffering. QFX3500, the first QFabric edge node, has 4x40 Gbps ports connecting to the Interconnect. That is 20 gigabytes per second. A substantial bank of SRAM could buffer that traffic for a fraction of a second. For example, 256 Megabytes could absorb a bit over 13 milliseconds worth of traffic. 13 milliseconds provides a lot of time for real-time software to swing into action.
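A quick sanity check of that arithmetic, using nothing but the numbers above:

uplink_bytes_per_sec = 4 * 40e9 / 8      # 4 x 40 Gbps = 20 gigabytes per second
buffer_bytes = 256 * 2**20               # 256 Megabytes of SRAM
print(buffer_bytes / uplink_bytes_per_sec * 1000)   # ~13.4 milliseconds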

For example, here are a couple of ways the Nodes might use that buffer:

  • The memory could be used for egress buffering at the edge ports. Each Node could report its buffer status to all other nodes in the QFabric, several thousand times per second. Adding a few thousand packets per second from each switch is an insignificant load on the fabric. From there we can imagine a distributed flow control mechanism, which several thousand times per second would re-evaluate how much traffic it is allowed to send to each remote node in the fabric. Ethernet flow control frames could be sent to the sending edge port to slow it down.

    Or rather, smarter people than me can imagine how to construct a distributed flow control mechanism. My brain hurts just thinking about it.
  • Hardware counters could track the rate each Node is receiving traffic from each other Node, and software could report this several times per second. Part of the memory could be used as uplink buffering, with traffic shapers controlling the rate sent to each remote Node. Software could adjust the traffic shaping several thousand times per second to achieve fairness (a toy sketch of this idea follows the list).

    Substantial uplink buffering also helps with oversubscribed uplinks. QFX3500 has 3:1 oversubscription.
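To make the second idea concrete, here is a toy sketch of what that periodic software adjustment might look like. To be clear, this is entirely my invention; the structure, the numbers, and the policy are all made up.

UPLINK_BPS = 160e9  # QFX3500: 4 x 40 Gbps toward the Interconnect

def rebalance(shapers, reports):
    # shapers: our current allowed rate toward each remote Node, in bps
    # reports: per remote Node, (total arrival rate seen there, its egress capacity)
    for node, (arrival_bps, egress_bps) in reports.items():
        if arrival_bps > egress_bps:
            # Remote egress is congested: scale down toward our proportional
            # share of what that egress can actually drain.
            share = egress_bps * shapers[node] / max(arrival_bps, 1.0)
            shapers[node] = max(share, 0.001 * UPLINK_BPS)  # always keep a trickle flowing
        else:
            # Headroom at the far end: ramp back up gently.
            shapers[node] = min(shapers[node] * 1.05, UPLINK_BPS)
    return shapers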

I'll reiterate that the bullet points above are not how QFabric actually works; I made them up. Juniper isn't revealing how QFabric flow control works. If it achieves the results claimed it is a massive competitive advantage. I'm listing these to illustrate a point: a company willing to design and support its own switch silicon can take a wholly different approach from the rest of the industry. In the best case, they end up with an advantage for several years. In the worst case, they don't keep up with merchant silicon. I've been down both of those paths; the second is not fun.

There are precedents for the kind of systems described above: InfiniBand reports queue information to neighboring IB nodes, and from this works out a flow control policy. InfiniBand was designed in an earlier age when software wouldn't have been able to handle the strict timing requirements, so it is defined in terms of hardware token buckets.

One thing I am sure of is that the QF/Director does not get involved in flow control. That would be a scalability and reliability problem. Juniper stated that the QFabric will continue to forward traffic even if the Director nodes have failed, so the flow control mechanism cannot rely on the Directors.

Next article: wrap-up.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Wednesday, August 24, 2011

The Backplane Goes To Eleven

This is the second in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on link speeds.


How Fast is Fast?

Ethernet link speeds are rigidly specified by the IEEE 802.3 standards, for example at 1/10/40/100/etc Gbps. It is important that Ethernet implementations from different vendors be able to interoperate.

Backplane links face fewer constraints, as there is little requirement for interoperability between implementations. Even if one wanted to, it isn't possible to plug a card from one vendor into another vendor's chassis; they simply don't fit. Therefore backplane links within a chassis have been free to tweak their link speeds for better performance. In the 10 Gbps generation of products backplane links have typically run at 12.5 Gbps. In the 40 Gbps Ethernet generation I'd expect 45 or 50 Gbps backplane links (I don't really know; I no longer work in that space). A well-designed SERDES for a particular speed will have a bit of headroom, enabling faster operation over high quality links like a backplane. Running them faster tends to be possible without heroic effort.

Switch chip with 48 x 1 Gbps downlinks and 4 x 10/12.5 Gbps uplinks

A common spec for switch silicon in the 1 Gbps generation is 48 x 1 Gbps ports plus 4 x 10 Gbps. Depending on the product requirements, the 10 Gbps ports can be used for server attachment or as uplinks to build into a larger switch. At first glance the chassis application appears to be somewhat oversubscribed, with 48 Gbps of downlink but only 40 Gbps of uplink. In reality, when used in a chassis the uplink ports will run at 12.5 Gbps to provide 50 Gbps of uplink bandwidth.

Though improved throughput and capacity is a big reason for running the backplane links faster, there is a more subtle benefit as well. Ethernet links are specified to run at a nominal clock frequency, modulo a tolerance measured in parts per million. The crystals driving the links can be slightly on the high or low side of the desired frequency yet still be within spec. If the backplane links happen to run slightly below nominal while the front panel links run slightly above, the result would be occasional packet loss simply because the backplane cannot keep up. When a customer notices packet loss during stress tests in their lab, it is very, very difficult to convince them it is expected and acceptable. Running the backplane links at a faster rate avoids the problem entirely: the backplane can always absorb line rate traffic from a front panel port.
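As a rough illustration, assume the crystals may be off by 100 parts per million in either direction (a figure I'm assuming for this sketch, not a number quoted anywhere in this post):

nominal_bps = 10e9
front_panel_bps = nominal_bps * (1 + 100e-6)  # front panel crystal runs fast
backplane_bps = nominal_bps * (1 - 100e-6)    # backplane crystal runs slow
excess_bps = front_panel_bps - backplane_bps  # 2 Mbps the backplane cannot keep up with
print(excess_bps / 8 / 1500)                  # ~166 full-size frames per second to buffer or drop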


QFabric

This is one area where QFabric takes a different tack from a modular chassis: I doubt the links between QF/Node and Interconnect are overclocked even slightly. In QFabric those links are not board traces but lasers, and lasers are pickier about their data rate. In the Packet Pushers podcast the Juniper folks stated that the uplinks are required to be faster than the downlinks. That is, if the edge connections are 10 Gbps then the uplinks from Node to Interconnect have to be 40 Gbps. One reason for this is to avoid overruns due to clock frequency variance, and to ensure the Interconnect can always absorb the link rate from an edge port without resorting to flow control.

Next article: flow control.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Tuesday, August 23, 2011

Making Stuff Up About QFabric

This is the first of several articles about the Juniper QFabric. I have not been briefed by Juniper, nor received any NDA information. These articles are written based on Juniper's public statements and materials available on the web, supplemented with details from a special Packet Pushers podcast, and topped off with a healthy amount of speculation and guessing about how it works.

Juniper QFabric with Nodes, Interconnect, and Director

QFabric consists of edge nodes wired to two or four extremely large QF/Interconnect chassis, all managed via out of band links to QF/Directors. Juniper emphasizes that the collection of nodes, interconnects, and directors should be thought of as a single distributed switch rather than as a network. Packet handling within the QFabric is intended to be opaque, and the individual nodes are not separately configured. It is supposed to behave like one enormous, geographically distributed switch.

Therefore, to brainstorm about how the distributed QFabric works, we should think in terms of how a modular switch works and how its functions might be distributed.


Exactly One Forwarding Decision

Ingress line card with switch fabric, connected to central supervisor with fabric, connected to egress line card with fabric

Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, the most troublesome of which are consistency issues. There is a delay between when a packet is processed by the ingress line card and when it reaches the fabric, and again between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and it's very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.

Control header prepended to frame

Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the Ethernet preamble.
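As a purely hypothetical illustration of the concept (real fabric header formats are proprietary and vary by vendor), such a header might carry little more than the egress port and queue chosen by the ingress decision:

import struct

# A made-up, minimal fabric header: 16-bit egress port, 3-bit priority, and a
# drop-eligible flag. Real headers carry much more and are not public.
def prepend_header(frame, egress_port, priority, drop_eligible=False):
    flags = (priority & 0x07) | (0x08 if drop_eligible else 0x00)
    return struct.pack("!HB", egress_port, flags) + frame

def strip_header(data):
    egress_port, flags = struct.unpack("!HB", data[:3])
    return egress_port, flags & 0x07, bool(flags & 0x08), data[3:]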

Control header prepended to frame

Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the port extender, a set of additional ports to be controlled remotely by a chassis switch. The link to the port extender will carry the control headers which would otherwise be restricted to the backplane. Similarly, several vendors sell stackable switches. Each unit in the stack can function as an independent switch, but can be connected via stack ports on the back to function together like a larger switch. The stack ports carry the prepended control headers from one stack member to the next, so the entire collection can function like a single forwarding plane.


QFabric

In the Packet Pushers podcast and in an article on EtherealMind, the Interconnect is described as a Clos network, with stages of the Clos implemented in cards in the front and back of the chassis. It is implemented using merchant silicon, not Juniper ASICs. The technology in the edge Node was not specified; my assumption is that Juniper uses its own silicon there.

Forwarding decisions are made in the Nodes and sent to the Interconnect, which is a pure fabric with no decision making. This would be implemented by having the Nodes send control headers on their uplinks, in a format compatible with whatever merchant silicon is used in the Interconnect, plus additional information needed to support the QFabric features. Juniper would not allow themselves to be locked in to a particular chip supplier, so I'm sure the QF/Node implementation would be very flexible in how it creates those headers. A new QF/Interconnect with a different chipset would be supportable via a firmware upgrade to the edge nodes.

The QF/Interconnect would in turn forward the packet to its destination with the control header intact. The destination switch would perform whatever handling was indicated in the control information, discard the extra header, and forward the packet out the egress port.


Oversubscribed QF/Nodes

One interesting aspect of the first generation QF/Node is that it is oversubscribed. The QFX3500 has 480 Gbps of downlink capacity, in the form of 48 x 10G ports. It has 160 Gbps of uplink, via 4 x 40 Gbps ports. Oversubscribed line cards are not unheard of in modular chassis architectures, though oversubscription is generally the result of a follow-on generation of cards outstripping the capacity of the backplane. There have been chassis designs where the line cards were deliberately oversubscribed, but they are somewhat less common.

QFabric has a very impressive system to handle congestion and flow control, which will be the topic of a future article. The oversubscribed uplink is a slightly different variation, but in the end it is really just another form of congestion for the fabric to deal with. The Node would buffer what it can, and assert Ethernet flow control and/or some other means of backpressure to the edge ports if necessary.

Next article: link speeds.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Sunday, August 21, 2011

Consistency At All Levels

Twitter now wraps all links passing through the service with the t.co link shortener, which I wondered about a little while ago. Twitter engineers sweat the details; even the HTTP response headers are concise:

$ curl --head http://t.co/WXZtRHC
HTTP/1.1 301 Moved Permanently
Date: Sun, 21 Aug 2011 12:17:43 GMT
Server: hi
Location: http://codingrelic.geekhold.com/2011/07/tweetalytics.html
Cache-Control: private,max-age=300
Expires: Sun, 21 Aug 2011 12:22:43 GMT
Connection: close
Content-Type: text/html; charset=UTF-8

Oh, hi. Um, how are you?


Monday, August 15, 2011

An Awkward Segue to CPU Caching

Last week Andy Firth published The Demise of the Low Level Programmer, expressing dismay over the lack of low level systems knowledge displayed by younger engineers in the console game programming field. Andy's particular concerns deal with proper use of floating versus fixed point numbers, CPU cache behavior and branch prediction, bit manipulation, etc.

I have to admit a certain sympathy for this position. I've focussed on low level issues for much of my career. As I'm not in the games space, the specific topics I would offer differ somewhat: cache coherency with I/O, and page coloring, for example. Nonetheless, I feel a certain solidarity.

Yet I don't recall those topics being taught in school. I had classes which covered operating systems and virtual memory, but I distinctly remember being shocked at the complications the first time I encountered a system which mandated page coloring. Similarly, though I had a class on assembly programming, by the time I actually needed to work at that level I had to learn new instruction sets and many new techniques.

In my experience at least, schools never did teach such topics. This stuff is learned by doing, as part of a project or on the job. The difference now is that fewer programmers are learning it. It's not because programmers are getting worse: I interview a lot of young engineers, and their caliber is as high as I have ever experienced. It is simply that computing has grown a great deal in 20 years, there are a lot more topics available to learn, and frankly the cutting edge stuff has moved on. Even in the gaming space which spurred Andy's original article, big chunks of the market have been completely transformed. Twenty years ago casual gaming meant the Game Boy, an environment so constrained that heroic optimization efforts were required. Now casual gaming means web-based games on social networks. The relevant skill set has changed.

I'm sure Andy Firth is aware of the changes in the industry. It's simply that we have a tendency to assume that markets where there is a lot of money being made will inevitably attract new engineers, and so there should be a steady supply of new low level programmers for consoles. Unfortunately I don't believe that is true. Markets which are making lots of money don't attract young engineers; markets which are perceived to be growing do, and other parts of the gaming market are perceived to be growing faster.


 

Page Coloring

Least significant bits as cache line offset, next few bits as cache index

Because I brought it up earlier, we'll conclude with a discussion of page coloring. I am not satisfied with the Wikipedia page, which seems to have paraphrased a FreeBSD writeup describing page coloring as a performance issue. In some CPUs, albeit not current mainstream CPUs, coloring isn't just a performance issue. It is essential for correct operation.


Cache Index

Least significant bits as cache line offset, next few bits as cache index

Before fetching a value from memory the CPU consults its cache. The least significant bits of the desired address are an offset into the cache line, generally 4, 5, or 6 bits for a 16/32/64 byte cache line.

The next few bits of the address are an index to select the cache line. If the cache has 1024 entries, then ten bits would be used as the index. Things get a bit more complicated here due to set associativity, which lets entries occupy several different locations to improve utilization. A two way set associative cache of 1024 entries would take 9 bits from the address and then check two possible locations. A four way set associative cache would use 8 bits. Etc.


Page tag

Least significant bits as page offset, upper bits as page tag

Separately, the CPU defines a page size for the virtual memory system. 4 and 8 Kilobytes are common. The least significant bits of the address are the offset within the page, 12 or 13 bits for 4 or 8 K respectively. The most significant bits are a page number, used by the CPU cache as a tag. The hardware fetches the tag of the selected cache lines to check against the upper bits of the desired address. If they match, it is a cache hit and no access to DRAM is needed.

To reiterate: the tag is not the remaining bits of the address above the index and offset. The bits to be used for the tag are determined by the page size, and not directly tied to the details of the CPU cache indexing.


Virtual versus Physical

In the initial few stages of processing the load instruction the CPU has only the virtual address of the desired memory location. It will look up the virtual address in its TLB to get the physical address, but using the virtual address to access the cache is a performance win: the cache lookup can start earlier in the CPU pipeline. It's especially advantageous to use the virtual address for the cache index, as that processing happens earlier.

The tag is almost always taken from the physical address. Virtual tagging complicates shared memory across processes: the same physical page would have to be mapped at the same virtual address in all processes. That is an essentially impossible requirement to put on a VM system. Tag comparison happens later in the CPU pipeline, when the physical address will likely be available anyway, so it is (almost) universally taken from the physical address.

This is where page coloring comes into the picture.


 

Virtually Indexed, Physically Tagged

From everything described above, the size of the page tag is independent of the size of the cache index and offset. They are separate decisions, and frankly the page size is generally mandated. It is kept the same for all CPUs in a given architectural family even as they vary their cache implementations.

Consider then, the impact of a series of design choices:

  • 32 bit CPU architecture
  • 64 byte cache line: 6 bits of cache line offset
  • 8K page size: 19 bits of page tag, 13 bits of page offset
  • 512 entries in the L1 cache, direct mapped. 9 bits of cache index.
  • virtual indexing, for a shorter CPU pipeline. Physical tagging.
  • write back
Virtually indexed, physically tagged, with 2 bits of page color

What does this mean? It means the lower 15 bits of the virtual address and the upper 19 bits of the physical address are referenced while looking up items in the cache. Two of the bits overlap between the virtual and physical addresses. Those two bits are the page color. For proper operation, this CPU requires that all processes which map in a particular page do so at the same color. Though in theory the page could be any color so long as all mappings are the same, in practice the virtual color bits are set the same as the underlying physical page.
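A small sketch of the bit arithmetic for this design point; the addresses are arbitrary example values.

LINE_OFFSET_BITS = 6    # 64 byte cache line
INDEX_BITS = 9          # 512 entries, direct mapped
PAGE_OFFSET_BITS = 13   # 8K pages
COLOR_BITS = LINE_OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS   # = 2

def cache_index(addr):
    return (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def color(addr):
    # The color is the overlap: page number bits which also feed the cache index.
    return (addr >> PAGE_OFFSET_BITS) & ((1 << COLOR_BITS) - 1)

vaddr, paddr = 0x40002000, 0x1FFFA000   # a hypothetical virtual-to-physical mapping
assert color(vaddr) == color(paddr), "the VM system must map pages at matching colors"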

The impact of not enforcing page coloring is dire. A write in one process will be stored in one cache line; a read from another process will access a different cache line and see stale data.

Page coloring like this places quite a burden on the VM system, and one which would be difficult to retrofit into an existing VM implementation. OS developers would push back against new CPUs which proposed to require coloring, and you used to see CPU designs making fairly odd tradeoffs in their L1 cache because of it. HP PA-RISC used a very small (but extremely fast) L1 cache. I think they did this to use direct mapped virtual indexing without needing page coloring. There were CPUs with really insane levels of set associativity in the L1 cache, 8 way or even 16 way. This reduced the number of index bits to the point where a virtual index wouldn't require coloring.


Thursday, July 28, 2011

ARP By Proxy

It started, as things often do nowadays, with a tweet. As part of a discussion of networking-fu I mentioned Proxy ARP, and said that it was no longer used. Ivan Pepelnjak corrected me: it does still have a use. He wrote about it last year. I've tried to understand it, and wrote this post so I can come back to it later to remind myself.


 
Wayback Machine to 1985

ifconfig eth0 10.0.0.1 netmask 255.255.255.0

That's it, right? You always configure an IP address plus subnet mask. The host will ARP for addresses on its subnet, and send to a router for addresses outside its subnet.

Yet it wasn't always that way. Subnet masks were retrofitted into IPv4 in the early 1980s. Before that there were no subnets. The host would AND the destination address with a class A/B/C mask, and send to the ARPANET for anything outside of its own network. Yes, this means a class A network would expect to have all 16 million hosts on a single Ethernet segment. This seems ludicrous now, but until the early 1980s it wasn't a real-world problem. There just weren't that many hosts at a site. The IPv4 address space was widely perceived as being so large as to be infinite; only a small number of addresses would actually be used.

Aside: in the 1980s the 10.0.0.1 address had a different use than it does now. Back then it was the ARPANET, the way you would send packets around the world. When the ARPANET was decommissioned, the 10.x.x.x space was made available for its modern use: non-globally routed hosts.

Old host does not implement subnets, needs proxy ARP by router

There was a period of several years where subnet masks were gradually implemented by the operating systems of the day. My recollection is that BSD 4.0 did not implement subnets while 4.1 did, but this is probably wrong. In any case, once an organization decided to start using subnets it would need a way to deal with stragglers. The solution was Proxy ARP.

It's easy to detect a host which isn't using subnets: it will ARP for addresses which it shouldn't. The router examines incoming ARPs and, if the target is off-segment, responds with its own MAC address. In effect the router would impersonate the remote system, so that hosts which didn't implement subnet masking could still function in a subnetted world. The load on the router was unfortunate, but worthwhile.
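As a sketch of that logic, here is a toy Proxy ARP responder using Scapy. The interface, MAC address, and subnet are made-up example values, and a real router does this in its forwarding code rather than in a script.

from scapy.all import ARP, Ether, sniff, sendp
from ipaddress import ip_address, ip_network

IFACE = "eth0"
MY_MAC = "00:11:22:33:44:55"
LOCAL_SUBNET = ip_network("192.0.2.0/24")   # addresses actually on this segment

def maybe_proxy(pkt):
    if ARP not in pkt or pkt[ARP].op != 1:  # only answer who-has requests
        return
    if ip_address(pkt[ARP].pdst) in LOCAL_SUBNET:
        return                              # on-segment: let the real host answer
    # Off-segment target: impersonate it, answering with our own MAC.
    reply = Ether(dst=pkt[Ether].src, src=MY_MAC) / ARP(
        op=2, hwsrc=MY_MAC, psrc=pkt[ARP].pdst,
        hwdst=pkt[ARP].hwsrc, pdst=pkt[ARP].psrc)
    sendp(reply, iface=IFACE, verbose=False)

sniff(filter="arp", prn=maybe_proxy, store=0, iface=IFACE)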


 
Proxy ARP Today

That was decades ago. Yet Proxy ARP is still implemented in modern network equipment, and has some modern uses. One such case is in Ethernet access networks.

Subscriber network where each user gets a /30 block

Consider a network using traditional L3 routing: you give each subscriber an IP address on their own IP subnet. You need to have a router address on the same subnet, and you need a broadcast address. Needing 3 IPs per subscriber means a /30. That's 4 IP addresses allocated per customer.

There are some real advantages to giving each subscriber a separate subnet and requiring that all communication go through a router. Security is one, not allowing malware to spread from one subscriber to another without the service provider seeing it. Yet burning 4 IP addresses for each customer is painful.


Subscriber network using a /24 for all subscribers on the switch

To improve the utilization of IP addresses, we might configure the access gear to switch at L2 between subscribers on the same box. Now we only allocate one IP address per subscriber instead of four, but we expose all other subscribers in that L2 domain to potentially malicious traffic which the service provider cannot police.

We also end up with an inflexible network topology: it becomes arduous to change subnet allocations, because subscriber machines know how big the subnets are. As DHCP leases expire the customer systems should eventually learn of a new mask, but people sometimes do weird things with their configuration.


Subscriber network using a /24 for all subscribers on the switch

A final option relies on proxy ARP to decouple the subscriber's notion of the netmask from the real network topology. I'm basing this diagram on a comment by troyand on ioshints.com. Each subscriber is allocated a vlan by the distribution switch. The vlans themselves are unnumbered: no IP address. The subscriber is handed an IP address and netmask by DHCP, but the subscriber's netmask doesn't correspond to the actual network topology. They might be given a /16, but that doesn't mean sixty four thousand other subscribers are on the segment with them. The router uses Proxy ARP to catch attempts by the subscriber to communicate with nearby addresses.

This lets service providers get the best of both worlds: communication between subscribers goes through the service provider's equipment so it can enforce security policies, but only one IPv4 address per subscriber.


Saturday, July 23, 2011

Tweetalytics

Twitter logo

Until this week I thought Twitter would focus on datamining the tweetstream rather than adding features for individual users. I based this in part on mentions by Fred Wilson of work by Twitter on analytics. I've been watching for evidence of changes I expected to be made in the service, intending to write about it if they appeared.

Earlier this week came news of a shakeup in product management at Twitter. Jack Dorsey seems much more focussed on user-visible aspects of the service, and I'm less convinced that backend analytics will be a priority now. Therefore I'm just going to write about the things I'd been watching for.

To reiterate: these are not things Twitter currently does, nor do I know they're looking at it. These are things which seemed logical, and would be visible outside the service.


Wrap all links: URLs passing through the firehose can be identified, but knowing what gets clicked is valuable. The twitter.com web client already wraps all URLs using t.co, regardless of their length. Taking the next step to shorten every non-t.co link passing through the system would be a way to get click data on everything. There is a downside in added latency to contact the shortener, but that is a product tradeoff to be made.


Unique t.co per retweet: There is already good visibility into how tweets spread through the system, by tracking new-style retweets and URL search for manual RTs. What is not currently visible is the point of egress from the service: which retweet actually gets clicked on. This can be useful if trying to measure a user's influence. An approximation can be made by looking at the number of followers, but that breaks down when retweeters have a similar number of followers. Instead, each retweet could generate a new t.co entry. The specific egress point would be known because each would have a unique URL.


Tracking beyond tweets: t.co tracks the first click. Once the link is expanded, there is no visibility into what happens. Tracking its spread once it leaves the service would require work with the individual sites, likely only practical for the top sites passing through the tweetstream. Tracking information could be automatically added to URLs before shortening, in a format suitable for the site's analytics. For example a utm_medium=tweet parameter could be added to the original URL. There might be some user displeasure at having the URL modified, which would have to be taken into account.
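A sketch of what that URL rewriting might look like; the utm_medium parameter comes from the idea above, while the shortener call at the end is a stand-in rather than any real Twitter API.

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def add_tracking(url, medium="tweet"):
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query.setdefault("utm_medium", medium)
    return urlunparse(parts._replace(query=urlencode(query)))

long_url = add_tracking("http://example.com/article?id=7")
# long_url == "http://example.com/article?id=7&utm_medium=tweet"
# short_url = shorten_to_tco(long_url)   # hypothetical shortener call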


Each of these adds more information to be datamined by publishers. They don't result in user-visible features, and I suspect that as of a couple days ago user-visible features became a far higher priority.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Social label.


Monday, July 18, 2011

Python and XML Schemas

Python Logo My current project relies on a large number of XML Schema definition files. There are 1,600 types defined in various schemas, with actions for each type to be implemented as part of the project. A previous article examined CodeSynthesis XSD for C++ code generation from an XML Schema. This time we'll examine two packages for Python, GenerateDS and PyXB. Both were chosen based on their ability to feature prominently in search results.

In this article we'll work with the following schema and input data, the same used in the previous C++ discussion. It is my HR database of minions, for use when I become the Evil Overlord.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>


<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:noNamespaceSchemaLocation="schema.xsd" loyalty="0.2">
  <name>Agent Smith</name>
  <rank>Member of Minion Staff</rank>
  <serial>2</serial>
</minion>

The Python ElementTree can handle XML documents, so why generate code at all? One reason is simple readability.

Generated Code       ElementTree
m.name               m.find("name").text

A more subtle reason is to catch errors earlier. Because working with the underlying XML relies on passing in the node name as a string, a typo or misunderstanding of the XML schema will result in not finding the desired element and/or an exception. This is what unit tests are supposed to catch, but as the same developer implements the code and the unit test it is unlikely to catch a misinterpretation of the schema. With generated code, we can use static analysis tools like pylint to catch errors.


 

GenerateDS

The generateDS python script processes the XML schema:

python generateDS.py -o minion.py -s minionsubs.py minion.xsd

The generated code is in minion.py, while minionsubs.py contains an empty class definition for a subclass of minion. The generated class uses ElementTree for XML support, which is in the standard library in recent versions of Python. The minion class has properties for each node and attribute defined in the XSD. In our example this includes name, rank, serial, and loyalty.

import minion_generateds as minion
if __name__ == '__main__':
  m = minion.parse("minion.xml")
  print '%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty)

 

PyXB

The pyxbgen utility processes the XML schema:

pyxbgen -u minion.xsd -m minion

The generated code is in minion.py. The PyXB file is only 106 lines long, compared with 548 lines for GenerateDS. This doesn't tell the whole story, as the PyXB generated code imports the pyxb module where the generateDS code only depends on system modules. The pyxb package has to be pushed to production.

Very much like generateDS, the PyXB class has properties for each node and attribute defined in the XSD.

import minion_pyxb as minion
if __name__ == '__main__':
  xml = file('minion.xml').read()
  m = minion.CreateFromDocument(xml)
  print '%s: %s, #%d (%f)' % (m.name, m.rank, m.serial, m.loyalty)

 

Pylint results

A primary reason for this exercise is to catch XML-related errors at build time, rather than exceptions in production. I don't believe unit tests are an effective way to verify that a developer has understood the XML schema.

To test this, a bogus 'm.fooberry' property reference was added to both test programs. pylint properly flagged a warning for the generateDS code.

E: 15: Instance of 'minion' has no 'fooberry' member (but some types could not be inferred)

pylint did not flag the error in the PyXB test code. I believe this is because PyXB doesn't name the generated class minion; instead it is named CTD_ANON, with a runtime binding within the PyXB framework to "minion." pylint does a purely static analysis, and this kind of arrangement is beyond its ken.

class CTD_ANON (pyxb.binding.basis.complexTypeDefinition):
  ...

minion = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace,
           u'minion'), CTD_ANON)

 

Conclusion

As a primary goal of this effort is error detection via static analysis, we'll go with generateDS.


Saturday, July 16, 2011

Billions and Billions

In March, 2010 there were 50 million tweets per day.

In March, 2011 there were 140 million tweets per day.

In May, 2011 there were 155 million tweets per day.

Yesterday, apparently, there were 350 billion tweets per day.

350 million tweets/day would have been an astonishing 2.25x growth in just two months, where previously tweet volume has been increasing by 3x per year. 350 billion tweets/day is an unbelievable 2258x growth in just two months.

Quite unbelievable. In fact, I don't believe it.

350 billion tweets per day means about 4 million tweets per second. With metadata, each tweet is about 2500 bytes uncompressed. In May 2011 the Tweet firehose was still sent uncompressed, as not all consumers were ready for compression. 4 million tweets per second at 2500 bytes each works out to 80 Gigabits per second. Though it's possible to build networks that fast, I'll assert without proof that it is not possible to build them in two months. Even assuming good compression is now used to get it down to ~200 bytes/tweet, that still works out to an average of 6.4 Gigabits per second. Peak tweet volumes are about 4x average, which means the peak would be 25 Gigabits per second. 25 Gigabits per second is a lot for modern servers to handle.
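The arithmetic, for the skeptical:

tweets_per_day = 350e9
bytes_per_tweet = 2500                   # uncompressed, with metadata
tweets_per_sec = tweets_per_day / 86400  # ~4 million
print(tweets_per_sec * bytes_per_tweet * 8 / 1e9)   # ~81 Gbps average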

I think TwitterEng meant to say 350 million tweets per day. That's still breathtaking growth in the volume of data in just two months, and Twitter should be congratulated for operating the service so smoothly in the face of that growth.


Update: Daniel White and Atul Arora both noted that yesterday's tweet claimed 350 billion tweets delivered per day, where previous announcements have only discussed tweets per day. That probably means 350 billion recipients per day, or the number of tweets times the average fanout.


Update 2: In an interview on July 19, 2011 Twitter CEO Dick Costolo said 1 billion tweets are sent every 5 days, or 200 million tweets per day. This is more in line with previous growth rates.


Wednesday, July 13, 2011

Essayists and Orators

Recently Kevin Rose redirected his eponymous domain to his Google+ profile, reflecting that "G+ gives me more (real-time) feedback and engagement than my blog ever did." Earlier this year Steve Rubel deleted thousands of blog posts from older TypePad and Posterous sites, and started afresh on Tumblr.

Moving the center of one's online presence to "where the action is" is not a new phenomenon. In 2008 Robert Scoble essentially abandoned his own sites in order to spend time on FriendFeed, the hot new social networking site at that time. Techcrunch even attempted an intervention over the move. After the Facebook acquisition of FriendFeed the site gradually decayed through benign neglect. Scobleizer moved on long ago.

Why do this? Surely it's better to own your own domain and control your destiny? Or is it?


 
Essayists And Orators

In this discussion we'll focus on people who are online for more than just casual interaction or journaling, who have specific goals they are trying to accomplish with their online presence.

Essayists publish thoughtful prose, focussed on a particular topic. Presentation and style are important, but generally secondary to the density of ideas within. The product of their labor comes slowly, and is intended to stand for considerable time.

Orators can also deliver thoughtful ideas and spend considerable time preparing for it, but the dynamics are very different. The pace is faster, the interaction more frequent with less time to consider. The delivery and ideas can be adjusted over time, with each new presentation.

Translated to their online equivalents, I think we can still recognize the Essayist and Orator archetypes based on what they want people to find when they search. The world is a larger place now, when we want to know something outside of our knowledge we search for it.

For an Essayist, the desired result is a post with thoughts on the topic, linked to their name. For an Orator, the desired result is a conclusion that the orator is knowledgeable about the topic.

For an Essayist it's important to keep material available for people to find, and in a form which links back to the author. Considerable effort has been spent to provide value up front. If someone needs more they can contact the author, who can provide additional help freely or with suitable compensation. Hosting on one's own site allows the linking of authorship to original material, and provides a stable contact point.

For an Orator, it's more important that people find the author's name as someone knowledgeable about the topic. An Orator seeks contact much earlier in the process than an Essayist. They want a followup search to be for their name, to find out how to contact them. This desire for contact earlier in the process implies that the Orator will interact freely on many topics. At some point, if the searcher becomes convinced they can benefit from the Orator's expertise, they may discuss terms for further help.

For an Orator, it's less important to have a stable presence online. The desired result is for someone to seek them out personally, and even if they move from one site to another, search engines can be depended on to find their most recent incarnation.

I suspect this categorization paints with too broad a brush, as no one corresponds exactly to either archetype, but I'm finding it useful to consider.

Pictures of Abraham Lincoln and Frederick Douglass.

Lincoln and Douglass pictures courtesy Wikimedia Commons. Both are in the public domain in the United States.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Social label.


Tuesday, July 12, 2011

CodeSynthesis XSD Data Binding

Nowadays I make a habit of writing up how to use particular tools or techniques for anything which might be useful to reference later. Many techniques I worked on before starting this practice are now lost to me, locked away in proprietary source code at some previous employer.

This post concerns data binding from XML schemas in C++, generating classes rather than manipulating the underlying XML. As it's written for Future Me, it might not be so interesting to those who are not Future Me.

Consider the simple XML schema shown below. I aspire to be the Evil Overlord, and am working on the HR system to keep track of my innumerable minions.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>

It would be possible to parse documents created from this schema manually, using something like libexpat or Xerces. Unfortunately as the schema becomes large, the likelihood of mistakes in this manual process becomes overwhelming.

I chose instead to work with CodeSynthesis XSD to generate classes from the schema, based mainly on the Free/Libre Open Source Software Exception in their license. This project will eventually be released under an Apache-style license, and all other data binding solutions I found for C++ were either GPL or a commercial license.


 
Parsing from XML

The generated code provides a number of function prototypes to parse XML from various sources, including iostreams.

std::istringstream agent_smith(
  "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>"
  "<minion xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
  "xsi:noNamespaceSchemaLocation=\"schema.xsd\" loyalty=\"0.2\">"
  "<name>Agent Smith</name>"
  "<rank>Member of Minion Staff</rank>"
  "<serial>2</serial>"
  "</minion>");
std::auto_ptr<minion> m;

try {
  m = minion_(agent_smith);
} catch (const xml_schema::exception& e) {
  std::cerr << e << std::endl;
  return;
}

The minion object now contains data members with proper C++ types for each XML node and attribute.

std::cout << "Name: " << m->name() << std::endl
          << "Loyalty: " << m->loyalty() << std::endl
          << "Rank: " << m->rank() << std::endl
          << "Serial number: " << m->serial() << std::endl;

 
Serialization to XML

Methods to serialize an object to XML are not generated by default, the --generate-serialization flag has to be passed to xsdcxx. This emits another series of minion_ methods, which take output arguments.

int main() {
  minion m("Salacious Crumb", "Senior Lackey", 1, 0.9);
  minion_(std::cout, m);
}

This sends the XML to stdout.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion loyalty="0.9">
  <name>Salacious Crumb</name>
  <rank>Senior Lackey</rank>
  <serial>1</serial>
</minion>

CodeSynthesis relies on Xerces-C++ to provide the lower layer XML handling, so all of the functionality of that library is also available to the application.

That's enough for now. See you later, Future Me.


Friday, July 8, 2011

Hop By Hop TCP

Last week's post discussed how Ethernet CRCs don't cover what we think they cover. Surely the TCP checksum, as an end to end mechanism, provides at least a modicum of protection?

Unfortunately in today's Internet, no it doesn't. The TCP checksum is no longer end to end.

Path across the Internet showing every switch and link in green, protected by the TCP checksum.

Our mental model has the client sending packets all the way to the server at the other end. In reality there is a huge variety of gear which spoofs the TCP protocol or interposes itself in the connection, and the checksum is routinely discarded and regenerated before making it to the server. We'll look at some examples.


 
Load Balancers

Path across the Internet to the load balancer, which is shown in red. That is where the TCP checksum stops.

In a typical datacenter, the servers sit behind a load balancer. The simplest such equipment distributes sessions without modifying the packets, but the market demands more sophisticated features like:

  • SSL offload, allowing the servers to handle unencrypted sessions
  • Optimized TCP Window for clients on broadband, dialup, or wireless.
  • Allow jumbo frames within the data center, and normal sized frames outside.

All of these features require the load balancer to be the endpoint of the TCP session from the client, and initiate an entirely separate TCP connection to the server. The two connections are linked in that when one side is slow the other will be flow controlled, but are otherwise independent. The TCP checksum inserted by the client is verified by the load balancer, then discarded. It doesn't go all the way to the server it is communicating with.


 
WAN Optimization

Path across the Internet via WAN optimizers, which are shown in red. That is where the TCP checksum stops.

High speed wide area networks are expensive. If a one-time purchase of a magic box at each end can reduce the monthly cost of the connection between them, then there is an economic benefit to buying the magic box. WAN optimization products reduce the amount of traffic over the WAN using a number of techniques, most notably compression and deduplication.

The box watches the traffic sent by clients and modifies the data sent across the WAN. At the other end it restores the data to what was originally sent. In some modes it will carry the TCP checksums across the WAN to re-insert into the reconstructed packets. However the gear typically also offers features to tune TCP performance and window behavior (especially for satellite links), and these result in modifying the packets with calculation of new checksums.


 
NAT

NAT involves overwriting the IP addresses and TCP/UDP port numbers, which impacts the TCP checksum. However NAT is somewhat special: it is purely overwriting data, not expanding it. Therefore it can keep the existing checksum from the packet, subtract the old value from it, and add the new. As this is less expensive than to recalculate the checksum afresh, most NAT implementations do so.

Thus though the checksum is modified, most NAT implementations don't have the potential to corrupt data and then calculate a fresh checksum over the corruption. Yay!
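A minimal sketch of that incremental update, in the spirit of RFC 1624; the helper below is my own illustration, not code from any particular NAT implementation.

def csum16_update(old_csum, old_word, new_word):
    # Work on the one's-complement sum (~checksum): fold out the old 16-bit
    # word, fold in the new one, then complement again.
    s = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    s = (s & 0xFFFF) + (s >> 16)   # fold carries back in
    s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

# Example: NAT rewrites source port 0x1234 to 0x5678 and patches the checksum.
new_csum = csum16_update(0xBEEF, 0x1234, 0x5678)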


 
HTTP Proxies

Path across the Internet through an HTTP proxy, which is shown in red. That is where the TCP checksum stops.

Organizations set up HTTP proxies to enforce security policies, cache content, and a host of other reasons. Transparent HTTP proxies are set up without cooperation from the client machine, simply grabbing HTTP sessions which flow by on the network. For the web to work the proxy has to modify the data sent by the client, if only to add an X-Forwarded-For header. Because the proxy expands the amount of data sent, it ends up repacking data in frames. Therefore proxies generally calculate a new checksum; they can't just update it the way NAT boxes do.


 
Conclusion

The Internet has evolved considerably since TCP was invented, and our mental model no longer matches reality. Many connections have their checksum recalculated repeatedly: by a proxy at their own site, a WAN optimizer somewhere along the way, and a load balancer at the far end. The checksum has essentially been turned into a hop by hop mechanism.

This isn't necessarily bad, and it has allowed the Internet to expand to its current reach. The point of this post and the earlier musing is simply that the network cannot be depended on to ensure the integrity of data. End to end protection means including integrity checks within the payload.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Thursday, July 7, 2011

rel="twitter me"

Apparently Alyssa Milano used her Twitter account to verify her Google+ account.

Alyssa Milano tweet: this is my real Google+ account

XFN 1.1 defined a number of relationship attributes for links, one of which is rel="me"

rel="me" A link to yourself at a different URL. Exclusive of all other XFN values. Required symmetric. There is an implicit "me" relation from the contents of a directory to the directory itself.

Clearly, we need rel="me twitter" for situations like this.

Tuesday, July 5, 2011

On the Naming of Our Relationships

I've never been happy with the use of a single word to describe associations on a social network, whether that word is "friending" or "following." Human relationships have a huge range of possibilities, and we use subtle variations in wording and adjectives to add nuance to our descriptions of them. To take just one example, "godmother," "stepmother," and "mother" all convey a beloved relationship, yet with vastly different levels of parental (and biological) involvement.

On Twitter I've tried to use Lists to broadly categorize those I follow into groups, mostly focussed on topics. An account can have at most twenty lists, so I've tended to be sparing in creating them. As Twitter has recently de-emphasized lists in the UI, I don't anticipate much more development there.

In the past week I've spent a lot of time on the Google+ Field Trial. There are a number of things I like about it, but one favorite is that I get to name the Circles I use. I can choose the terminology to define our relationship, how I see it from my own perspective.

Friends, BFFs, Family, Extended Family circles.

I also use circles to focus on particular topics, or on groups I am associated with. I rarely post to these circles, mostly just read.

FriendFeeders, Googlers, Journalistas, Networking, Scobleizers.

It's very liberating to be able to put a name on one's associations.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Social label.