Coding Relic: architecture

Showing posts with label architecture. Show all posts

Monday, October 10, 2011

ld-http.so

In the last decade we have enjoyed a renaissance of programming language development. Clojure, Scala, Python, C#/F#/et al, Ruby (and Rails), Javascript, node.js, Haskell, Go, and the list goes on. Development of many of those languages started in the 1990s, but adoption accelerated in the 2000s.

Why now? There are probably a lot of reasons, but I want to opine on one.

HTTP is our program linker.

We no longer have to worry about linking to a gazillion libraries written in different languages, with all of the compatibility issues that entails. We no longer build large software systems by linking it all into ginormous binaries, and that loosens a straightjacket which made it difficult to stray too far from C. We dabbled with DCE/CORBA/SunRPC as a way to decouple systems, but RPC marshaling semantics still dragged in a bunch of assumptions about data types.

It took the web and the model of software as a service running on server farms to really decompose large systems into cooperating subsystems which could be implemented any way they like. Facebook can implement chat in Erlang, Akamai can use Clojure, Google can mix C++ with Java/Python/Go/etc. It is all connected together via HTTP, sometimes carrying SOAP or other RPCs, and sometimes with RESTful interfaces even inside the system.

Friday, August 26, 2011

QFabric Conclusion

This is the fourth and final article in a series exploring the Juniper QFabric. Earlier articles provided an overview, a discussion of link speed, and musings on flow control. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article attempts to cover a few loose ends, and wraps up the series.

As with previous days, the flow control post sparked an interesting discussion on Google+.

Whither Protocols?

Director connected to edge node and peered with another switch For a number of years switch and router manufacturers competed on protocol support, implementing various extensions to OSPF/BGP/SpanningTree/etc in their software. QFabric is almost completely silent about protocols. In part this is a marketing philosophy: positioning the QFabric as a distributed switch instead of a network means that the protocols running within the fabric are an implementation detail, not something to talk about. I don't know what protocols are run between the nodes of the QFabric, but I'm sure its not Spanning Tree and OSPF.

Yet QFabric will need to connect to other network elements at its edge, where the datacenter connects to the outside world. Presumably the routing protocols it needs are implemented in the QF/Director and piped over to whichever switch ports connect to the rest of the network. If there are multiple peering points, they need to communicate with the same entity and a common routing information base.

Flooding Frowned Upon

The edge Nodes have an L2 table holding 96K MAC addresses. This reinforces the notion that switching decisions are made at the ingress edge, every Node can know how to reach destination MAC addresses at every port. There are a few options for distributing MAC address information to all of the nodes, but I suspect that flooding unknown addresses to all ports is not the preferred mechanism. If flooding is allowed at all, it would be carefully controlled.

Much of modern datacenter design revolves around virtualization. The VMWare vCenter (or equivalent) is a single, authoritative source of topology information for virtual servers. By hooking to the VM management system, the QFabric Director could know the expected port and VLAN for each server MAC address. The Node L2 tables could be pre-populated accordingly.

By hooking to the VM management console QFabric could also coordinate VLANs, flow control settings, and other network settings with the virtual switches running in software.

NetOps Force Multiplier

Where previously network engineers would be configuring dozens of switches, QFabric now proposes to manage a single distributed switch. Done well, this should be a substantial time saver. There will of course be cases where the abstraction leaks and the individual Nodes have to be dealt with. The failure modes in a distributed switch are simply different. Its unlikely that a single line card within a chassis will unexpectedly lose power, but its almost certain that Nodes occasionally will. Nonetheless, the cost to operate QFabric seems promising.

Conclusion

QFabric is an impressive piece of work, clearly the result of several years effort. Though the Interconnects use merchant silicon, Juniper almost certainly started working with the manufacturer at the start of the project to ensure the chip would meet their needs.

The most interesting part of QFabric is its flow control mechanism, for which Juniper has made some pretty stunning claims. A flow control mechanism with fairness, no packet loss, and quick reaction to changes over such a large topology is an impressive feat.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Thursday, August 25, 2011

QFabric: Flow Control Considered Harmful

This is the third in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on link speeds.

Yesterday's post sparked an excellent discussion on Google+, disagreeing about overclocking of backplane links and suggesting LACP hashing as a reason for the requirement for 40 Gbps links. Ken Duda is a smart guy.

XON/XOFF

Ethernet includes (optional) link level flow control. An Ethernet device can signal its neighbor to stop sending traffic, requiring the neighbor to buffer packets until told it can resume sending. This flow control mechanism is universally implemented in the MAC hardware, not requiring software intervention on every flow control event.

Switch chip with one congested port flow controlling the entire link The flow control is for the whole link. For a switch with multiple downstream ports, there is no way to signal back to the sender that only some of the ports are congested. In this diagram, a single congested port requires the upstream to be flow controlled, even though the other port could accept more packets. Ethernet flow control suffers from head of line blocking, a single congested port will choke off traffic to other uncongested ports.

IEEE has tried to address this in two ways. Priority-based flow control, defined in IEEE 802.1Qbb, makes each of the eight classes of service flow control independently. Bulk data traffic would no longer block VoIP, for example. IEEE 802.1au is defining a congestion notification capability to send an explicit notification to compatible endstations, asking them to slow down. Per-priority pause is already appearing in some products. Congestion notification involves NICs and client operating systems, and is slower going.

QFabric

Juniper's whitepaper on QFabric has this to say about congestion:

Finally, the only apparent congestion is due to the limited bandwidth of ingress and egress interfaces and any congestion of egress interfaces does not affect ingress interfaces sending to non-congested interfaces; this noninterference is referred to as “non-blocking”

QFabric very clearly does not rely on Ethernet flow control for the links between edge nodes and interconnect, not even per-priority. It does something else, something which can reflect the queue state of an egress port on the far side of the fabric back to control the ingress. However, Juniper has said nothing that I can find about how this works. They rightly consider it part of their competitive advantage.

So... lets make stuff up. How might this be done?

Juniper has said that the Interconnect uses merchant switch silicon, but hasn't said anything about the edge nodes. As all the interesting stuff is done at the edge and Juniper has a substantial ASIC development capability, I'd assume they are using their own silicon there. Whatever mechanism they use for flow control would be implemented in the Nodes.

Most of the mechanisms I can think of require substantial buffering. QFX3500, the first QFabric edge node, has 4x40 Gbps ports connecting to the Interconnect. That is 20 gigabytes per second. A substantial bank of SRAM could buffer that traffic for a fraction of a second. For example, 256 Megabytes could absorb a bit over 13 milliseconds worth of traffic. 13 milliseconds provides a lot of time for real-time software to swing into action.

For example:

The memory could be used for egress buffering at the edge ports. Each Node could report its buffer status to all other nodes in the QFabric, several thousand times per second. Adding a few thousand packets per second from each switch is an insignificant load on the fabric. From there we can imagine a distributed flow control mechanism, which several thousand times per second would re-evaluate how much traffic it is allowed to send to each remote node in the fabric. Ethernet flow control frames could be sent to the sending edge port to slow it down.

Or rather, smarter people than me can imagine how to construct a distributed flow control mechanism. My brain hurts just thinking about it.
Hardware counters could track the rate each Node is receiving traffic from each other Node, and software could report this several times per second. Part of the memory could be used as uplink buffering, with traffic shapers controlling the rate sent to each remote Node. Software could adjust the traffic shaping several thousand times per second to achieve fairness.

Substantial uplink buffering also helps with oversubscribed uplinks. QFX3500 has 3:1 oversubscription.

I'll reiterate that the bullet points above are not how QFabric actually works, I made them up. Juniper isn't revealing how QFabric flow control works. If it achieves the results claimed it is a massive competitive advantage. I'm listing these to illustrate a point: a company willing to design and support its own switch silicon can take a wholly different approach from the rest of the industry. In the best case, they end up with an advantage for several years. In the worst case, they don't keep up with merchant silicon. I've been down both of those paths, the second is not fun.

There are precedents for the kind of systems described above: Infiniband reports queue information to neighbor IB nodes, and from this works out a flow control policy. Infiniband was designed in an earlier age when software wouldn't be able to handle the strict timing requirements, so it is defined in terms of hardware token buckets.

One thing I am sure of is that the QF/Director does not get involved in flow control. That would be a scalability and reliability problem. Juniper stated that the QFabric will continue to forward traffic even if the Director nodes have failed, the flow control mechanism cannot rely on the Directors.

Next article: wrap-up.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, August 24, 2011

The Backplane Goes To Eleven

This is the second in a series of posts exploring the Juniper QFabric. Juniper says the QFabric should not be thought of as a network but as one large distributed switch. This series examines techniques used in modular switch designs, and tries to apply them to the QFabric. This article focuses on link speeds.

How Fast is Fast?

Ethernet link speeds are rigidly specified by the IEEE 802.3 standards, for example at 1/10/40/100/etc Gbps. It is important that Ethernet implementations from different vendors be able to interoperate.

Backplane links face fewer constraints, as there is little requirement for interoperability between implementations. Even if one wanted to it isn't possible to plug a card from one vendor into another vendor's chassis, they simply don't fit. Therefore backplane links within a chassis have been free to tweak their links speeds for better performance. In the 10 Gbps generation of products backplane links have typically run at 12.5 Gbps. In the 40 Gbps Ethernet generation I'd expect 45 or 50 Gbps backplane links (I don't really know, I no longer work in that space). A well-designed SERDES for a particular speed will have a bit of headroom, enabling faster operation over high quality links like a backplane. Running them faster tends to be possible without heroic effort.

Switch chip with 48 x 1Gbps downlinks and 4 x 10/12.5 Gbps uplinks A common spec for switch silicon in the 1Gbps generation is 48 x 1 Gbps ports plus 4 x 10 Gbps. Depending on the product requirements, 10 Gbps ports can be used for server attachment or as uplinks to build into a larger switch. At first glance the chassis application appears to be somewhat oversubscribed, with 48 Gbs of downlink but only 40 Gbps of uplink. In reality, when used in a chassis the uplink ports will run at 12.5 Gbps to get 50 Gbps of uplink bandwidth.

Though improved throughput and capacity is a big reason for running the backplane links faster, there is a more subtle benefit as well. Ethernet links are specified to run at some nominal clock frequency, modulo a tolerance measured in parts per million. The crystals driving the links can be slightly on the high or low side of the desired frequency yet still be within spec. If the backplane links happen to run slightly below nominal while the front panel links are slightly higher, the result would be occasional packet loss simply because the backplane cannot keep up. When a customer notices packet loss during stress tests in their lab it is very, very difficult to convince them it is expected and acceptable. Running the backplane links at a faster rate avoids the problem entirely, the backplane can always absorb line rate traffic from a front panel port.

QFabric

This is one area where QFabric takes a different tack from a modular chassis: I doubt the links between QF/Node and Interconnect are overclocked even slightly. In QFabric those links are not board traces they are lasers, and lasers are more picky about their data rate. In the Packet Pushers podcast the Juniper folks stated that the uplinks are required to be faster than the downlinks. That is, if the edge connections are 10 Gbps then the uplinks from Node to Interconnect have to be 40 Gbps. One reason for this is to avoid overruns due to clock frequency variance, and ensure the Interconnect can always absorb the link rate from an edge port without resorting to flow control.

Next article: flow control.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Tuesday, August 23, 2011

Making Stuff Up About QFabric

This is the first of several articles about the Juniper QFabric. I have not been briefed by Juniper, nor received any NDA information. These articles are written based on Juniper's public statements and materials available on the web, supplemented with details from a special Packet Pushers podcast, and topped off with a healthy amount of speculation and guessing about how it works.

Juniper QFabric with Nodes, Interconnect, and Director QFabric consists of edge nodes wired to two or four extremely large QF/Interconnect chassis, all managed via out of band links to QF/Directors. Juniper emphasizes that the collection of nodes, interconnects, and directors should be thought of as a single distributed switch rather than as a network. Packet handling within the QFabric is intended to be opaque, and the individual nodes are not separately configured. It is supposed to behave like one enormous, geographically distributed switch.

Therefore to try to brainstorm about how the distributed QFabric works we should think in terms of how a modular switch works, and how its functions might be distributed.

Exactly One Forwarding Decision

Ingress line card with switch fabric, connected to central supervisor with fabric, connected to egress line card with fabric Modular Ethernet switches have line cards which can switch between ports on the card, with fabric cards (also commonly called supervisory modules, route modules, or MSMs) between line cards. One might assume that each level of switching would function like we expect Ethernet switches to work, forwarding based on the L2 or L3 destination address. There are a number of reasons why this doesn't work very well, most troublesome of which are the consistency issues. There is a delay between when a packet is processed by the ingress line card and the fabric, and between the fabric and egress. The L2 and L3 tables can change between the time a packet hits one level of switching and the next, and its very, very hard to design a robust switching platform with so many corner cases and race conditions to worry about.

Control header prepended to frame Therefore all Ethernet switch silicon I know of relies on control headers prepended to the packet. A forwarding decision is made at exactly one place in the system, generally either the ingress line card or the central fabric cards. The forwarding decision includes any rewrites or tunnel encapsulations to be done, and determines the egress port. A header is prepended to the packet for the rest of its trip through the chassis, telling all remaining switch chips what to do with it. To avoid impacting the forwarding rate, these headers replace part of the Ethernet preamble.

Control header prepended to frame Generally the chips are configured to use these prepended control headers only on backplane links, and drop the header before the packet leaves the chassis. There are some exceptions where control headers are carried over external links to another box. Several companies sell variations on the port extender, a set of additional ports to be controlled remotely by a chassis switch. The link to the port extender will carry the control headers which would otherwise be restricted to the backplane. Similarly, several vendors sell stackable switches. Each unit in the stack can function as an independent switch, but can be connected via stack ports on the back to function together like a larger switch. The stack ports carry the prepended control headers from one stack member to the next, so the entire collection can function like a single forwarding plane.

QFabric

In the Packet Pushers podcast and in an article on EtherealMind, the Interconnect is described as a Clos network with stages of the Clos implemented in cards in the front and back of the chassis. It is implemented using merchant silicon, not Juniper ASICs. The technology in the edge Node was not specified, it is my assumption that Juniper uses its own silicon there.

Forwarding decisions are made in the Nodes and sent to the Interconnect, which is is a pure fabric with no decision making. This would be implemented by having the Nodes send control headers on their uplinks, in a format compatible with whatever merchant silicon is used in the Interconnect plus additional information needed to support the QFabric features. Juniper would not allow themselves to be locked in to a particular chip supplier, I'm sure the QF/Node implementation would be very flexible in how it creates those headers. A new QF/Interconnect with a different chipset would be supportable via a firmware upgrade to the edge nodes.

The QF/Interconnect would in turn forward the packet to its destination with the control header intact. The destination switch would perform whatever handling was indicated in the control information, discard the extra header, and forward the packet out the egress port.

Oversubscribed QF/Nodes

One interesting aspect of the first generation QF/Node is that it is oversubscribed. The QFX3500 has 480 Gbps of downlink capacity, in the form of 48 x 10G ports. It has 160 Gbps of uplink, via 4 x 40Gbps ports. Oversubscribed line cards are not unheard of in module chassis architectures, though it is generally the result of a followon generation of cards outstrips the capacity of the backplane. There have been chassis designs where the line cards were deliberately oversubscribed, but they are somewhat less common.

QFabric has a very impressive system to handle congestion and flow control, which will be the topic of a future article. The oversubscribed uplink is a slightly different variation, but in the end is really another form of congestion for the fabric to deal with. It would buffer what it can, and assert Ethernet flow control and/or some other means of backpressure to the edge ports if necessary.

Next article: link speeds.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Monday, October 11, 2010

On the Road to Self Driving Cars

Cars driving down a highway As it was located near the center of the US auto industry, there was an extensive automotive program at the University of Michigan (Ann Arbor) with an assortment of guest speakers from the Big Three. I went to several presentations that made quite an impression. One of them was about self-driving cars... in 1991.

The system described then relied on sensors attached to the bottom of the vehicle. Major highways would be equipped with copper wires running down the center of each lane, which the car would track in order to correct its course. I don't recall if the wire would actively broadcast a signal or be passively detected, nor how they would avoid running into other cars. As only major highways would be thus equipped, the driver had to take over in order to exit the highway and transit surface streets.

The presenter at that time was emphatic that the technology would be deployed within 10 years, because the economics were compelling. It was provably cheaper to increase the carrying capacity of highways using this system than by adding lanes. The wires were rapidly installed by making a narrow slit down the roadway, inserting a flexible conduit, and sealing the road behind. It was the same process as was being used to run fiber optics across the nation at that time, and was well understood. The added cost to vehicles would be subsidized using money saved from highway budgets. After paying for road retrofits and vehicle subsidies, the system would still be substantially cheaper than the status quo.

Of course, no such scheme made it out of the test facilities. Twenty years later, self-driving car designs no longer rely on modifications to the roads. Now the cars have an extensive map of the expected topology and navigate by comparing what they sense with what they expect.

I think there are several lessons in this.

Any scheme requiring massive investment in infrastructure before benefits are seen is almost certainly doomed to fail. Large changes in infrastructure can best be accomplished incrementally, where a small investment brings a small benefit and continuing investment brings more benefit. It is far better to deploy self-driving cars and map roadways one at a time, without requiring a critical mass of highways and automobiles be deployed.
Requiring multiple investments to be made by different parties invariably leads to deadlock. Car makers wouldn't add the equipment to vehicles until there was a sufficient base of wired roads for their use. States wouldn't wire the roads until there was a sufficient population of suitable cars.
It is easy to design something to fit the infrastructure we wish we had, rather than what we really have, without realizing it. By focussing overmuch on the end state, one ignores the difficulties in getting from here to there.

Each such lesson has been shown over and over, of course. We continue to make the last mistake all the time on the Web, designing solutions which work fine except for NAT, or HTTP proxies, or URL shorteners, or some other grungy but essential detail of how the Internet actually functions.

Wednesday, August 4, 2010

node.js from 30,000 feet

I attended a tech talk last week on node.js by Ryah Dahl. The video of the talk is up on YouTube.

Node.js Overall Structure

V8+libev+libeio+libcares at the bottom, node bindings in the middle, top layer node.js standard library in JavaScript The JavaScript implementation in node.js is Google's V8. As mentioned in an earlier article, V8 compiles the source JavaScript directly to machine code the first time it is executed. There is no intermediate bytecode format and no JS interpreter. In addition to V8, node.js relies on libev for its event loop, libeio for asynchronous I/O, and c-ares for asynchronous DNS support. Like everything else in the known universe, it relies on OpenSSL for cryptography and SSL/TLS support.

A standard library in JavaScript is supplied. This provides access to the underlying C++ implementation, and also has helpful bits like a URL parser and a REPL shell for easy experimentation. One thing it does not provide is the DOM. Node.js is not a browser, there is no HTML document to interact with.

Node.js Implementation

Entry points to the C++ code appear as a JavaScript variable named process. For example, here is an excerpt from dns.js:

var dns = process.binding('cares');

'cares' refers to the c-ares DNS support library. The dns variable allows JavaScript code to make calls to c-areas.

// Easy DNS A/AAAA look up
exports.lookup = function (domain, callback) {

Notice the signature of the function: input arguments and a callback when finished. There are never blocking operations in node, everything which might not complete immediately is a callback.

  var addressType = dns.isIP(domain);
  if (addressType) {
    process.nextTick(function () {
      callback(null, domain, addressType);
    });

dns.isIP() calls into C++ code, which makes a series of inet_pton(AF_INET*) calls to figure out if the argument is a valid numeric IP address. I've omitted the C++ code here, we dive into a more interesting example below.

  } else {
    if (/\w\.local\.?$/.test(domain) ) {
      // ANNOYING: In the case of mDNS domains use NSS in the thread pool.
      // I wish c-ares had better support.
      process.binding('net').getaddrinfo(domain, 4, function (err, domains4) {
        callback(err, domains4[0], 4);
      });

Node.js has two ways to implement support routines in C++. If the C++ code is structured to be asynchronous with a callback, it can be launched from the main thread using libev. Node.js makes heavy use of async I/O for this reason. Blocking C++ calls are handled by a pool of worker threads, which send an event to the main when their operation completes. In this code snippet the 'local' domain is handled by the thread pool as a special case, because c-ares doesn't handle mDNS.

We'll come back to the thread pool code path later, after examining the common case.

    } else {
      channel.getHostByName(domain, dns.AF_INET, function (err, domains4) {
        if (domains4 && domains4.length) {
          callback(null, domains4[0], 4);
        } else {
          channel.getHostByName(domain, dns.AF_INET6, function (err, domains6) {
            if (domains6 && domains6.length) {
              callback(null, domains6[0], 6);
            } else {
              callback(err, []);
            }
          });
        }
      });
      ... etc ...

"channel" is a JavaScript variable which links to a context in the c-ares library. The JS code to create channel is omitted for brevity. Now we'll peel back one layer to look at the C++ implementation.

Handle Channel::GetHostByName(const Arguments& args) {
  HandleScope scope;
  Channel *c = ObjectWrap::Unwrap(args.Holder());
  assert(c);

  if (!args[0]->IsString()) {
    return ThrowException(Exception::Error(
          String::New("First argument must be a name")));
  }

  if (!args[1]->IsInt32()) {
    return ThrowException(Exception::Error(
          String::New("Second argument must be a family")));
  }

  if (!args[2]->IsFunction()) {
    return ThrowException(Exception::Error(
          String::New("Third argument must be a callback")));
  }

  int family = args[1]->Int32Value();
  if (family != AF_INET6 && family != AF_INET) {
    return ThrowException(Exception::Error(
          String::New("Unsupported address family")));
  }

Argument unwrapping and validity checks when traversing the interface from one programming language are always tedious. You can never predict when someone will copy the channel.getHostByName invocation out of the standard library and mess with it, and you'd like the framework to do something sane no matter what they do.

  String::Utf8Value name(args[0]->ToString());

  ares_gethostbyname(c->channel, *name, family, HostByNameCb, cb_persist(args[2]));

  return Undefined();
}

Thats it. ares_gethostbyname() is in the C-ARES library, which we won't delve into here. HostByNameCb is the C++ callback function when resolution is done. HostByNameCb injects an event to the JavaScript code, to call the callback function passed in to the original call.

The JavaScript has an alternate code path for mDNS requests, using the getaddrinfo() method on process.binding('net'). Most of that code path consists of the same sort of argument unwrapping and checking as GetHostByName, which we will omit. The mDNS code path uses a blocking DNS request, serviced by the thread pool. The code to send work to the pool and arrange a callback later is pleasingly simple:

  eio_custom(Resolve, EIO_PRI_DEFAULT, AfterResolve, rreq);

Resolve is the function the worker thread is supposed to call. AfterResolve is the callback function in the main loop which the worker thread should trigger when done.

Final Thoughts

Node.js makes it easy to develop high performance applications by not offering APIs which would drastically lower performance. Everything is a callback, there are no blocking calls in the API (except for initialization calls such as module loading). Where the underlying C++ implementation is also based on callbacks, this is straightforward. Where the underlying C++ code would block, the implementation becomes a somewhat more difficult exercise in thread management.

The JavaScript API in node.js "feels" very much like JavaScript. I believe a main factor making this possible is the relatively small number of entry points required from the JavaScript down into the C++ code: sockets, DNS resolution, the http parsing library, etc. It was feasible for each interface to be lovingly crafted by hand, baking JavaScriptiness into the API.

Attempting this technique for software like GUIs, where the number of C/C++ APIs to bind to is enormous, would likely require a more automated linkage between JavaScript and C++. This is the world of things like SWIG to generate interfaces or libffi to make direct calls. SWIG and libffiare extremely useful in their niches, but definitely have the feel of a foreign intruder in the host language. I don't know that a node.js for GUIs would be as pleasant a thing to look upon, but we need a way to do so. Software needs to advance without having to continually reinvent and reimplement what has come before, and without requiring drastic amounts of manual effort.

Thursday, April 8, 2010

Simple Checksums Considered Harmful

Lets talk about iSCSI for a moment, as a launching point for a discussion about data integrity. iSCSI relies on CRC32 to catch data corruption. CRC32 is a good fit for this purpose, but most previous uses of it had been confined to very low levels of the system and implemented in hardware. iSCSI uses CRC32 way up in the protocol header, where it is generally computed in software. The overhead of computing the CRC is one reason why so many hardware offload adaptors were developed for iSCSI.

Intel recently released a whitepaper describing how they achieved 1 million iSCSI operations per second. One fascinating tidbit is that the CRC32 is no longer a bottleneck. The Nehalem architecture includes an instruction to compute it directly, as part of SSE 4.2. The new instruction is described in the Intel64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M. It is on page 3-221 of the December 2009 edition; search for CRC32 in later editions.

CRC32 r32, r/m8	Accumulate CRC32 on r/m8
CRC32 r32, r/m16	Accumulate CRC32 on r/m16
CRC32 r32, r/m32	Accumulate CRC32 on r/m32
CRC32 r64, r/m8	Accumulate CRC32 on r/m8
CRC32 r64, r/m64	Accumulate CRC32 on r/m64

Thats it. You load words from memory and hand them to the CRC32 instruction. If you were already making a pass over the data for any reason, the CRC calculation is free. Table-driven CRC generation implementations were already fast, but this is even faster.

What does this mean? I think it means weak checksums should no longer be used for anything. Applications which care about data integrity moved to MD5 or SHA1 years ago, but you still see specifications in other contexts written to use Adler-32 or even the venerable 16-bit TCP checksum. Its not appropriate to use these any more. Server CPUs can compute CRC32 for free, and embedded CPUs have long included CRC32 calculation in DMA engines.

Wednesday, March 24, 2010

Player Piano Torpedoes

March 24, 2010 is Ada Lovelace day, an informal holiday to celebrate the achievements of women in technology and science. I'd like to share a fascinating technology story about Hedy Lamarr. Ms Lamarr was a contract star at MGM during the Golden Age of Hollywood, in the 1930s and 40s. She was also a creative and mathematically talented inventor. Today, we would proudly call her a geek.

From US Patent 2,292,387, by Hedy Kiesler Markey and George Antheil:

"This invention relates broadly to secret communication systems involving the use of carrier waves of different frequencies, and is especially useful in the remote control of dirigible craft, such as torpedoes.

Our system... employs a pair of synchronous records, one at the transmitting station and one at the receiving station, which change the tuning of the transmitting and receiving apparatus from time to time..."

Two signals are sent, labelled L and R and controlling the left and right rudders of the torpedo. L is indicated by sending a 100 Hz signal over a carrier, R by 500 Hz. Remotely controlled torpedoes had been used before the 1940s, but were often jammed by the target because the control frequency was relatively easy to detect. The innovation in this patent is the use of perforated rolls of paper to modulate the frequency rapidly enough that the enemy would not be able to predict it, making jamming difficult. The perforated rolls of paper were commonly used in player pianos of the time, requiring no special development.

In the patent application seven rows of perforations were used to control the frequency of the carrier. An eighth row of perforations lights a small lamp at the transmitting station. Three of the seven transmission frequencies were dummies which would not actually be received by the torpedo, while the lamp informed the torpedo operator when the weapon was out of contact. The intent of the dummy frequencies appears to be to mislead the enemy and make it more difficult to determine how the control system worked. Some seemingly valid transmission would not be acted upon by the torpedo, while others would.

Rows A-G tune the radio to one of 7 frequencies.
Row H controls a lamp for the operator when the dummy frequencies A-C are in use.

For the transmitter and receiver to frequency hop in sync, the tape reels must begin rolling at very close to the same time and the speed of the winding must have a reasonably tight tolerance. Machined springs available in the 1930s were sufficiently precise to maintain this for several minutes, long enough to guide a torpedo to its target.

All in all its a fascinating invention which repurposed existing technology for a new purpose, in fighting the Pacific War. Unfortunately the rest of the story is not a happy one, as the invention was not taken seriously by the War Department. By the time the communication industry reinvented spread spectrum communications in the 1950s, this patent had expired.

In 1997 the EFF recognized Ms Lamarr and Mr Antheil's achievement with a Pioneer award.

Thursday, February 18, 2010

Email Contact Virii

The Matrix image We're all familiar with computer viruses: bits of code which attempt to transmit themselves to another system, from which they look for yet more systems to infect. The key point is that the virus must transmit code to infect another system.

Very occasionally there are more obscure examples of viral behavior, of data alone unaccompanied by code. Certainly, there is code involved, but it is the normal code of the system which performs a useful function and is not intended to spread data virally.

The contacts viral spread For example: people usually refer to me by the nickname Denny. My personal email address is denny@geekhold.com (1) but my work address is not "denny". One person had my personal email address. Shortly after I started he sent an email to a few team members. He typed "Denny," and GMail helpfully auto-completed my personal account from his contacts. When another member of the team responded to that email, the address was automatically added to his contacts. When he later sent an email to another set of coworkers... the Contact virus spread further.

The trouble with data-driven viral behavior is that there are few tools to stamp it out. With code viruses, a huge ecosystem of tools and malware signatures has been created. There are few tools to deal with an annoying bit of data spreading through the system, just manual exhortations to not respond to the email containing the external address.

Do people know of other instances of a viral spread of data, unaccompanied by code? There are certainly instances of poor auto-complete behavior, such as a mistyped URL leading to the browser forevermore suggesting the bogus site, but none I can think of which spread from one person to another. I suspect the root cause is a poor model for how auto-complete is supposed to work: the developer wants it to be completely automatic and not something the user should have to adjust.

(1): I long ago gave up keeping my email address off of spam lists. As I've used this address since 1996, I am wholly reliant on good filtering.

Friday, February 5, 2010

Apple == A, Plus Four More Letters

What the web needs right now is another blog post about the iPad.

Apple A4 chip No, don't run away! This will be different, I promise. We'll focus on Apple's A4, a custom CPU first used in the iPad. It has been widely assumed that A4 uses a licensed ARM Cortex A8 core. I have no reason to dispute this assertion, it seems like a fine choice. It has also been asserted that because the same core is used in parts from Samsung, TI, and Qualcomm, Apple should not have bothered making its own chip. Today, Gentle Reader, we'll explore that notion a bit.

Apple ASIC Expertise

Apple has a long history of ASIC design. Apple produced custom silicon for various Macintosh models since at least the late 1980s, when they designed the audio chips used in the Quadra 700 and 900 (a chip called "Batman"). Later, Apple designed entire chipsets to interface with the PowerPC 60x bus. Apple licensed a gigabit Ethernet MAC design from Sun, and used it plus IDE controller and other peripherals in chipsets for several Powermac models. With the switch to x86, Apple's efforts became much more constrained. The x86 bus interface is difficult to license, and Intel's own chipsets are quite reasonable. So far as I know, x86 Macintoshes no longer use custom Apple ASICs.

Custom chip design isn't a radical departure for Apple.

With that as background, what might Apple have done in the A4 chip? I have absolutely no inside information about the iPad or A4 processor, I'm going to make stuff up because its fun to speculate.

Graphics & OpenCL

PowerVR SGX CPU Overview Apple holds a nearly 10% stake in Imagination Technologies Group, which designs the PowerVR graphics accelerator and other IP relating to massively threaded processing. Apple uses their PowerVR SGX 535 in iPhone 3GS, and used various PowerVR graphics in earlier iPhone models. The A4 chip will certainly integrate a graphics core from PowerVR. As with essentially all GPU designs today, the PowerVR makes use of multiple, specialized CPU cores. There is relatively little information about its instruction set on the web, ~~only that it is called META MTX and uses 16 bit RISC-ish instruction words~~. Update: PowerVR SGX does not use the META architecture, it has a distinct architecture of its own. Additional information can be downloaded after registration.

Apple has also invested heavily in two relevant technologies: OpenCL and LLVM.

OpenCL allows processing to be distributed across multiple CPUs in the system, even if they have different instruction sets. OpenCL algorithms are written in a language with syntax very similar to C99, and the framework handles the rest.
LLVM is a compiler toolkit, one aspect of which is a machine independent instruction set. Source code can be compiled to the LLVM virtual machine, and from there be translated into the equivalent opcodes for the target CPU. The compilation can be done statically before running it, or by a Just-In-Time compiler while interpreting the LLVM bytecodes.

iPhone applications are compiled to ARM instructions, but it is not much of a stretch to imagine support for sections of LLVM bytecodes as well. If the hardware has sufficient GPU power, the bytecode could be translated to the GPU instruction set and offloaded. Devices with less sophisticated GPUs would use the ARM instead. Apple does not allow iPhone apps to include their own virtual machine in this way, but would be free to provide the VM function as part of the OS.

I suspect this is the most compelling reason for Apple to build its own chip as opposed to buying off the shelf. The rest of the mobile industry is satisfied to offload 3D graphics and video decoding to the GPU. Apple has greater ambitions, and could make use of significantly more GPU pipelines. By controlling the complete platform from CPU to software, Apple can make tradeoffs which are not practical for the rest of the market. For example: a very large GPU plus very fast ARM would generate more heat than can be dissipated in a small form factor like a phone. Apple has the option to dynamically throttle the ARM clock speed in order to open up more thermal envelope for the GPUs, if sufficient OpenCL workload is ready to run. When the GPUs are less busy, the ARM clock speed can be brought back up.

Multi Package Modules

The CPU in the iPhone 3GS is a Samsung S5PC100. This is a multi package module with CPU, I/O chip, and SDRAM sandwiched tightly together. Multi chip modules have been around for a long time, where multiple dies wired together in one big package. The amount of testing which can be done on a raw die is rather limited, so MCM yields suffer as one bad die ruins the whole assembly.

Multi package modules are relatively new: each chip is in its own package, but use very tight pin spacings and do not have a heat spreader. They are soldered together on a small PCB, which in turn has a Ball Grid Array on the bottom with normal pin spacings. Because each chip is packaged separately a full suite of test vectors can be run before the final assembly is put together, improving the yield considerably and lowering the cost of the final product.

If we examine the main board of an iPhone 3GS, the largest component is not the Samsung processor - it is the Flash memory, an MPM containing a number of flash chips. The Samsung CPU in the iPhone 3GS comes in a close second in size, and is also a multi package module with CPU, I/O chip, and SDRAM.

With the A4, Apple will probably have one die containing both CPU and I/O. Samsung uses different I/O chips to tailor their offering to many market segments, which is not a goal for Apple. By arranging the pinout carefully, Apple might be able to make an MPM containing CPU, SDRAM, and Flash, reducing the total board area. Different Flash capacities could be offered by not stuffing portions of the MPM. The iPad itself might not need such an MPM as it is a much larger device, but future iPhones would benefit more.

To be clear: assembling an MPM is not something you can easily do when buying merchant silicon. The pins on one package have to be arranged so as to be easy to route to the pins on the other packages within the MPM.

Wrapping up

I'll say it again: I made this all up. I have no information on the specifics of the A4, just speculation. In the unlikely event that anyone reads this (instead of running away from yet another iPad blog post), don't copy it into Wikipedia as though it were verified information.

What about future iterations? Its tempting to consider a single chip containing the entire iPhone feature set, including radio and wireless networking. The A4 itself clearly doesn't do this, as GSM support is optional in the iPad. I suspect that even in future chips, Apple won't pull in the baseband radio. The front end portions of that chip are rather sensitive to noise, and generally don't work well when integrated in the corner of a gigantic ASIC. Also integrating the radio functionality would make it that much harder to keep up with advancements in wireless networks.

Another future possibility is to use this chip in Apple's other small form factor products, like the Airport Base Stations, Time Capsule, and AppleTV. This is certainly possible, but aside from obvious additional peripherals like SATA I'm not sure it adds many requirements to the chip.

Other articles about A4 you might find interesting:

The New York Times writes about the history of the A4 design team.
Louis Gray writes about Apple's heavy recruiting push to staff up their ASIC team.
Wikipedia already has an article, which will improve over time as more details emerge.

P.S.: While we're at it, the title of this post is a guess about the origins of the "A4" nomenclature: "Apple" is a capital A followed by four more letters.

Wednesday, January 6, 2010

Y2.01K

Y2K was a big deal. A huge amount of money was spent updating older software which stored dates using only two digits. A common solution at the time was to treat dates of 00-09 as post-Y2K, and 10-99 as pre-Y2K. Effectively this only delayed the problem for ten years - our ten years ran out a few days ago. Some examples of failures experienced since Jan 1, 2010:

SpamAssassin rejects mail with a 20XX date.
Symantec Endpoint Protection Manager rejects all updates dated after Dec 31, 2009. As a workaround before a fix is ready, they are releasing updates with a 2009 date.
30 million German debit cards stopped working on January 1.
SAP Spool queue entries are not automatically cleaned up if they expire after 12/31/2009, eventually filling up the disk.
Some Windows Mobile handsets and various Point-of-Sale terminals insist that it is actually 2016. This is an odd symptom, and may not be related to a Y2K fix.

A huge amount of money and effort was expended in the 1990s to upgrade systems for Y2K. When Jan 1 rolled around and civilization did not end, there was much criticism that the whole thing had been hyped by vendors eager to sell new gear. Quite likely we overspent... but I believe the consequences of underspending would have been more expensive to clean up afterwards.

Fortunately, it appears that magic pixie dust is available to fix all Y2.01K problems, according to a press release from BogusTech. Obviously a spoof, its very funny and well worth reading.

Monday, December 28, 2009

Crowdsourcing Backup

Jeff Atwood recently suffered a catastrophic loss of data of his long-running blog Coding Horror. The site was running on a virtual machine, and apparently VM backups at the hosting provider had been routinely failing for years without anybody noticing. Jeff maintained his own backups... within the VM itself, which were lost when the VM was lost. Jeff's story has a happy ending as one of his readers, Carmine Paolino, had a complete archive.

Obviously the happenstance of somebody on the Internet having a complete copy of data important to us does not constitute a practical backup strategy, but it got me to thinking about the idea of crowdsourcing backups. Everybody should have offsite backups, but practically nobody does it. Could a system be designed where each participant wanting to back up their most important data would in return offer a chunk of local disk space to use for storing data for other people?

With terabyte drives becoming common, it seems like many systems have an abundance of disk space which could be better taken advantage of. Perhaps the data you want to be backed up can be broken into chunks and stored in the free space of a number of other backup users, while your drive simultaneously stores their data.

Your data would have to be encrypted, as it will be stored on media controlled by random and potentially untrustworthy people.
A large amount of redundancy would have to be baked in, as people could drop out of the system at any time and take a chunk of stored information away. Many copies of each chunk would be stored in multiple places.
Forward Error Correction would also be good, to further improve survivability in the face of missing data. Recovering most of the chunks would be sufficient to reconstruct the rest.

The practicality of the details aside, with Amazon, RackSpace and others offering cloud storage options, would it even be worthwhile to construct such a crowdsourced system? In 2010, I'm not sure that it is. I suspect this is an idea whose time has come... and gone.

Thursday, December 10, 2009

Untouchable Code

Behold: the Blaupunkt CD50.

Perhaps "behold" is too pretentious for a basic car stereo, but it is the topic of today's screed so I feel a dramatic introduction is called for. Let me call your attention to four buttons on the left side: RDS, AM, FM, and CD-C. They do what you might expect:

RDS - enable decode of station and song identification from an FM signal. I'm not sure why you'd ever disable this.
FM - switch to FM radio.
AM - switch to AM radio.
CD-C - switch to CD Changer mode. Once in CD-C mode, the RDS/FM/AM buttons have no effect until you push CD-C to get back to Radio mode.

This makes perfect sense, right? I mean really, once I'm in the CD player mode I wouldn't expect the buttons from Radio mode to do anything, would I? Yes, this is sarcasm. On the web. Dangerous, I know.

I'd speculate that the fine engineers at Blaupunkt did not actually want the user interface to be this way, and that they would have preferred the FM button to always switch to FM radio. I suspect they were presented with an existing AM/FM radio design which, for whatever reason, they could not modify. Perhaps the CD50 project had a very tight budget, or a narrow market window which didn't allow time to tweak the radio components. Less charitably, perhaps the radio design had degenerated into an unmaintainable mess and any change risked breaking the whole thing. The path of least resistance to get the product out is a mux: you're either in our new mode where we add all the shiny new goodness, or the crufty old mode where we haven't touched anything from the existing design.

The situation of an unmaintainable portion of a system should be familiar to any software developer tasked with working on a large codebase. I suspect the natural entropic state of software is unmaintainability, requiring constant infusions of energy to stave it off a while longer.

So what can we do to ensure systems remain maintainable? Unit testing is frequently suggested as an answer, though I've never been a fan of extensive unit testing. If the target platform is very different from the build system, structuring the code to be able to run unit tests is a non-trivial amount of extra work. However I've recently started working in an environment where development testing is strongly encouraged, and I have to admit it does help in keeping code maintainable as developers come and go. The lowest level unit tests are not terribly useful in this regard; even code in a complete mess will have unit tests. On the other hand, a functional testbench for a module where the interfaces to the rest of the system are mocked out is very helpful. You have a much firmer grasp of how changes you make in the module are going to impact the rest of the system. You also have more hope of being able to reimplement the module, as its interfaces are described by the mock framework. If other portions of the system reach in to the internals of the module without using the interfaces... then the cancer has already metastasized and you're probably doomed.

People say that refactoring early and often will keep code maintainable, but that requires agreement from management to spend development time paying off technical debt without an increase in marketable features. In a product environment I rarely win that argument. However, I've now worked on a complete re-implementation of two different systems where the bug load had simply become impossible due to indecipherable engineering. I suppose that is an extreme form of refactoring: extract the best bits of the old system, and throw out the rest.

Do you have tips for keeping code maintainable across multiple generations of products? This site uses Disqus for comments. You can comment anonymously if you wish, or use an existing identity like Twitter, Facebook, or any OpenID provider.

Thursday, December 3, 2009

Memory Matters

PowerPC I once worked on a system where one module was developed outside using Linux/x86 systems, brought in-house, and compiled for Linux/PowerPC. We thought we had been careful in the specifications: avoid endianness assumptions, limit memory footprint, and assume a hefty derating for the slower PowerPC used in the real system. Things looked good in initial testing, but when we started internal dogfooding the PowerPC performance dropped off the proverbial cliff. An operation that took 100 msec on the x86 development system and 300 msec during initial PowerPC testing regressed to an astonishing 45 seconds in the dogfood deployment.

The cause of this disparity was the data cache. For reasons unclear this code iterated through its configuration many, many times. On x86 the various levels of D$ comprise several megabytes, but the PowerPC had only 16K. As the dogfooding progressed and the config grew it resulted in unbelievable cache thrashing and a 2.5 order of magnitude performance drop.

Several years ago Ulrich Drepper wrote an excellent paper about all things related to memory in modern system architectures, especially x86 but relevant everywhere. It is a long read, but very worthwhile. The complete paper is available as a PDF from his site, and it was also serialized in articles on LWN.

Introduction
CPU caches
Virtual memory
NUMA systems - local versus remote references
What programmers can do - cache optimization
What programmers can do - multi-threaded optimizations
Memory performance tools
Future technologies
Appendices and bibliography

I downloaded the PDF and read it over the course of a few weeks. I strongly recommend this paper, the information content is very high.

Tuesday, November 10, 2009

Ethernet Integrity, or the Lack Thereof

Have you heard any variation of this claim?

We don't need our own integrity check. The TCP checksum is pretty weak, but Ethernet uses a ludicrously strong CRC. Even if you don't trust the TCP checksum, Ethernet will detect any errors.

Let's dig into this a bit, shall we?

Ethernet switch diagram A modern switch fabric chip is designed for both L2 ethernet switching and L3 IP routing. The additional logic for IP routing adds relatively little area in modern silicon technologies, while not having a routing capability would put them at a competitive disadvantage. Essentially all ethernet fabric chips, even those inside relatively cheap L2 switches, have the design features to route IPv4 traffic to at least a basic degree.

When a packet arrives at the input port (A) its CRC will be checked and the packet discarded if corrupt. If the packet is destined to the router's MAC address, its destination IP address will be looked up for L3 routing (C). An L3 router modifies the packet as part of its function, by decrementing the IP TTL and replacing the L2 destination with that of the next hop. Therefore a fresh CRC has to be regenerated at egress.

Even if the packet is to be switched at L2 (B), there are cases where the packet is modified. For example server machines and switch uplinks often handle multiple vlans, so their ports will be configured for tagging (D). Addition of the vlan tag requires the packet CRC to be recalculated on egress (E).

The point of this description? There are numerous cases at both L2 and L3 where a packet CRC cannot be preserved through the switch and will need to be regenerated at egress. ASIC designers hate special cases, as they add logic and test cases to the design. Because there are cases where the CRC must be regenerated, modern switch fabrics always regenerate the CRC at egress. Even if the packet has not been modified, even if the ingress CRC could have been preserved, it is discarded at ingress and regenerated at egress.

It bears repeating that this is a function of the chip, not the specific product. Even the tiny ethernet switches sold for practically nothing at retail use chips which contain basic vlan tagging and IP routing features (even if that product doesn't use them), and regenerate the CRC on every packet. The fabric chip they use wasn't specifically designed for such low cost switches, there is not enough profit to justify the effort. In addition to simple L2 switches those chips can be used to build NAT appliances, as the ethernet fan-out for small wireless access points, in DSL and cable routers, for low end WAN routers, etc. When only basic L2 switching is desired these fabrics can function completely standalone without a management CPU, reducing BOM cost to the bare minimum. Addition of a CPU allows the basic L3 functions to be used in the more featureful (but still low end) products.

Impact

What does this mean? The internal memories and logic paths within the switch are not covered by the ethernet CRC, it does not provide end-to-end protection. The switch might implement ECC over the whole path, but this is not common. The packet buffers are generally large enough to justify ECC, but miscellaneous FIFOs are more likely to have simple parity and logic elements often have no protection at all. It only takes one soft error to corrupt the packet contents, and then a fresh CRC will be calculated over the corrupted data.

If you care about the data you send over the network, you should include your own integrity check at the application level. This is another good argument for using SSL: not only do you protect privacy by encrypting the data, you also get a strong end-to-end integrity check.

Tuesday, November 3, 2009

Delivering Apple TV Around the Planet

I am a long time Macintosh user, and though I don't generally write about Apple products I'm going to branch out a bit this time. These observations are rather obvious, but I'm going to do it anyway. Pppffftt.

Apple is reportedly pitching a $30/month subscription for TV via iTunes to content producers, to be offered sometime in early 2010.
According to the iFixit teardown, the LCD in the recently announced 27" iMac is an In Plane Switching (IPS) design by LG. This provides a very wide viewing angle compared to the Twisted Nematic (TN) LCDs typically used in personal computers. IPS LCDs are generally used in television sets, as one wants to see the TV from the entire width of the couch not just one narrow section.
The iMac LCD has an unusual native resolution, 2560x1440. It is exactly double in each dimension as the AppleTV, making it straightforward to optimize content for both devices.
There is no TV tuner in either AppleTV or the iMac, though of course external USB tuners for broadcast HDTV are available. Nonetheless, HDTV broadcast is yesterday's technology.
Apple is spending $1 billion to build a massive datacenter in Maiden, North Carolina. Certainly Apple's current MobileMe and iTunes services demand significant capacity to host them, but with such a large investment it seems likely that Apple is looking to expand into new areas and not just continue its existing services.
Altogether Apple is budgeting $1.9 billion for capital expenditures in fiscal 2010, an increase from the $1.1 billion in 2009.

AppleTV So, Apple is preparing to do to video distribution what it did in music, providing a complete solution including content, delivery, and customer device, right? Apple already provides selected video content on iTunes, and can expand from there.

Though there are a lot of clues, there is a big piece missing. If one is looking to spend billions to develop the infrastructure to become a big player in video distribution, a single massive datacenter on the US east coast is not the way to do it. It leaves you beholden to others to carry the bits around the planet, which becomes the dominant cost of the business.

The modern Internet consists of a series of interconnected backbone networks, generally referred to as "Autonomous Systems" following the BGP terminology. Packets leave the datacenter and traverse the backbones on their way to their destination. The final telecom facility before the packet reaches its destination is called the Edge Point of Presence. Delivering media streams all the way across the internet to thousands of individual subscribers incurs significant bandwidth costs. Content Delivery Networks reduce the bandwidth costs by scattering small caching servers across thousands of POPs, serving each subscriber from the closest cache.

Apple currently relies on both Akamai and Limelight to deliver its iTunes content. Apple could certainly expand its current relationship with those vendors to carry substantially more video traffic... but for the kind of investment they are making, it seems like the money would be better spent scattering a number of smaller facilities across the planet. Peer to peer between end users seems like an ideal way to distribute video by offloading the bandwidth costs, but in practice it has not worked very well. Content distribution at scale is something which requires significant investment.

Apple has $34 billion in liquid assets, and an economic recession is a great time to buy. I'll be watching for indications that Apple is bringing this solution in house either by building lots of smaller facilities, perhaps with their own fiber network between them, or by building or acquiring a CDN of their own. Massive investment at the endpoints while remaining completely dependent on others for the connection between them doesn't make sense.

Tuesday, September 8, 2009

Infinite Arrays of Tweeples

Twitter followers list Twitter is somewhat out of the range of topics I normally cover here, but I promise we'll come around to a software development angle by the end of this post.

When you follow someone on twitter, you appear in their followers list and they appear in your following list. New entries appear at the top, so the newest follower will be the first entry in your list. Recently I noticed an exception to this sort ordering, when someone who had been following a long time ago and later unfollowed decided to follow again. I received the notification email from twitter, yet he wasn't at the top of the list of followers. Instead he appeared much further down in the listing, off the first page. That is where he was when he followed me the first time, many months ago. This entry disappeared when he unfollowed, and when he re-followed he ended up back in that same place. Why would that be?

Possibility #1: Timestamp

The first possibility, and more likely the correct one, is that twitter tracks the timestamp of every new follow and chooses not to update it on a subsequent refollow. No matter how many times you have followed/unfollowed, you retain the timestamp of the very first time and will show up in the followers list at that position.

If an unfollow+refollow was sufficient to move you back up to the top of the list of followers, the bots would do it all the time to make it more likely you'd follow them back. Yet this is a boring root cause. So lets consider a second possibility which is more illustrative to software development.

Possibility #2: Array Deletion

Twitter operates at a scale where performance optimization is essential. If they are not cognizant about performance the wheels fly off and users start to see the Fail Whale. An area of particular importance is the list of followers, as the backend infrastructure has to traverse it for every tweet. It is possible that twitter implemented the followers list as an array in memory instead of a linked list, presumably to get better locality. The classic drawback of an array is deletion: you cannot delete an element from the middle without moving all subsequent elements into the hole thus created. To avoid this compaction a "deleted" or "active" bit is commonly kept for each element, allowing deleted entries to be left in place but skipped without processing.

When Scobleizer unfollowed everyone it would have resulted in holes in the followers list of 106,000 different accounts, entries with the deleted bit set.

I suspect that Twitter does not immediately compact these arrays, so long as the ratio of holes/filled entries is tolerable. When Scobleizer decided to re-follow me the twitter backend located the earlier, deleted entry and flipped the bit back to active.

Thus the newly restored entry will re-appear in the followers list, but not as the top most entry. It will re-appear at its existing position within the array. This, or a similar implementation choice of retaining deleted entries in some way, could be why re-follows do not appear at the top of the list.

The Moral of the Story

Optimization is fine, and absolutely crucial to function at Twitter scale, but one must to be careful when an optimization changes user-visible behavior. This is particularly true for social media, where we're explicitly conversing with other humans and ascribe human motivations to their actions. Twitter's handling of deleted and re-added follows can cause considerable consternation, because to the casual observer it appears the person followed but then immediately unfollowed. It can seem judgmental.

Of course, I am most likely completely wrong about Twitter's implementation using arrays. It wouldn't be the first time I've made a complete fool out of myself in a blog post. Its cathartic, in a way. Perhaps I'll do it more often.

Tuesday, July 14, 2009

DRY and the DMV

The Pragmatic Programmer is one of the best books available concerning the development of quality software. It is structured as a series of tips, with illustrative examples and the occasional horror story. One of the first tips is the DRY principle:

DRY - Don't Repeat Yourself
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

DRY is often misinterpreted to mean simply that code should not be duplicated, but it is somewhat more subtle: don't duplicate state. If you have multiple different places in the code which keep state about an aspect of the system, and all places have to have the same content at all times for the system to work properly, then you have made maintenance of the system harder than it needs to be. You'll have to debug cases where the representations fall out of sync, and all such places must be updated at the same time when code changes are made. The Pragmatic Programmers extend the DRY principle outside of the code itself to include database schemes, documentation, and build systems. Everything should have one authoritative source.

This brings us to the Department of Motor Vehicles, though the Gentle Reader might at first not see the connection. I received a form in the mail to renew my driver's license, which I promptly signed and sent back. The new license arrived in due course, and things were fine until a few weeks later when I noticed the address was incorrect. The old license is correct, the new one is wrong.

I've no idea whether the address was correct on the renewal form, I did not check it. Apparently I should have, but I didn't bother - it hadn't changed. At some stage of the renewal process, a single digit was altered in a subtle way.

Why was it even possible for the address to be changed in the renewal process? Here we can only speculate. The DMV does need a procedure to update an address as part of a license renewal, because sometimes people supply a new one on the form. I'll speculate that the DMV, either via OCR or manual typing, re-enters the address in all cases and not just if the form supplied a change. This procedure depends on the original address to be faithfully reproduced in cases where it wasn't supposed to change. In my case, either due to OCR glitch or typing error, a digit changed resulting in the new license being printed with an incorrect address.

I believe this is an example of the consequences of a violation of the DRY principle. The same state - my address - exists in two places: in the DMV database and on the form. Those two pieces of state are supposed to be the same, indeed must be the same for the process to work correctly, but errors can easily occur which allow their contents to get out of sync.

A corollary lesson in this situation: if state isn't supposed to change, don't change it. If the form does not indicate a change of address, the authoritative state is in the database and the form contents should be ignored.

Aftereffects

I've already received a jury summons at the incorrect address, which the post office helpfully delivered to me anyway. Even after correcting the address I suspect I will receive a summons twice as often from now on. That will form the basis of a future blog post to illustrate the importance of duplicate suppression in databases, I suppose.

I can change my address back by submitting a form to the DMV, but issuing a new license with the correct address will be at my own cost. This is part of the price of modern life, I suppose.