Coding Relic

Thursday, February 9, 2012

ATA Commands in Python

In software development sometimes you spend time on an implementation which you are unreasonably proud of, but ultimately decide not to use in the product. This is one such story.

I needed to retrieve information from an attached disk, such as its model and serial number. There are commands which can do this, like hdparm, sdparm, and smartctl, but initially I tried to avoid building in a dependency on any such tools by interrogating the hard drive directly. In pure Python.

Snicker all you want, it did work. My first implementation used an older Linux API to retrieve this information, the HDIO_GET_IDENTITY ioctl. This ioctl maps more or less directly to an ATA IDENTIFY or SCSI INQUIRY command sent to the drive. The implementation uses the Python struct module to define the layout of the buffer passed along with the ioctl.

import fcntl
import struct

def GetDriveId(dev):
  """Return information from interrogating the drive.

  This routine issues a HDIO_GET_IDENTITY ioctl to a block device,
  which only root can do.

  Args:
    dev: name of the device, such as 'sda' or '/dev/sda'

  Returns:
    (serial_number, fw_version, model) as strings
  """
  # from /usr/include/linux/hdreg.h, struct hd_driveid
  # 10H = misc stuff, mostly deprecated
  # 20s = serial_no
  # 3H  = misc stuff
  # 8s  = fw_rev
  # 40s = model
  # ... plus a bunch more stuff we don't care about.
  struct_hd_driveid = '@ 10H 20s 3H 8s 40s'
  HDIO_GET_IDENTITY = 0x030d
  if dev[0] != '/':
    dev = '/dev/' + dev
  with open(dev, 'r') as fd:
    buf = fcntl.ioctl(fd, HDIO_GET_IDENTITY, ' ' * 512)
    fields = struct.unpack_from(struct_hd_driveid, buf)
    serial_no = fields[10].strip()
    fw_rev = fields[14].strip()
    model = fields[15].strip()
    return (serial_no, fw_rev, model)

No no wait, stop snickering, it does work! It has to run as root, which is one reason why I eventually abandoned this approach.

$ sudo python hdio.py
('5RY0N6BD', '3.ADA', 'ST3250310AS')
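The index arithmetic behind fields[10], fields[14] and fields[15] can be checked without a disk and without root. What follows is a minimal sketch in Python 3, assuming a synthetic buffer in place of the real ioctl result. The names parse_identity and fake are made up for illustration, and the strings are borrowed from the sample output above.

```python
import struct

# Same layout string as GetDriveId: a prefix of struct hd_driveid.
STRUCT_HD_DRIVEID = '@ 10H 20s 3H 8s 40s'

def parse_identity(buf):
    """Unpack (serial_no, fw_rev, model) from an hd_driveid-style buffer."""
    fields = struct.unpack_from(STRUCT_HD_DRIVEID, buf)
    # 10H fills indices 0-9, so the 20s serial number is index 10.
    # 3H fills indices 11-13, so fw_rev is index 14 and model is index 15.
    picked = (fields[10], fields[14], fields[15])
    return tuple(f.decode('ascii').strip() for f in picked)

# Synthetic stand-in for the ioctl result (hypothetical data, padded to 512).
values = [0] * 10 + [b'5RY0N6BD'.ljust(20)] + [0] * 3
values += [b'3.ADA'.ljust(8), b'ST3250310AS'.ljust(40)]
fake = struct.pack(STRUCT_HD_DRIVEID, *values).ljust(512, b'\0')

print(struct.calcsize(STRUCT_HD_DRIVEID))  # bytes covered by the format
print(parse_identity(fake))
```

Only the first 94 bytes of the 512-byte buffer are described by the format string; unpack_from ignores the rest.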

HDIO_GET_IDENTITY is deprecated in Linux 2.6, and logs a message saying that sg_io should be used instead. sg_io is an API to send SCSI commands to a device. sg_io also didn't require my entire Python process to run as root, I'd "only" have to give it CAP_SYS_RAWIO. So of course I changed the implementation... still in Python. Stop snickering.

import ctypes
import fcntl

class AtaCmd(ctypes.Structure):
  """ATA Command Pass-Through
     http://www.t10.org/ftp/t10/document.04/04-262r8.pdf"""

  _fields_ = [
      ('opcode', ctypes.c_ubyte),
      ('protocol', ctypes.c_ubyte),
      ('flags', ctypes.c_ubyte),
      ('features', ctypes.c_ubyte),
      ('sector_count', ctypes.c_ubyte),
      ('lba_low', ctypes.c_ubyte),
      ('lba_mid', ctypes.c_ubyte),
      ('lba_high', ctypes.c_ubyte),
      ('device', ctypes.c_ubyte),
      ('command', ctypes.c_ubyte),
      ('reserved', ctypes.c_ubyte),
      ('control', ctypes.c_ubyte) ]


class SgioHdr(ctypes.Structure):
  """<scsi/sg.h> sg_io_hdr_t."""

  _fields_ = [
      ('interface_id', ctypes.c_int),
      ('dxfer_direction', ctypes.c_int),
      ('cmd_len', ctypes.c_ubyte),
      ('mx_sb_len', ctypes.c_ubyte),
      ('iovec_count', ctypes.c_ushort),
      ('dxfer_len', ctypes.c_uint),
      ('dxferp', ctypes.c_void_p),
      ('cmdp', ctypes.c_void_p),
      ('sbp', ctypes.c_void_p),
      ('timeout', ctypes.c_uint),
      ('flags', ctypes.c_uint),
      ('pack_id', ctypes.c_int),
      ('usr_ptr', ctypes.c_void_p),
      ('status', ctypes.c_ubyte),
      ('masked_status', ctypes.c_ubyte),
      ('msg_status', ctypes.c_ubyte),
      ('sb_len_wr', ctypes.c_ubyte),
      ('host_status', ctypes.c_ushort),
      ('driver_status', ctypes.c_ushort),
      ('resid', ctypes.c_int),
      ('duration', ctypes.c_uint),
      ('info', ctypes.c_uint)]

def SwapString(str):
  """Swap 16 bit words within a string.

  String data from an ATA IDENTIFY appears byteswapped, even on little-endian
  architectures. I don't know why. Other disk utilities I've looked at also
  byte-swap strings, and contain comments that this needs to be done on all
  platforms not just big-endian ones. So... yeah.
  """
  s = []
  for x in range(0, len(str) - 1, 2):
    s.append(str[x+1])
    s.append(str[x])
  return ''.join(s).strip()

def GetDriveIdSgIo(dev):
  """Return information from interrogating the drive.

  This routine issues a SG_IO ioctl to a block device, which
  requires either root privileges or the CAP_SYS_RAWIO capability.

  Args:
    dev: name of the device, such as 'sda' or '/dev/sda'

  Returns:
    (serial_number, fw_version, model) as strings
  """

  if dev[0] != '/':
    dev = '/dev/' + dev
  ata_cmd = AtaCmd(opcode=0xa1,  # ATA PASS-THROUGH (12)
                   protocol=4<<1,  # PIO Data-In
                   # flags field
                   # OFF_LINE = 0 (0 seconds offline)
                   # CK_COND = 1 (copy sense data in response)
                   # T_DIR = 1 (transfer from the ATA device)
                   # BYT_BLOK = 1 (length is in blocks, not bytes)
                   # T_LENGTH = 2 (transfer length in the SECTOR_COUNT field)
                   flags=0x2e,
                   features=0, sector_count=0,
                   lba_low=0, lba_mid=0, lba_high=0,
                   device=0,
                   command=0xec,  # IDENTIFY
                   reserved=0, control=0)
  ASCII_S = 83
  SG_DXFER_FROM_DEV = -3
  sense = ctypes.c_buffer(64)
  identify = ctypes.c_buffer(512)
  sgio = SgioHdr(interface_id=ASCII_S, dxfer_direction=SG_DXFER_FROM_DEV,
                 cmd_len=ctypes.sizeof(ata_cmd),
                 mx_sb_len=ctypes.sizeof(sense), iovec_count=0,
                 dxfer_len=ctypes.sizeof(identify),
                 dxferp=ctypes.cast(identify, ctypes.c_void_p),
                 cmdp=ctypes.addressof(ata_cmd),
                 sbp=ctypes.cast(sense, ctypes.c_void_p), timeout=3000,
                 flags=0, pack_id=0, usr_ptr=None, status=0, masked_status=0,
                 msg_status=0, sb_len_wr=0, host_status=0, driver_status=0,
                 resid=0, duration=0, info=0)
  SG_IO = 0x2285  # <scsi/sg.h>
  with open(dev, 'r') as fd:
    if fcntl.ioctl(fd, SG_IO, ctypes.addressof(sgio)) != 0:
      print "fcntl failed"
      return None
    if ord(sense[0]) != 0x72 or ord(sense[8]) != 0x9 or ord(sense[9]) != 0xc:
      return None
    # IDENTIFY format as defined on pg 91 of
    # http://t13.org/Documents/UploadedDocuments/docs2006/D1699r3f-ATA8-ACS.pdf
    serial_no = SwapString(identify[20:40])
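    # fw_rev is words 23-26 (bytes 46-53), model is words 27-46 (bytes 54-93).
    # Python slices exclude the end index, so the slices end at 54 and 94.
    fw_rev = SwapString(identify[46:54])
    model = SwapString(identify[54:94])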
    return (serial_no, fw_rev, model)

For the unbelievers out there, this one works too.

$ sudo python sgio.py
('5RY0N6BD', '3.ADA', 'ST3250310AS')
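The string handling can be exercised separately from the ioctl. Here is a minimal sketch in Python 3, assuming the IDENTIFY layout of serial number in words 10-19, firmware revision in words 23-26 and model in words 27-46, with a synthetic 512-byte buffer instead of a real drive response. The helpers swap_words, ata_string and put are hypothetical names used only for this illustration.

```python
def swap_words(raw):
    """Swap the two bytes of every 16-bit word (raw must have even length)."""
    out = bytearray(len(raw))
    out[0::2] = raw[1::2]
    out[1::2] = raw[0::2]
    return bytes(out)

def ata_string(identify, first_word, last_word):
    """Extract an ASCII field given inclusive 16-bit word offsets."""
    raw = identify[2 * first_word:2 * (last_word + 1)]
    return swap_words(raw).decode('ascii').strip()

def put(buf, first_word, last_word, text):
    """Test helper: store text the way the drive would, space padded."""
    width = 2 * (last_word - first_word + 1)
    start = 2 * first_word
    buf[start:start + width] = swap_words(text.ljust(width).encode('ascii'))

# Synthetic IDENTIFY data (hypothetical, strings from the sample output).
identify = bytearray(512)
put(identify, 10, 19, '5RY0N6BD')
put(identify, 23, 26, '3.ADA')
put(identify, 27, 46, 'ST3250310AS')

print(ata_string(identify, 10, 19),
      ata_string(identify, 23, 26),
      ata_string(identify, 27, 46))
```

Working in word offsets and converting to bytes in one place keeps the slice bounds from drifting by one.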

So, there you go. HDIO_GET_IDENTITY and SG_IO implemented in pure Python. They do work, but in the process of working on this and reading existing code it became clear that low level ATA handling is fraught with peril. The disk industry has been iterating this interface for decades, and there is a ton of gear out there that made questionable choices in how to interpret the spec. Most of the code in existing utilities is not to implement the base operations but instead to handle the quirks from various manufacturers. I decided that I didn't want to go down that path, and will instead rely on forking sdparm and smartctl as needed.

I'll just leave this post here for search engines to find. I'm sure there is a ton of demand for this information.

Stop snickering.

Thursday, February 2, 2012

ISC DHCP VIVO config

In addition to its role in assigning IPv4 addresses, DHCP has an options mechanism to send other bits of data between client and server. For example there are options to provide the DNS and NTP server addresses along with the client's assigned IP address. In its request, the client lists the options it would like to receive. The server fills in whichever ones it can.

DHCP has always allowed for vendor extensions of the available options, inheriting this support from the earlier BOOTP protocol. The original vendor extension mechanism was very simple: use option 43, and put whatever you like there. A mechanism was provided to encode multiple sub-options within the option 43 data, but it made no attempt to coordinate between different vendors' use of the space. Vendors immediately began creating conflicts, using the same numeric codes for different meanings simultaneously. This led to a variety of heroic hacks in which the client would populate its request with magic values which the server would use to figure out the set of vendor options to supply.

DHCP6 defined a more complex encoding, where each vendor includes their unique IANA Enterprise Number as part of its option. Options from different vendors can be accommodated simultaneously. This Vendor-Identifying Vendor Options (VIVO) encoding was also added back to DHCP4 as options 124 and 125. DHCP4 thus has two separate vendor option mechanisms in common use.

ISC DHCPd

The ISC DHCP server can support VIVO options using several different mechanisms. The first, and so far as I can tell most common, is to specify the byte-level payload. The administrator pores over RFCs and vendor documentation to come up with the magic string of bytes to send and types it into dhcpd.conf, where it immediately becomes magic voodoo that everyone is afraid to touch for fear of breaking something.

Avoiding magic byte strings by specifying the format of the options is more difficult to get working, but easier to maintain and understand. We'll consider an example here.

Vendor: Frobozzco
IANA Enterprise Number (IEN): 12345
Code #1: a text string containing the location within the maze.
Code #2: an integer describing the percentage likelihood of being eaten by a grue.
In practice this is always 100%, which many clients simply hard-code.

To implement this in the DHCP config, we define an option space for frobozzco. This just creates a namespace; we bind that namespace to the numeric IEN later. We have to tell DHCP how wide to make the code and length fields. DHCP4 usually uses 1 byte fields, while DHCP6 generally uses 2 bytes. Most of the time vendors don't specify the width they use, and if so you should assume the normal sizes for the protocol. The example below comes from a DHCP6 config, so the code and length are both declared as two bytes. After we've declared all of the option codes, we bind the frobozzco option space to its numeric IANA Enterprise Number. Use of IEN for DHCP is called the Vendor Specific Information Option, so the syntax in the DHCP configuration labels this vsio.{option space name}

option space frobozzco code width 2 length width 2;
option frobozzco.maze-location code 1 = text;
option frobozzco.grue-probability code 2 = uint8;
option vsio.frobozzco code 12345 = encapsulate frobozzco;
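To make the widths concrete, here is a minimal Python sketch of the bytes this declaration describes, assuming the usual DHCP6 layout of a 4 byte enterprise number followed by sub-options with 2 byte code and 2 byte length fields. It builds only the body of the option, not the outer option header, and the function names and the maze location value are made up for illustration.

```python
import struct

def encode_suboption(code, payload):
    """One sub-option: 2-byte code, 2-byte length, then the payload."""
    return struct.pack('!HH', code, len(payload)) + payload

def encode_vendor_opts(enterprise_number, suboptions):
    """Option body: 4-byte enterprise number, then each sub-option."""
    body = struct.pack('!I', enterprise_number)
    for code, payload in suboptions:
        body += encode_suboption(code, payload)
    return body

def decode_suboptions(data):
    """Walk code/length/payload triples, yielding (code, payload)."""
    offset = 0
    while offset < len(data):
        code, length = struct.unpack_from('!HH', data, offset)
        offset += 4
        yield code, data[offset:offset + length]
        offset += length

wire = encode_vendor_opts(12345, [
    (1, b'twisty passage'),        # maze-location (text), hypothetical value
    (2, struct.pack('!B', 100)),   # grue-probability (uint8)
])
print(list(decode_suboptions(wire[4:])))
```

With 1 byte widths, as is typical for DHCP4, the '!HH' format would become '!BB' and every sub-option would shrink by two bytes.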

Owing to the long and sordid history of numbering conflicts, most vendor extensions define a secret handshake. The client inserts a specific value into a field in its request to trigger the server to respond with the options for that vendor. Frobozzco has decreed that clients should send the string "look north" as a vendor-class option in the request. A DHCP6 vendor-class consists of the vendor's IEN followed by the content. In our case the content consists of another two byte length field, followed by the string. ISC DHCP 4.x doesn't define a type for handling vendor-class in the config but we can construct one using a record, which is a collection of fields defined inside braces.

option dhcp6.vendor-class code 16 = {integer 32, integer 16, string};

# Frobozzco IEN, length=10 bytes, content=look north
send dhcp6.vendor-class 12345 10 "look north";

Finally, we have to provide a script for dhclient to run to handle the received options. We'll get to the client script a bit later, for now just assume it is in /usr/local/sbin/dhclient-script. Putting it all together, the dhclient6.conf should look like this.

script "/usr/local/sbin/dhclient-script";

option space frobozzco code width 2 length width 2;
option frobozzco.maze-location code 1 = text;
option frobozzco.grue-probability code 2 = uint8;
option vsio.frobozzco code 12345 = encapsulate frobozzco;

option dhcp6.vendor-class code 16 = {integer 32, integer 16, string};

interface "eth0" {
    also request dhcp6.vendor-opts;
    send dhcp6.vendor-class 12345 10 "look north";
}
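The three values in the send statement map onto the record fields in order. As a minimal sketch, assuming network byte order and that the 16 bit field counts only the bytes of the string, the payload can be reproduced in Python (encode_vendor_class is a hypothetical helper name):

```python
import struct

def encode_vendor_class(enterprise_number, content):
    """{integer 32, integer 16, string}: IEN, content length, content."""
    header = struct.pack('!IH', enterprise_number, len(content))
    return header + content

payload = encode_vendor_class(12345, b'look north')
ien, length = struct.unpack_from('!IH', payload)
print(ien, length, payload[6:])
```

Computing the length from the string rather than typing it by hand is one way to keep the two from drifting apart when the handshake string changes.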

dhclient-script

On the client we also must provide the script for dhclient to run. The OS vendor will have provided one, often in /sbin or /usr/sbin. We'll copy it, and add handling.

dhclient passes in environment variables for each DHCP option. The name of the variable is "new_<option space name>_<option name>". For the example config above, we'd define a shell script function to write our two options to files in /tmp.

make_frobozzco_files() {
  # -p avoids an error when the directory is left over from an earlier lease
  mkdir -p /tmp/frobozzco
  if [ "x${new_frobozzco_maze_location}" != x ] ; then
    echo "${new_frobozzco_maze_location}" > /tmp/frobozzco/maze_location
  fi
  if [ "x${new_frobozzco_grue_probability}" != x ] ; then
    echo "${new_frobozzco_grue_probability}" > /tmp/frobozzco/grue_probability
  fi
}

The dhclient-script provided with the OS will have handling for DNS nameservers. Adding a call to make_frobozzco_files at the same points in the script which handle /etc/resolv.conf is a reasonable approach to take.
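The variable naming can be sketched in a few lines of Python. This assumes, based on the names used in the shell function above, that hyphens in the option name become underscores. dhclient_env_name is a hypothetical helper, not part of dhclient.

```python
def dhclient_env_name(option_space, option_name):
    """Build the environment variable name for an option in an option space."""
    return 'new_%s_%s' % (option_space, option_name.replace('-', '_'))

for option in ('maze-location', 'grue-probability'):
    print(dhclient_env_name('frobozzco', option))
```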

I'm mostly blogging this for my own future use, to be able to find how to do something I remember doing before. There you go, future me.

Tuesday, January 31, 2012

certdata.txt

When building the software for a new device there are a ton of things which need to go into the filesystem: binaries, libraries, device nodes, a bunch of configuration files, etc. One of the chunks of essential data is a list of trusted certificate authorities for libssl, commonly stored in /etc/ssl/ca_certificates.crt. It's common practice to grab a list of certificates from mozilla.org, and there are various Perl and Python scripts floating around in the search engines to assemble this list into a PEM file suitable for libssl.

2011 was not a good year for certificate authorities. DigiNotar was seized by the Dutch government after it became clear they had been thoroughly breached and generated fraudulent certificates for many large domains. Several Comodo resellers were similarly compromised and generated bogus certs for some of the same sites. Browser makers responded by encoding a list of toxic certificates into the browser, to reject any certificate signed by them.

Encoding a list of toxic certificates is the key phrase in that paragraph. As of 2011, Mozilla's certdata.txt contains both trusted CAs and absolutely untrustworthy, revoked CAs. There is metadata in the entry describing how it should be treated, but several of the scripts floating around grab everything listed in certdata.txt and put it in the PEM file. This is disastrous.

The code to search for is CKT_NSS_NOT_TRUSTED. The utility you are using should check for that type, and skip it. If there is no handling for CKT_NSS_NOT_TRUSTED, then the utility you are using is absolutely broken. Don't use it. I know of at least two which handle this correctly:

Adam Langley's extract-nss-root-certs, written in Go. Read his announcement for more information.
OpenSUSE's extractcerts.pl, written in Perl
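The kind of check to look for can be sketched in Python. This is a simplified illustration, not how either tool above works: it assumes the attribute/type/value line layout of certdata.txt, runs on a made-up two-entry sample rather than the real file, and identifies entries by label only, where real tools match trust records to certificates more carefully.

```python
# Made-up sample in the style of certdata.txt trust records.
SAMPLE = '''
CKA_CLASS CK_OBJECT_CLASS CKO_NSS_TRUST
CKA_LABEL UTF8 "Example Good Root"
CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_TRUSTED_DELEGATOR

CKA_CLASS CK_OBJECT_CLASS CKO_NSS_TRUST
CKA_LABEL UTF8 "Example Revoked Root"
CKA_TRUST_SERVER_AUTH CK_TRUST CKT_NSS_NOT_TRUSTED
'''

def untrusted_labels(certdata_text):
    """Return labels of trust records marked CKT_NSS_NOT_TRUSTED."""
    untrusted = set()
    in_trust_object = False
    label = None
    for line in certdata_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank lines and anything not attribute/type/value
        attr, _type, value = parts
        value = value.strip()
        if attr == 'CKA_CLASS':
            # A CKA_CLASS line starts a new object.
            in_trust_object = (value == 'CKO_NSS_TRUST')
            label = None
        elif attr == 'CKA_LABEL':
            label = value.strip('"')
        elif (in_trust_object and attr.startswith('CKA_TRUST_')
              and value == 'CKT_NSS_NOT_TRUSTED'):
            untrusted.add(label)
    return untrusted

print(untrusted_labels(SAMPLE))
```

A script building the PEM file would skip any certificate whose trust record lands in this set.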

Wednesday, January 18, 2012

Offense

SOPA isn't dead. It hasn't been defeated. It hasn't been stopped. It's just regrouping.

Their main mistake was in allowing it to become publicly known too long before a decisive vote. Its backers will try again, next time ramming it through in the dead of night. They'll give it a scary title, as anything can be justified if the title of the bill is scary enough.

Bills like SOPA are an attempt to legislate a return to media economics the way it used to be, where the sheer cost of distributing content formed a high barrier to entry. It's the economics of scarcity. Better yet, the law would require someone else to pay the cost of creating this scarcity. If the cost of any infringement, intentional or not, third party or first party, can be made so overwhelming as to be ruinous (and incidentally decoupled from any notion of the actual damage from the infringement), then cheap distribution via the Internet can be made expensive again. We can get back to the cozy media business of prior decades.

It's time to stop playing defense, desperately trying to stop each of these bills.

It's time to start playing offense.

The workings of government are obscure and impenetrable. There are reams of data produced in the form of minutes, committee reports, the Federal Register, and other minutiae, but the whole remains an opaque mass. Lobbyists and political operatives thrive in this environment, as they understand more about the mechanisms by which it operates. Yet one of the recent core competencies of the technology industry is Big Data. There are conclusions which can be drawn from trends within the dataset without having to semantically understand all of it.

I have to believe there are things the tech industry can do beyond simply increasing lobbying budgets.

Friday, January 13, 2012

Merchant Silicon

Last week Greg Ferro wrote about the use of merchant silicon in networking products. I'd like to share some thoughts on the topic.

Chip cost

We know the story by now: by selling chips into products from multiple networking companies, commodity chips sell in large volume and benefit from larger discounts. This is a compelling factor in the low end of the switching market, where margins are thin and a primary selling point is price.

Yet low price of the switch silicon is not a decisive factor in the midrange and high end of the switching market, where products are more featureful and sell at higher prices. The price of those products is not based on the cost of materials, it's based on what the market will bear. The market has traditionally borne a lot: Cisco's profit margins in these segments have been legendary for a decade.

In my experience, chip price was not a decisive factor in the wholesale move to merchant silicon.

Non-Recurring Engineering (NRE) cost

Say it costs $10 million to design a chipset for a high end switch, and the resulting set of silicon costs $500 in the expected volumes. If that high end switch sells 10,000 units in its first year then the NRE cost for developing it amounts to $1,000 per unit, double the cost of buying the chip itself. The longer the model remains in production the more its cost can be amortized... but the company has to pay the complete cost to develop the silicon before the first unit is sold.
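The amortization is simple division, sketched here in Python. The $10 million, $500 and 10,000 unit figures come from the example above; the larger volumes are assumptions added only to show how the per-unit burden falls as the product stays in production.

```python
def nre_per_unit(nre_cost, units_sold):
    """NRE spread evenly across every unit sold so far."""
    return nre_cost / float(units_sold)

NRE_COST = 10000000   # chipset design cost, from the example
CHIP_COST = 500       # cost of the silicon per switch, from the example

# 10,000 units is the first-year figure; 20,000 and 50,000 are hypothetical.
for units in (10000, 20000, 50000):
    nre = nre_per_unit(NRE_COST, units)
    print(units, nre, nre / CHIP_COST)
```

At 20,000 units the NRE share equals the chip cost, and it only drops below the chip cost beyond that.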

In the midrange and high end switch markets, the strongest pitch made by merchant semiconductor suppliers wasn't the per-chip cost. A stronger pitch for those segments was elimination of NRE. The networking company didn't have to bear the cost of chip development up front. The company did pay the cost of development, but it would be factored into the unit price and pay-as-you-go rather than upfront.

Yet even NRE savings weren't usually enough to convince a networking company to give up its own ASIC designs. Most realized that to do so was to give up a substantial portion of their ability to differentiate their products. Several vendors adopted a hybrid approach. They used merchant silicon to provide the fabric and handle simple cases of packet forwarding, and configured flow rules to steer interesting packets out to a separate device for additional handling. Costs were reduced by only having to design that specialized chip for a subset of the total traffic through the box, but they retained an ability to differentiate features.

In my experience, eliminating the burden of NRE was not a decisive factor in the move to merchant silicon.

Schedule

The merchant silicon vendors of the world can dedicate more ASIC engineers to their projects. This isn't as big a win as it sounds: tripling the size of the design team does not result in a chip with 3x the features or in 1/3rd the time. As with software projects (see The Mythical Man-Month), the increasing coordination overhead of a larger team results in steeply diminishing returns.

Instead, merchant silicon vendors have the luxury of working on multiple projects in parallel. They can have two teams leapfrogging each other, each working on a multiyear timeline and introducing their products in interleaving years. Alternately, they can target different chips at different market segments. They rely on their SDK to hide gratuitous differences which they happened to introduce, and only make their customers deal with the truly differentiating features of the different chips.

It is difficult to make a case to spend two years to develop custom silicon for a product when merchant silicon with sufficient features is expected to be available a year earlier. Merchant silicon suppliers share details of their roadmap very early, even before the feature set is finalized. This lets them incorporate feedback into the features for the final product, but they also do it to derail in-house silicon efforts.

Yet in my experience at least, though schedule is a decisive factor, this isn't the full story.

Misaligned Incentives

When leading a chip development effort, the biggest fear is not that the chip will have bugs. Many ASIC bugs can be worked around in software.

The biggest fear is not that the chip will be larger and more costly than planned. That is a negotiation issue with the silicon fab, or a business issue in positioning the product.

The biggest fear is that the chip will be late. Missing the market window is the worst kind of failure for an ASIC. The design team produces a chip which meets all requirements, but comes at a time when the market no longer cares. The tardy product will face significant pricing pressure on the day of introduction, more so the longer competitive products have been available.

The technical leadership of an internal ASIC project is therefore incented to plan a schedule which they are sure they can meet. They'll use realistic timelines for the different phases of the product, and include sufficient padding to handle unexpected problems. They will produce best case timelines as well, but those tend to be discounted by the project leadership as unrealistic.

The technical leadership inside merchant semiconductor companies faces the same issues, and produces the same sort of schedule which they are confident they can meet. The difference is that the conservative schedule is not handed out to the decision makers at the customer networking vendors. A more optimistic schedule is maintained and presented to customers - not rosy best case, but certainly optimistic. Everybody knows that schedule will slip, even the customers themselves... but nonetheless it works. Customers work from the optimistic schedule because that is all they have. It increases the difference in schedule between in-house and merchant options by several quarters.

The Point of No Return

ASIC design requires some rather specialized skill sets. There is a great deal of similarity between chip design and software design, but not so much that one can switch freely back and forth. If there is not an active chip development effort underway, the ASIC team tends to run out of interesting things to work on.

When a company begins seriously contemplating building their high end products using merchant silicon, even if the management tries to keep it low key, it becomes pretty obvious internally. You have to pull in senior technical folks from the software, hardware, and ASIC teams to help with the evaluation. News spreads. Gossip spreads faster. If the ASIC team becomes convinced that there will be no further chip projects, they start to move on.

It can easily become a self-fulfilling prophecy: serious consideration of a move to merchant silicon leads to loss of the capability to develop custom ASICs.

Why it Matters

We talk a lot about Software Defined Networking. The term, consciously or not, tends to make people think the networking is all in software and the hardware is insignificant. That isn't actually true, as the SDN can only utilize actions which the hardware can actually do, but it illustrates how much less value we place in hardware now.

In the context of SDN, reducing switch hardware diversity is actually a good thing. It results in a more uniform set of capabilities in networks, and a smaller set of cases for the SDN controller to have to handle. Networking used to be dominated by the hardware designs, but it has moved on now. I think that is a good thing.

Friday, December 30, 2011

#emotivehashtags

Earlier this week Sam Biddle of Gizmodo published How the Hashtag Is Ruining the English Language, decrying the use of hashtags to add additional color or meaning to text. Quoth the article, "The hashtag is a vulgar crutch, a lazy reach for substance in the personal void – written clipart." #getoffhislawn

Written communication has never been as effective as in-person conversation, nor even as simple audio via telephone. Presented with plain text, we lack a huge array of additional channels for meaning: posture, facial expression, tone, cadence, gestures, etc. Smileys can be seen as an early attempt to add emotional context to online communication, albeit a limited one. #deathtosmileys

Yet language evolves to suit our needs and to fit advances in communications technology. A specific example: in the US we commonly say "Hello" as a greeting. It's considered polite, and it has always been the common practice... except that it hasn't. The greeting Hello entered the English language in the mid 19th century with the invention of the telephone. The custom until that time of speaking only after a proper introduction simply didn't work on the telephone; it wasn't practical over the distances involved to coordinate so many people. Use of Hello spread from the telephone into all areas of interaction. I suspect there were people at the time who bemoaned and berated the verbal crutch of the "hello" as they watched it push aside the more finely crafted greetings of the time. #getofftheirlawn

So now we have hashtags. Spawned by the space-constrained medium of the tweet, they are now spreading to other written forms. That they find traction in longer form media is an indication that they fill a need. They supply context, overlay emotional meaning, and convey intent, all lacking in current practice. It's easy to label hashtags as lazy or somehow vulgar. "[W]hy the need for metadata when regular words have been working so well?" questions the Gizmodo piece. Yet the sad reality is that regular words haven't been working so well. Even in the spoken word there is an enormous difference between oratory and casual conversation. A moving speech, filled with meaning in every phrase, takes a long time to prepare and rehearse. It's a rare event, not the norm day to day. The same holds true in the written word. "I apologize that this letter is so long - I lacked the time to make it short." quipped Blaise Pascal in the 17th century.

Disambiguation

Gizmodo even elicited a response from Noam Chomsky, probably via email, "Don't use Twitter, almost never see it."

What I find most interesting about Chomsky's response is that it so perfectly illustrates the problem which emotive hashtags try to solve: his phrasing is slightly ambiguous. It could be interpreted as Chomsky saying he doesn't use Twitter and so never sees hashtags, or that anyone bothered by hashtags shouldn't use Twitter so they won't see them. He probably means the former, but in an in-person conversation there would be no ambiguity. Facial expression would convey his unfamiliarity with Twitter.

For Chomsky, adding a hashtag would require extra thought and effort which could instead have gone into rewording the sentence. That, I think, is the key. For those to whom hashtags are extra work, it all seems silly and even stupid. For those whose main form of communication is short texts, it doesn't. #getoffmylawntoo

Thursday, December 22, 2011

Refactoring Is Everywhere

The utilities used to run from poles, now they are underground. The functionality is unchanged, but the implementation is cleaner.

Monday, December 19, 2011

Multiple Inheritance

Hot Dog cut to resemble octopus tentacles

Friday, December 16, 2011

The Ada Initiative 2012

Earlier this year I donated seed funding to the Ada Initiative, a non-profit organization dedicated to increasing participation of women in open technology and culture. One of their early efforts was development of an example anti-harassment policy for conference organizers, attempting to counter a number of high profile incidents of sexual harassment at events. Lacking any sort of plan for what to do after such an incident, conference organizers often did not respond effectively. This creates an incredibly hostile environment, and makes it even harder for women in technology to advance their careers through networking. Developing a coherent, written policy is a first step toward solving the problem.

The Ada Initiative is now raising funds for 2012 activities, including:

Ada’s Advice: a guide to resources for helping women in open tech/culture
Ada’s Careers: a career development community for women in open tech/culture
First Patch Week: help women write and submit a patch in a week
AdaCamp and AdaCon: (un)conferences for women in open tech/culture
Women in Open Source Survey: annual survey of women in open source

For me personally

There are many barriers discouraging women from participating in the technology field. Donating to the Ada Initiative is one thing I'm doing to try to change that. I'm posting this to ask other people to join me in supporting this effort.

My daughter is 6. The status quo is unacceptable. Time is short.

Monday, December 12, 2011

Go Go Gadget Google Currents!

Last week Google introduced Currents, a publishing and distribution platform for smartphones and tablets. I decided to publish this blog as an edition, and wanted to walk through how it works.

Publishing an Edition

Setting up the publisher side of Google Currents was straightforward. I entered data in a few tabs of the interface:

Edition settings: Entered the name for the blog, and the Google Analytics ID used on the web page.

Sections: added a "Blog" section, sourced from the RSS feed for this blog. I use Feedburner to post-process the raw RSS feed coming from Blogger. However I saw no difference in the layout of the articles in Google Currents between Feedburner and the Blogger feed. As Currents provides statistics using Google Analytics, I didn't want to have double counting by having the same users show up in the Feedburner analytics. I went with the RSS feed from Blogger.

Sections->Blog: After adding the Blog section I customized its CSS slightly, to use the paper tape image from the blog masthead as a header. I uploaded a 400x50 version of the image to the Media Library, and modified the CSS like so:

.customHeader {
  background-color: #f5f5f5;
  display: -webkit-box;
  background-image:  url('attachment/CAAqBggKMNPYLDDD3Qc-GoogleCurrentsLogo.jpg');
  background-repeat: repeat-x;
  height: 50px;
  -webkit-box-flex: 0;
  -webkit-box-orient: horizontal;
  -webkit-box-pack: center;
}

Manage Articles: I didn't do anything special here. Once the system has fetched content from RSS it is possible to tweak its presentation here, but I doubt I will do that. There is a limit to the amount of time I'll spend futzing.

Media Library: I uploaded the header graphic to use in the Sections tab.

Grant access: anyone can read this blog.

Distribute: I had to click to verify content ownership. As I had already gone through the verification process for Google Webmaster Tools, the Producer verification went through without additional effort. I then clicked "Distribute" and voila!

The Point?

iPad screenshot of this site in Google Currents

Much of the publisher interface concerns formatting and presentation of articles. RSS feeds generally require significant work on the formatting to look reasonable, a service performed by Feedburner and by tools like Flipboard and Google Currents. Still, I don't think the formatting is the main point; presentation is a means to an end. RSS is a reasonable transport protocol, but people have pressed it into service as the supplier of presentation and layout as well by wrapping a UI around it. It's not very good at it. Publishing tools have to expend effort on presentation and layout to make it usable.

Nonetheless, for me at least, the main point of publishing to Google Currents is discoverability. I'm hopeful it will evolve into a service which doesn't just show me material I already know I'm interested in, but also becomes good at suggesting new material which fits my interests.

Community Trumps Content

A concern has been expressed that content distribution tools like this, which use web protocols but are not a web page, will kill off the blog comments which motivate many smaller sites to continue publishing. The thing is, in my experience at least, blog comments all but died long ago. Presentation of the content had nothing to do with it: Community trumps Content. That is, people motivated to leave comments tend to gravitate to an online community where they can interact. They don't confine themselves to material from a single site. Only the most massive blogs have the gravitational attraction to hold a community together. The rest quickly lose their atmosphere to Reddit/Facebook/Google+/etc. I am grateful when people leave comments on the blog, but I get just as much edification from a comment on a social site, and just as much consternation if the sentiment is negative, as if it is here. It is somewhat more difficult for me to find comments left on social sites, but let me be perfectly clear: that is my problem, and my job to stay on top of.

The Mobile Web

One other finding from setting up Currents: the Blogger mobile templates are quite good. The formatting of this site in a mobile browser is very nice, and similar to the formatting which Currents comes up with. To me Currents is mostly about discoverability, not just presentation.

Wednesday, December 7, 2011

Requiem for Jumbo Frames

This weekend Greg Ferro published an article about jumbo frames. He points to recent measurements showing no real benefit with large frames. Some years ago I worked on NIC designs, and at the time we talked about Jumbo frames a lot. It was always a tradeoff: improve performance by sacrificing compatibility, or live with the performance until hardware designs could make the 1500 byte MTU be as efficient as jumbo frames. The latter school of thought won out, and they delivered on it. Jumbo frames no longer offer a significant performance advantage.

Roughly speaking, software overhead for a networking protocol stack can be divided into two chunks:

Per-byte, which increases with each byte of data sent. Data copies, encryption, checksums, etc. make up this kind of overhead.
Per-packet, which increases with each packet regardless of how big the packet is. Interrupts, socket buffer manipulation, protocol control block lookups, and context switches are examples of this kind of overhead.
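
As a rough illustration of these two kinds of overhead, here is a minimal Python sketch of a cost model. The function names, the arbitrary cost units, and the 1460 and 8960 byte payload sizes (a 1500 or 9000 byte MTU minus 40 bytes of IP and TCP headers, assuming no options) are my assumptions for illustration, not measurements.

```python
def send_cost(total_bytes, payload_per_packet, per_byte_cost, per_packet_cost):
    """Estimated CPU cost, in arbitrary units, to send total_bytes."""
    # Round up: a partial final packet still pays the full per-packet cost.
    packets = (total_bytes + payload_per_packet - 1) // payload_per_packet
    return per_byte_cost * total_bytes + per_packet_cost * packets


def jumbo_speedup(total_bytes, per_byte_cost, per_packet_cost,
                  std_payload=1460, jumbo_payload=8960):
    """Ratio of standard-frame cost to jumbo-frame cost (above 1 means jumbo wins)."""
    std = send_cost(total_bytes, std_payload, per_byte_cost, per_packet_cost)
    jumbo = send_cost(total_bytes, jumbo_payload, per_byte_cost, per_packet_cost)
    return std / jumbo
```

With a per-packet cost of zero the ratio is exactly 1, and the more the per-byte term dominates, the closer the ratio stays to 1. Jumbo frames only pay off to the extent that per-packet overhead is a large share of the total.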

Wayback machine to 1992

I'm going to talk about the evolution of operating systems and NICs starting from the 1990s, but will focus on Unix systems. DOS and MacOS 6.x were far more common back then, but modern operating systems evolved more similarly to Unix than to those environments.

Address spaces in user space, kernel, and NIC hardware

Let's consider a typical processing path for sending a packet in a Unix system in the early 1990s:

Application calls write(). System copies a chunk of data into the kernel, to mbufs/mblks/etc.
Kernel buffers handed to TCP/IP stack, which looks up the protocol control block (PCB) for the socket.
Stack calculates a TCP checksum and populates the TCP, IP, and Ethernet headers.
Ethernet driver copies kernel buffers out to the hardware. Programmed I/O using the CPU to copy was quite common in 1992.
Hardware interrupts when the transmission is complete, allowing the driver to send another packet.

Altogether the data was copied two and a half times: from user space to kernel, from kernel to NIC, plus a pass over the data to calculate the TCP checksum. There were additionally per packet overheads in looking up the PCB, populating headers, and handling interrupts.

The receive path was similar, with a NIC interrupt kicking off processing of each packet and two and a half copies up to the receiving application. There was more per-packet overhead for receive: where transmit could look up the PCB once and process a sizable chunk of data from the application in one swoop, RX always gets one packet at a time.

Jumbo frames were a performance advantage in this timeframe, but not a huge one. Larger frames reduced the per-packet overhead, but the per-byte overheads were significant enough to dominate the performance numbers.

Wayback Machine to 1999

An early optimization was elimination of the separate pass over the data for the TCP checksum. It could be folded into one of the data copies, and NICs also quickly added hardware support. [Aside: the separate copy and checksum passes in 4.4BSD allowed years of academic papers to be written, putting whatever cruft they liked into the protocol, yet still portraying it as a performance improvement by incidentally folding the checksum into a copy.] NICs also evolved to be DMA devices; the memory subsystem still had to bear the overhead of the copy to hardware, but the CPU load was alleviated. Finally, operating systems got smarter about leaving gaps for headers when copying data into the kernel, eliminating a bunch of memory allocation overhead to hold the TCP/IP/Ethernet headers.

Packet size vs throughput in 2000, 2.5x for 9180 byte vs 1500

I have data on packet size versus throughput in this timeframe, collected in the last months of 2000. It was gathered for a presentation at LCN 2000. It used an OC-12 ATM interface, where LAN emulation allowed MTUs up to 18 KBytes. I had to find an old system to run these tests, as the modern systems of the time could almost max out the OC-12 link with 1500 byte packets. I recall it being a Sparcstation-20. The ATM NIC supported TCP checksums in hardware and used DMA.

Roughly the year 1999 was the peak of when jumbo frames would have been most beneficial. Considerable work had been done by that point to reduce per-byte overheads, eliminating the separate checksumming pass and offloading data movement from the CPU. Some work had been done to reduce the per-packet overhead, but not as much. After 1999 additional hardware focussed on reducing the per-packet overhead, and jumbo frames gradually became less of a win.

LSO/LRO

Protocol stack handing a chunk of data to NIC

Large Segment Offload (LSO), referred to as TCP Segmentation Offload (TSO) in Linux circles, is a technique to copy a large chunk of data from the application process and hand it as-is to the NIC. The protocol stack generates a single set of Ethernet+TCP+IP headers to use as a template, and the NIC handles the details of incrementing the sequence number and calculating fresh checksums for a new header prepended to each packet. Chunks of 32K and 64K are common, so the NIC transmits roughly 21 or 42 TCP segments without further intervention from the protocol stack.
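
To make the segmentation step concrete, here is a minimal Python sketch of what the NIC does with one large buffer. It is an illustration only: the function name is hypothetical, the 1460 byte MSS is an assumption (1500 byte MTU, no IP or TCP options), and the per-segment header rewriting and checksum calculation are left out.

```python
def lso_segments(payload, start_seq, mss=1460):
    """Yield (sequence_number, chunk) pairs, carving one large buffer
    into MSS-sized TCP segments the way an LSO-capable NIC would."""
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers count bytes and wrap at 2**32.
        yield (start_seq + offset) % 2**32, chunk


segments = list(lso_segments(b"x" * 65536, start_seq=1000))
print(len(segments), "segments, last one", len(segments[-1][1]), "bytes")
```

The exact segment count depends on the MSS in use and on whether "64K" means 64,000 or 65,536 bytes, so it will not always match a back-of-the-envelope figure. The point is that the protocol stack is involved once per chunk, not once per segment.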

The interesting thing about LSO and Jumbo frames is that Jumbo frames no longer make a difference. The CPU only gets involved once per large chunk of data, so the overhead is the same whether that chunk turns into 1500 byte or 9000 byte packets on the wire. The main impact of the frame size is the number of ACKs coming back, as most TCP implementations generate an ACK for every other frame. Transmitting jumbo frames would reduce the number of ACKs, but that kind of overhead is below the noise floor. We just don't care.

There is a similar technique for received packets called, imaginatively enough, Large Receive Offload (LRO). For LSO the NIC and protocol software are in control of when data is sent. For LRO, packets just arrive whenever they arrive. The NIC has to gather packets from each flow to hand up in a chunk. It's quite a bit more complex, and doesn't tend to work as well as LSO. As modern web application servers tend to send far more data than they receive, LSO has been of much greater importance than LRO.

Large Segment Offload mostly removed the justification for jumbo frames. Nonetheless support for larger frame sizes is almost universal in modern networking gear, and customers who were already using jumbo frames have generally carried on using them. Moderately larger frame support is also helpful for carriers who want to encapsulate customer traffic within their own headers. I expect hardware designs to continue to accommodate it.

TCP Calcification

There has been a big downside of pervasive use of LSO: it has become the immune response preventing changes in protocols. NIC designs vary widely in their implementation of the technique, and some of them are very rigid. Here "rigid" is a euphemism for "mostly crap." There are NICs which hard-code how to handle protocols as they existed in the early part of this century: Ethernet header, optional VLAN header, IPv4/IPv6, TCP. Add any new option, or any new header, and some portion of existing NICs will not cope with it. Making changes to existing protocols or adding new headers is vastly harder now, as changes are likely to throw the protocol back into the slow zone and render moot any of the benefits it brings.

It used to be that any new TCP extension had to carefully negotiate between sender and receiver in the SYN/SYN+ACK to make sure both sides would support an option. Nowadays due to LSO and to the pervasive use of middleboxes, we basically cannot add options to TCP at all.

I guess the moral is, "be careful what you wish for."

Monday, November 28, 2011

QFabric Followup

In August this site published a series of posts about the Juniper QFabric. Since then Juniper has released hardware documentation for the QFabric components, so it's time for a followup.

QF edge Nodes, Interconnects, and Directors

QFabric consists of Nodes at the edges wired to large Interconnect switches in the core. The whole collection is monitored and managed by out of band Directors. Juniper emphasizes that the QFabric should be thought of as a single distributed switch, not as a network of individual switches. The entire QFabric is managed as one entity.

Control header prepended to frame

The fundamental distinction between QFabric and conventional switches is in the forwarding decision. In a conventional switch topology each layer of switching looks at the L2/L3 headers to figure out what to do. The edge switch sends the packet to the distribution switch, which examines the headers again before sending the packet on towards the core (which examines the headers again). QFabric does not work this way. QFabric functions much more like the collection of switch chips inside a modular chassis: the forwarding decision is made by the ingress switch and is conveyed through the rest of the fabric by prepending control headers. The Interconnect and egress Node forward the packet according to its control header, not via another set of L2/L3 lookups.

Node Groups

The Hardware Documentation describes two kinds of Node Groups, Server and Network, which gather multiple edge Nodes together for common purposes.

Server Node Groups are straightforward: normally the edge Nodes are independent, connecting servers and storage to the fabric. Pairs of edge switches can be configured as Server Node Groups for redundancy, allowing LAG groups to span the two switches.
Network Node Groups configure up to eight edge Nodes to interconnect with remote networks. Routing protocols like BGP or OSPF run on the Director systems, so the entire Group shares a common Routing Information Base and other data.

Why have Groups? It's somewhat easier to understand the purpose of the Network Node Group: routing processes have to be spun up on the Directors, and perhaps those processes have to point to some distinct entity to operate with. Why have Server Node Groups, though? Redundant server connections are certainly beneficial, but why require an additional fabric configuration to allow it?

Ingress fanout to four LAG member ports

I don't know the answer, but I suspect it has to do with Link Aggregation (LAG). Server Node Groups allow a LAG to be configured using ports spanning the two Nodes. In a chassis switch, LAG is handled by the ingress chip. It looks up the destination address to find the destination port. Every chip knows the membership of all LAGs in the chassis. The ingress chip computes a hash of the packet to pick which LAG member port to send the packet to. This is how LAG member ports can be on different line cards: the ingress chip sends the packet to the correct card.
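
The hash-based member selection described above can be sketched in a few lines of Python. This is a generic illustration, not how any particular switch chip works: the function name, the choice of header fields, and the use of CRC32 as the hash are all assumptions standing in for whatever the hardware implements.

```python
import zlib


def pick_lag_member(members, src_ip, dst_ip, src_port, dst_port):
    """Choose one LAG member port for a flow by hashing its header fields.

    Every packet of a flow hashes to the same member, which keeps the
    flow's packets in order on the wire.
    """
    key = "{}|{}|{}|{}".format(src_ip, dst_ip, src_port, dst_port)
    return members[zlib.crc32(key.encode("ascii")) % len(members)]


lag = ["card1/port3", "card1/port4", "card2/port3", "card2/port4"]
print(pick_lag_member(lag, "10.0.0.1", "10.0.0.2", 49152, 80))

# When a member link fails, every ingress chip needs the shorter list
# before it stops selecting the dead port.
lag.remove("card2/port4")
print(pick_lag_member(lag, "10.0.0.1", "10.0.0.2", 49152, 80))
```

Because the member list lives at every ingress point, a link failure means updating all of them, which is the scaling concern for a fabric spread across a network.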

Ingress fanout to four LAG member ports

The downside of implementing LAG at ingress is that every chip has to know the membership of all LAGs in the system. Whenever a LAG member port goes down, all chips have to be updated to stop using it. With QFabric, where ingress chips are distributed across a network and the largest fabric could have thousands of server LAG connections, updating all of the Nodes whenever a link goes down could take a really long time. LAG failure is supposed to be quick, with minimal packet loss when a link fails. Therefore I wonder if Juniper has implemented LAG a bit differently, perhaps by handling member port selection in the Interconnect, in order to minimize the time to handle a member port failure.

I feel compelled to emphasize again: I'm making this up. I don't know how QFabric is implemented nor why Juniper made the choices they made. It's just fun to speculate.

Virtualized Junos

Regarding the Director software, the Hardware Documentation says, "[Director devices] run the Junos operating system (Junos OS) on top of a CentOS foundation." Now that is an interesting choice. Way, way back in the mists of time, Junos started from NetBSD as its base OS. NetBSD is still a viable project and runs on modern x86 machines, yet Juniper chose to hoist Junos atop a Linux base instead.

I suspect that in the intervening time, the Junos kernel and platform support diverged so far from NetBSD development that it became impractical to integrate recent work from the public project. Juniper would have faced a substantial effort to handle modern x86 hardware, and chose instead to virtualize the Junos kernel in a VM whose hardware was easier to support. I'll bet the CentOS on the Director is the host for a Xen hypervisor.

Update: in the comments, Brandon Bennett and Julien Goodwin both note that Junos used FreeBSD as its base OS, not NetBSD.

Aside: with network OSes developed in the last few years, companies have tended to put effort into keeping the code portable enough to run on a regular x86 server. The development, training, QA, and testing benefits of being able to run on a regular server are substantial. That means implementing a proper hardware abstraction layer to handle running on a platform which doesn't have the fancy switching silicon. In the 1990s when Junos started, running on x86 was not common practice. We tended to do development on Sparcstations, DECstations, or some other fancy RISC+Unix machine and didn't think much about Intel. The RISC systems were so expensive that one would never outfit a rack of them for QA, it was cheaper to build a bunch of switches instead.

Aside, redux: Junosphere also runs Junos as a virtual machine. In a company the size of Juniper these are likely to have been separate efforts, which might not even have known about each other at first. Nonetheless the timing of the two products is close enough that there may have been some cross-group pollination and shared underpinnings.

Misc Notes

The Director communicates with the Interconnects and Nodes via a separate control network, handled by Juniper's previous generation EX4200. This is an example of using a simpler network to bootstrap and control a more complex one.
QFX3500 has four QSFPs for 40 gig Ethernet. These can each be broken out into four 10G Ethernet ports, except the first one which supports only three 10G ports. That is fascinating. I wonder what the fourth one does?

That's all for now. We may return to QFabric as it becomes more widely deployed or as additional details surface.

footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, November 23, 2011

Unnatural BGP

Last week Martin Casado published some thoughts about using OpenFlow and Software Defined Networking for simple forwarding. That is, does SDN help in distributing shortest path routes for IP prefixes? BGP/OSPF/IS-IS/etc are pretty good for this, with the added benefit of being fully distributed and thoroughly debugged.

The full article is worth a read. The summary (which Martin himself supplied) is "I find it very difficult to argue that SDN has value when it comes to providing simple connectivity." Existing routing protocols are quite good at distributing shortest path prefix routes, the real value of SDN is in handling more complex behaviors.

To expand on this a bit, there have been various efforts over the years to tailor forwarding behavior using more esoteric cost functions. The monetary cost of using a link is a common one to optimize for, as it provides justification for spending on a development effort and also because the business arrangements driving the pricing tend not to distill down to simple weights on a link. Providers may want to keep their customer traffic off of competing networks who are in a position to steal the customer. Transit fees may kick in if a peer delivers significantly more traffic than it receives, providing an incentive to preferentially send traffic through a peer in order to keep the business arrangement equitable. Many of these examples are covered in slides from a course by Jennifer Rexford, who spent several years working on such topics at AT&T Research.

BGP peering between routers at low weight, from each router to controller at high weight

Until quite recently these systems had to be constructed using a standard routing protocol, because that is what the routers would support. BGP is a reasonable choice for this because its interoperability between modern implementations is excellent. The optimization system would peer with the routers, periodically recompute the desired behavior, and export those choices as the best route to destinations. To avoid having the Optimizer be a single point of failure able to bring down the entire network, the routers would retain peering connections with each other at a low weight as a fallback. The fallback routes would never be used so long as the Optimizer routes are present.
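
The fallback arrangement can be sketched as a simple selection rule. This is a toy model, not BGP's actual decision process: the `Route` tuple, the function name, and the numeric `weight` field (higher wins) are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# weight is an illustrative local preference: the highest value wins.
Route = namedtuple("Route", ["prefix", "next_hop", "weight", "source"])


def select_routes(routes):
    """Return the preferred route for each prefix, keyed by prefix."""
    best = {}
    for route in routes:
        current = best.get(route.prefix)
        if current is None or route.weight > current.weight:
            best[route.prefix] = route
    return best


routes = [
    Route("192.0.2.0/24", "10.1.1.1", 100, "peer"),       # fallback
    Route("192.0.2.0/24", "10.2.2.2", 200, "optimizer"),  # preferred
]
print(select_routes(routes)["192.0.2.0/24"].source)

# If the Optimizer goes away and its routes are withdrawn, the
# low-weight peering route takes over without any other change.
surviving = [r for r in routes if r.source != "optimizer"]
print(select_routes(surviving)["192.0.2.0/24"].source)
```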

This works. It solves real problems. However it is hard to ignore the fact that BGP adds no value in the implementation of the optimization system. It's just an obstacle in the way of getting entries into the forwarding tables of the switch fabric. It also constrains the forwarding behaviors to those which BGP can express, generally some combination of destination address and QoS.

BGP peering between routers, SDN to controller

Product support for software defined networking is now appearing in the market. These are generally parallel control paths alongside the existing routing protocols. SDN deposits routes into the same forwarding tables as BGP and OSPF, with some priority or precedence mechanism to control arbitration.

By using an SDN protocol these optimization systems are no longer constrained to what BGP can express; they can operate on any information which the hardware supports. Yet even here there is an awkward interaction with the other protocols. It's useful to keep the peering connections with other routers as a fallback in case of controller failure, but they are not well integrated. We can only set precedences between SDN and BGP and hope for the best.

I do wonder if the existing implementation of routing protocols needs a more significant rethink. There is great value in retaining compatibility with the external interfaces: being able to peer with existing BGP/OSPF/etc nodes is a huge benefit. In contrast, there is little value to retaining the internal implementation choices inside the router. The existing protocols could be made to cooperate more flexibly with other inputs. More speculatively, extensions to the protocol itself could label routes which are expected to be overridden by another source, and only present as a fallback path.

Monday, November 14, 2011

The Computer is the Network

Modern commodity switch fabric chips are amazingly capable, but their functionality is not infinite. In particular their parsing engines are generally fixed function, extracting information from the set of headers they were designed to process. Similarly the ability to modify packets is constrained to specifically designed in protocols, not an infinitely programmable rewrite engine.

Software defined networks are a wonderful thing, but development of an SDN agent to drive an existing ASIC does not suddenly make it capable of packet handling it wasn't already designed to do. At best, it might expose functions of which the hardware was always capable but had not been utilized by the older software. Yet even that is questionable: once a platform goes into production, the expertise necessary to thoroughly test and develop bug workarounds for ASIC functionality rapidly disperses to work on new designs. If part of the functionality isn't ready at introduction it is often removed from the documentation and retargeted as a feature of the next chip.

Decisions at the Edge

MPLS networks have an interesting philosophy: the switching elements at the core are conceptually simple, driven by a label stack prepended to the packet. Decisions are made at the edge of the network wherever possible. The core switches may have complex functionality dealing with fast reroutes or congestion management, but they avoid having to re-parse the payloads and make new forwarding decisions.

Ethernet switches have mostly not followed this philosophy, in fact we've essentially followed the opposite path. We've tended to design in features and capacity at the same time. Larger switch fabrics with more capacity also tend to have more features. Initially this happened because a chip with more ports required a larger silicon die to have room for all of the pins. Thus, there was more room for digital logic. Vendors have accentuated this in their marketing plans, omitting software support for features in "low end" edge switches even if they use the same chipset as the more featureful aggregation products.

This leaves software defined networking in a bit of a quandary. The MPLS model is simpler to reason about for large collections of switches, you don't have a combinatorial explosion of decision-making at each hop in the forwarding. Yet non-MPLS Ethernet switches have mostly not evolved in that way, and the edge switches don't have the capability to make all of the decisions for behaviors we might want.

Software Switches to the Rescue

A number of market segments have gradually moved to a model where the first network element to touch the packet is implemented mostly in software. This allows the hope of substantially increasing their capability. A few examples:

Datacenters: The first hop is a software switch running in the Hypervisor, like the VMware vSwitch or Cisco Nexus 1000v.

Wide Area Networks: WAN optimizers have become quite popular because they save money by reducing the amount of traffic sent over the WAN. These are mostly software products at this point, implementing protocol-specific compression and deduplication. Forthcoming 10 Gig products from Infineta appear to be the first products containing significant amounts of custom hardware.

Wifi AP with CPU, Wifi MAC, and Ethernet MAC

Wifi Access Points: Traditional, thick APs as seen in the consumer and carrier-provided equipment market are a CPU with Ethernet and Wifi, forwarding packets in software.
Thin APs for Enterprise use as deployed by Aruba/Airespace/etc are rather different: the real forwarding happens in hardware back at a central controller.

Carrier Network Access Units: Like Wifi APs, access gear for DSL and DOCSIS networks is usually a CPU with the appropriate peripherals and forwards frames in software.

Enterprise switch with CPU handling all packets, and a big red X through it

Enterprise: Just kidding, the Enterprise is still firmly in the "more hardware == more better" category. Most of the problems to be solved in Enterprise networking today deal with access control, security, and malware containment. Though CPU forwarding at the edge is one solution to that (attempted by ConSentry and Nevis, among others), the industry mostly settled on out of band approaches.

The Computer is the Network

The Sun Microsystems tagline through most of the 1980s was The Network is the Computer. At the time it referred to client-server computing like NFS and RPC, though the modern web has made this a reality for many people who spend most of their computing time with social and communication applications via the web. It's a shame that Sun itself didn't live to see the day.

We're now entering an era where the Computer is the Network. We don't want to depend upon the end-station itself to mark its packets appropriately, mainly due to security and malware considerations, but we want the flexibility of having software touch every packet. Market segments which provide that capability, like datacenters, WAN connections, and even service providers, are going to be a lot more interesting in the next several years.

Friday, October 28, 2011

Tweetflection Point

Last week at the Web 2.0 Summit in San Francisco, Twitter CEO Dick Costolo talked about recent growth in the service and how iOS5 had caused a sudden 3x jump in signups. He also said daily Tweet volume had reached 250 million. There are many, many estimates of the volume of Tweets sent, but I know of only three which are verifiable as directly from Twitter:

50M tweets/day in March, 2010 according to a Twitter blog post.
140M tweets/day in March, 2011 according to that same Twitter blog post.
250M tweets/day in late October, 2011 according to Dick Costolo.

Graphing these on a log scale shows the rate of growth in Tweet volume, ~~roughly tripling in two years~~ almost tripling in one year.
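
The growth rate can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name is mine, and treating March to late October 2011 as seven months is an assumption.

```python
def annual_growth(start, end, months):
    """Annualized growth factor implied by two data points."""
    return (float(end) / start) ** (12.0 / months)


# March 2010 to March 2011: 50M to 140M tweets/day over 12 months.
print(annual_growth(50e6, 140e6, 12))

# March 2011 to late October 2011: 140M to 250M, roughly 7 months.
print(annual_growth(140e6, 250e6, 7))
```

Both intervals work out to a little under 3x per year (2.8 for the first, roughly 2.7 for the second), which is where "almost tripling in one year" comes from. Three points are not enough to reveal any change in slope between them.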

This graph is misleading though, as we have so few data points. It is very likely that, like signups for the service, the rate of growth in tweet volume suddenly increased after iOS5 shipped. Let's assume the rate of growth also tripled for the few days after the iOS5 launch, and zoom in on the tail end of the graph. It is quite similar up until a sharp uptick at the end.

Speculative graph of average daily Tweet volume, knee of curve at iOS5 launch.

The reality is somewhere between those two graphs, but likely still steep enough to be terrifying to the engineers involved. iOS5 will absolutely have an impact on the daily volume of Tweets, it would be ludicrous to think otherwise. It probably isn't so abrupt a knee in the curve as shown here, but it has to be substantial. Tweet growth is on a new and steeper slope now. It used to triple in a bit over a year, now it will triple in way less than one year.

Why this matters

Even five months ago, the traffic to carry the Twitter Firehose was becoming a challenge to handle. At that time the average throughput was 35 Mbps, with spikes up to about 138 Mbps. Scaling those numbers to today would be 56 Mbps sustained with spikes to 223 Mbps, and about one year until the spikes exceed a gigabit.
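
How long until the spikes pass a gigabit depends heavily on the growth rate you assume. Here is a minimal Python sketch of that arithmetic; the function name and the example growth factors are assumptions for illustration, not figures from Twitter.

```python
import math


def years_until(current_mbps, limit_mbps, annual_factor):
    """Years of compound growth before current_mbps reaches limit_mbps."""
    if annual_factor <= 1:
        raise ValueError("annual_factor must be greater than 1")
    return math.log(float(limit_mbps) / current_mbps) / math.log(annual_factor)


# Spikes of 223 Mbps today, growing at a steady 2.8x per year.
print(years_until(223, 1000, 2.8))

# The same spikes on a steeper, post-iOS5 slope (4.5x is a guess).
print(years_until(223, 1000, 4.5))
```

At a steady 2.8x per year this comes out to roughly a year and a half, and a steeper slope pulls it in toward one year. Either way it is not much time.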

The indications I've seen are that the feed from Twitter is still sent uncompressed. Compressing using gzip (or Snappy) would gain some breathing room, but not solve the underlying problem. The underlying problem is that the volume of data is increasing way, way faster than the capacity of the network and computing elements tasked with handling it. Compression can reduce the absolute number of bits being sent (at the cost of even more CPU), but not reduce the rate of growth.

Fundamentally, there is a limit to how fast a single HTTP stream can go. As described in the post earlier this year, we've scaled network and CPU capacity by going horizontal and spreading load across more elements. Use of a single very fast TCP flow restricts the handling to a single network link and single CPU in a number of places. The network capacity has some headroom still, particularly by throwing money at it in the form of 10G Ethernet links. The capacity of a single CPU core to process the TCP stream is the more serious bottleneck. At some point relatively soon it will be more cost effective to split the Twitter firehose across multiple TCP streams, for easier scaling. The Tweet ID (or a new sequence number) could put tweets back into an absolute order when needed.
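
Putting the tweets back in order on the receiving side is not hard. Here is a minimal Python sketch using a k-way merge from the standard library. It assumes each stream delivers (tweet_id, text) tuples in increasing ID order within that stream; the function name and the sample data are made up for illustration.

```python
import heapq


def merge_by_tweet_id(*streams):
    """Merge several streams of (tweet_id, text) tuples into one
    iterator ordered by tweet_id.

    Each input must already be in increasing tweet_id order. The merge
    is lazy, holding only one pending item per stream in memory.
    """
    return heapq.merge(*streams)


stream_a = [(101, "first"), (104, "fourth")]
stream_b = [(102, "second"), (103, "third"), (105, "fifth")]
for tweet_id, text in merge_by_tweet_id(stream_a, stream_b):
    print(tweet_id, text)
```

Each stream can then be handled by its own link and its own CPU core, with the merge done only by consumers which actually need a total order.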

Unbalanced link aggregation with a single high speed HTTP firehose.

Update: My math was off. Even before the iOS5 announcement, the rate of growth was nearly tripling in one year. Corrected post.