Sunday, April 1, 2012

BangIP Option

We're almost out of IPv4 addresses, yet IPv6 deployment is still very, very slow. This is a recipe for disaster. I'm talking End of Internet predicted, film at 11 scale disaster. Something must be done. Steps must be taken.

I humbly suggest that the solution is really quite simple: re-use IPv4 addresses. I don't mean reclaiming IP addresses which have been allocated but are not actually being used. That is the kind of solution which sounds good and righteous to say but is completely unworkable in practice. Instead, I mean to simply issue the same IP address to multiple people.

"How could that possibly work?" people might ask. Go ahead, ask it... I'm glad that you asked, because I have a ready-made explanation waiting. We will leverage a solution to a similar problem which has scaled tremendously well over the last several decades: email addresses.

On early email systems like AOL, addresses had to be unique. This led to such ridiculous conventions as appending a number to the end of a popular address, like CompuGuy112673 (because really, everybody wants to have an email address like CompuGuy). The beauty of Internet email addressing is in breaking them up into a federated system, so and can simultaneously exist.

Therefore I propose to use this same solution for IP addressing. We will allow each Autonomous System to maintain its own IP address space. The same IP address can simultaneously exist within multiple ASNs. We are effectively adding a level of indirection to the destination in the IP header. Disambiguating the actual destination will be done via a new IP option.

IP Option field appended to frameBinary encoding of protocol headers is now passé. Therefore our new IP option is ASCII encoded, which also allows us to avoid the discussion about how Little Endian CPUs have won and all headers should be LE instead of BE. The content of the option is the destination IP address, followed by the '@' sign, followed by the decimal Autonomous System Number where the recipient can be reached. For example, a destination of within the Level 3 network space would be "" using Level3's primary ASN of 3356.

By my reading of the IP spec the option field can be up to 40 bytes, while an IP@ASN encoding could require up to 15 bytes for the IP address, 1 byte for the @ symbol, and could handle billions of IP namespaces using only 13 more bytes to encode a 32 bit ASN. So, we should be good to go.

Updates to the DNS A record and all the application code which looks up IP addresses is left as an exercise for the reader.

For historical reasons, this new option is referred to as the BangIP option. An earlier version of this system used bang-separated IP addresses as a source routing path. That portion of the proposal has been deprecated, but we retain the name for its sentimental value.

Monday, March 26, 2012

On Octal

An article in the March 8 issue of the journal PLoS Computational Biology (as reported by Science Daily) states:

Upon pre-synaptic excitation, calcium ions entering post-synaptic neurons cause the snowflake-shaped CaMKII to transform, extending sets of 6 leg-like kinase domains above and below a central domain, the activated CaMKII resembling a double-sided insect. Each kinase domain can phosphorylate a substrate, and thus encode one bit of synaptic information. Ordered arrays of bits are termed bytes, and 6 kinase domains on one side of each CaMKII can thus phosphorylate and encode calcium-mediated synaptic inputs as 6-bit bytes.

From this we can derive one inescapable conclusion: DEC was right about octal all along.

(Thanks to Sean Hafeez for posting a link to the Science Daily article on Google+)

Thursday, February 9, 2012

ATA Commands in Python

In software development sometimes you spend time on an implementation which you are unreasonably proud of, but ultimately decide not to use in the product. This is one such story.

I needed to retrieve information from an attached disk, such as its model and serial number. There are commands which can do this, like hdparm, sdparm, and smartctl, but initially I tried to avoid building in a dependency on any such tools by interrogating it from the hard drive directly. In pure Python.

Snicker all you want, it did work. My first implementation used an older Linux API to retrieve this information, the HDIO_GET_IDENTITY ioctl. This ioctl maps more or less directly to an ATA IDENTIFY or SCSI INQUIRY from the drive. The implementation uses the Python struct module to define the data structure sent along with the ioctl.

def GetDriveId(dev):
  """Return information from interrogating the drive.

  This routine issues a HDIO_GET_IDENTITY ioctl to a block device,
  which only root can do.

    dev: name of the device, such as 'sda' or '/dev/sda'

    (serial_number, fw_version, model) as strings
  # from /usr/include/linux/hdreg.h, struct hd_driveid
  # 10H = misc stuff, mostly deprecated
  # 20s = serial_no
  # 3H  = misc stuff
  # 8s  = fw_rev
  # 40s = model
  # ... plus a bunch more stuff we don't care about.
  struct_hd_driveid = '@ 10H 20s 3H 8s 40s'
  if dev[0] != '/':
    dev = '/dev/' + dev
  with open(dev, 'r') as fd:
    buf = fcntl.ioctl(fd, HDIO_GET_IDENTITY, ' ' * 512)
    fields = struct.unpack_from(struct_hd_driveid, buf)
    serial_no = fields[10].strip()
    fw_rev = fields[14].strip()
    model = fields[15].strip()
    return (serial_no, fw_rev, model)

No no wait, stop snickering, it does work! It has to run as root, which is one reason why I eventually abandoned this approach.

$ sudo python
('5RY0N6BD', '3.ADA', 'ST3250310AS')

HDIO_GET_IDENTITY is deprecated in Linux 2.6, and logs a message saying that sg_io should be used instead. sg_io is an API to send SCSI commands to a device. sg_io also didn't require my entire Python process to run as root, I'd "only" have to give it CAP_SYS_RAWIO. So of course I changed the implementation... still in Python. Stop snickering.

class AtaCmd(ctypes.Structure):
  """ATA Command Pass-Through"""

  _fields_ = [
      ('opcode', ctypes.c_ubyte),
      ('protocol', ctypes.c_ubyte),
      ('flags', ctypes.c_ubyte),
      ('features', ctypes.c_ubyte),
      ('sector_count', ctypes.c_ubyte),
      ('lba_low', ctypes.c_ubyte),
      ('lba_mid', ctypes.c_ubyte),
      ('lba_high', ctypes.c_ubyte),
      ('device', ctypes.c_ubyte),
      ('command', ctypes.c_ubyte),
      ('reserved', ctypes.c_ubyte),
      ('control', ctypes.c_ubyte) ]

class SgioHdr(ctypes.Structure):
  """<scsi/sg.h> sg_io_hdr_t."""

  _fields_ = [
      ('interface_id', ctypes.c_int),
      ('dxfer_direction', ctypes.c_int),
      ('cmd_len', ctypes.c_ubyte),
      ('mx_sb_len', ctypes.c_ubyte),
      ('iovec_count', ctypes.c_ushort),
      ('dxfer_len', ctypes.c_uint),
      ('dxferp', ctypes.c_void_p),
      ('cmdp', ctypes.c_void_p),
      ('sbp', ctypes.c_void_p),
      ('timeout', ctypes.c_uint),
      ('flags', ctypes.c_uint),
      ('pack_id', ctypes.c_int),
      ('usr_ptr', ctypes.c_void_p),
      ('status', ctypes.c_ubyte),
      ('masked_status', ctypes.c_ubyte),
      ('msg_status', ctypes.c_ubyte),
      ('sb_len_wr', ctypes.c_ubyte),
      ('host_status', ctypes.c_ushort),
      ('driver_status', ctypes.c_ushort),
      ('resid', ctypes.c_int),
      ('duration', ctypes.c_uint),
      ('info', ctypes.c_uint)]

def SwapString(str):
  """Swap 16 bit words within a string.

  String data from an ATA IDENTIFY appears byteswapped, even on little-endian
  achitectures. I don't know why. Other disk utilities I've looked at also
  byte-swap strings, and contain comments that this needs to be done on all
  platforms not just big-endian ones. So... yeah.
  s = []
  for x in range(0, len(str) - 1, 2):
  return ''.join(s).strip()

def GetDriveIdSgIo(dev):
  """Return information from interrogating the drive.

  This routine issues a SG_IO ioctl to a block device, which
  requires either root privileges or the CAP_SYS_RAWIO capability.

    dev: name of the device, such as 'sda' or '/dev/sda'

    (serial_number, fw_version, model) as strings

  if dev[0] != '/':
    dev = '/dev/' + dev
  ata_cmd = AtaCmd(opcode=0xa1,  # ATA PASS-THROUGH (12)
                   protocol=4<<1,  # PIO Data-In
                   # flags field
                   # OFF_LINE = 0 (0 seconds offline)
                   # CK_COND = 1 (copy sense data in response)
                   # T_DIR = 1 (transfer from the ATA device)
                   # BYT_BLOK = 1 (length is in blocks, not bytes)
                   # T_LENGTH = 2 (transfer length in the SECTOR_COUNT field)
                   features=0, sector_count=0,
                   lba_low=0, lba_mid=0, lba_high=0,
                   command=0xec,  # IDENTIFY
                   reserved=0, control=0)
  ASCII_S = 83
  sense = ctypes.c_buffer(64)
  identify = ctypes.c_buffer(512)
  sgio = SgioHdr(interface_id=ASCII_S, dxfer_direction=SG_DXFER_FROM_DEV,
                 mx_sb_len=ctypes.sizeof(sense), iovec_count=0,
                 dxferp=ctypes.cast(identify, ctypes.c_void_p),
                 sbp=ctypes.cast(sense, ctypes.c_void_p), timeout=3000,
                 flags=0, pack_id=0, usr_ptr=None, status=0, masked_status=0,
                 msg_status=0, sb_len_wr=0, host_status=0, driver_status=0,
                 resid=0, duration=0, info=0)
  SG_IO = 0x2285  # <scsi/sg.h>
  with open(dev, 'r') as fd:
    if fcntl.ioctl(fd, SG_IO, ctypes.addressof(sgio)) != 0:
      print "fcntl failed"
      return None
    if ord(sense[0]) != 0x72 or ord(sense[8]) != 0x9 or ord(sense[9]) != 0xc:
      return None
    # IDENTIFY format as defined on pg 91 of
    serial_no = SwapString(identify[20:40])
    fw_rev = SwapString(identify[46:53])
    model = SwapString(identify[54:93])
    return (serial_no, fw_rev, model)

For the unbelievers out there, this one works too.

$ sudo python
('5RY0N6BD', '3.ADA', 'ST3250310AS')

So, there you go. HDIO_GET_IDENTITY and SG_IO implemented in pure Python. They do work, but in the process of working on this and reading existing code it became clear that low level ATA handling is fraught with peril. The disk industry has been iterating this interface for decades, and there is a ton of gear out there that made questionable choices in how to interpret the spec. Most of the code in existing utilities is not to implement the base operations but instead to handle the quirks from various manufacturers. I decided that I didn't want to go down that path, and will instead rely on forking sdparm and smartctl as needed.

I'll just leave this post here for search engines to find. I'm sure there is a ton of demand for this information.

Stop snickering.

Thursday, February 2, 2012


In addition to its role in assigning IPv4 addresses, DHCP has an options mechanism to send other bits of data between client and server. For example there are options to provide the DNS and NTP server addresses along with the client's assigned IP address. In its request, the client lists the options it would like to receive. The server fills in whichever ones it can.

DHCP has always allowed for vendor extensions of the available options, inheriting this support from the earlier BOOTP protocol. The original vendor extension mechanism was very simple: use option 43, and put whatever you like there. A mechanism was provided to encode multiple sub-options within the option 43 data, but made no attempt to coordinate between different vendors use of the space. Vendors immediately began creating conflicts, using the same numeric codes for different means simultaneously. This led to a variety of heroic hacks in which the client would populate its request with magic values which the server would use to figure out the set of vendor options to supply.

DHCP6 defined a more complex encoding, where each vendor includes their unique IANA Enterprise Number as part of its option. Options from different vendors can be accommodated simultaneously. This Vendor-Identifying Vendor Options (VIVO) encoding was also added back to DHCP4 as options 124 and 125. DHCP4 thus has two separate vendor option mechanisms in common use.


The ISC DHCP server can support VIVO options using several different mechanisms. The first, and so far as I can tell most common, is to specify the byte-level payload. The administrator pores over RFCs and vendor documentation to come up with the magic string of bytes to send and types it into dhcpd.conf, where it immediately becomes magic voodoo that everyone is afraid to touch for fear of breaking something.

Avoiding magic byte strings by specifying the format of the options is more difficult to get working, but easier to maintain and understand. We'll consider an example here.

  • Vendor: Frobozzco
  • IANA Enterprise Number (IEN): 12345
  • Code #1: a text string containing the location within the maze.
  • Code #2: an integer describing the percentage likelihood of being eaten by a grue.
    In practice this is always 100%, which many clients simply hard-code.

To implement this in the DHCP config, we define an option space for frobozzco. This just creates a namespace; we bind that namespace to the numeric IEN later. We have to tell DHCP how wide to make the code and length fields. DHCP4 usually uses 1 byte fields, while DHCP6 generally uses 2 bytes. Most of the time vendors don't specify the width they use, and if so you should assume the normal sizes for the protocol. The example below comes from a DHCP6 config, so the code and length are both declared as two bytes. After we've declared all of the option codes, we bind the frobozzco option space to its numeric IANA Enterprise Number. Use of IEN for DHCP is called the Vendor Specific Information Option, so the syntax in the DHCP configuration labels this vsio.{option space name}

option space frobozzco code width 2 length width 2;
option frobozzco.maze-location code 1 = text;
option frobozzco.grue-probability code 2 = uint8;
option vsio.frobozzco code 12345 = encapsulate frobozzco;

Owing to the long and sordid history of numbering conflicts, most vendor extensions define a secret handshake. The client inserts a specific value into a field in its request to trigger the server to respond with the options for that vendor. Frobozzco has decreed that clients should send the string "look north" as a vendor-class option in the request. A DHCP6 vendor-class consists of the vendor's IEN followed by the content. In our case the content consists of another two byte length field, followed by the string. ISC DHCP 4.x doesn't define a type for handling vendor-class in the config but we can construct one using a record, which is a collection of fields defined inside brackets.

option dhcp6.vendor-class code 16 = {integer 32, integer 16, string};

# length=14 bytes, Frobozzco IEN, content=look north
send dhcp6.vendor-class 12345 14 "look north";

Finally, we have to provide a script for dhclient to run to handle the received options. We'll get to the client script a bit later, for now just assume it is in /usr/local/sbin/dhclient-script. Putting it all together, the dhclient6.conf should look like this.

script "/usr/local/sbin/dhclient-script";

option space frobozzco code width 2 length width 2;
option frobozzco.maze-location code 1 = text;
option frobozzco.grue-probability code 2 = uint8;
option vsio.frobozzco code 12345 = encapsulate frobozzco;

option dhcp6.vendor-class code 16 = {integer 32, integer 16, string};

interface "eth0" {
    also request dhcp6.vendor-opts;
    send dhcp6.vendor-class 12345 10 "look north";


On the client we also must provide the script for dhclient to run. The OS vendor will have provided one, often in /sbin or /usr/sbin. We'll copy it, and add handling.

dhclient passes in environment variables for each DHCP option. The name of the variable is "new_<option space name>_<option name>" For the example config above, we'd define a shell script function to write our two options to files in /tmp.

make_frobozzco_files() {                                                       
  mkdir /tmp/frobozzco                                            
  if [ "x${new_frobozzco_maze_location}" != x ] ; then         
    echo ${new_frobozzco_maze_location} > /tmp/frobozzco/maze_location                 
  if [ "x${new_frobozzco_grue_probability}" != x ] ; then
    echo ${new_frobozzco_grue_probability} > /tmp/frobozzco/grue_probability

The dhclient-script provided with the OS will have handling for DNS nameservers. Adding a call to make_frobozzco_files at the same points in the script which handle /etc/resolv.conf is a reasonable approach to take.

I'm mostly blogging this for my own future use, to be able to find how to do something I remember doing before. There you go, future me.

Tuesday, January 31, 2012


When building the software for a new device there are a ton of things which need to go into the filesystem: binaries, libraries, device nodes, a bunch of configuration files, etc. One of the chunks of essential data is a list of trusted certificate authorities for libssl, commonly stored in /etc/ssl/ca_certificates.crt. Its common practice to grab a list of certificates from, and there are various Perl and Python scripts floating around in the search engines to assemble this list into a PEM file suitable for libssl.

2011 was not a good year for certificate authorities. DigiNotar was seized by the Dutch government after it became clear they had been thoroughly breached and generated fraudulent certificates for many large domains. Several Comodo resellers were similarly compromised and generated bogus certs for some of the same sites. Browser makers responded by encoding a list of toxic certificates into the browser, to reject any certificate signed by them.

Encoding a list of toxic certificates is the key phrase in that paragraph. As of 2011, Mozilla's certdata.txt contains both trusted CAs and absolutely untrustworthy, revoked CAs. There is metadata in the entry describing how it should be treated, but several of the scripts floating around grab everything listed in certdata.txt and put it in the PEM file. This is disastrous.

The code to search for is CKT_NSS_NOT_TRUSTED. The utility you are using should check for that type, and skip it. If there is no handling for CKT_NSS_NOT_TRUSTED, then the utility you are using is absolutely broken. Don't use it. I know of at least two which handle this correctly:

Wednesday, January 18, 2012


SOPA isn't dead. It hasn't been defeated. It hasn't been stopped. Its just regrouping.

Their main mistake was in allowing it to become publicly known too long before a decisive vote. Its backers will try again, next time ramming it through in the dead of night. They'll give it a scary title, as anything can be justified if the title of the bill is scary enough.

Bills like SOPA are an attempt to legislate a return to media economics the way it used to be, where the sheer cost of distributing content formed a high barrier to entry. Its the economics of scarcity. Better yet, the law would require someone else to pay the cost of creating this scarcity. If the cost of any infringement, intentional or not, third party or first party, can be made so overwhelming as to be ruinous (and incidentally decoupled from any notion of the actual damage from the infringement), then cheap distribution via the Internet can be made expensive again. We can get back to the cozy media business of prior decades.

Its time to stop playing defense, desperately trying to stop each of these bills.

Its time to start playing offense.

The workings of government are obscure and impenetrable. There are reams of data produced in the form of minutes, committee reports, the Federal Register, and other minutiae, but the whole remains an opaque mass. Lobbyists and political operatives thrive in this environment, as they understand more about the mechanisms by which it operates. Yet one of the recent core competencies of the technology industry is Big Data. There are conclusions which can be drawn from trends within the dataset without having to semantically understand all of it.

I have to believe there are things the tech industry can do beyond simply increasing lobbying budgets.

Friday, January 13, 2012

Merchant Silicon

Last week Greg Ferro wrote about the use of merchant silicon in networking products. I'd like to share some thoughts on the topic.

Chip cost

Fistful of dollarsWe know the story by now: by selling chips into products from multiple networking companies, commodity chips sell in large volume and benefit from larger discounts. This is a compelling factor in the low end of the switching market, where margins are thin and a primary selling point is price.

Yet low price of the switch silicon is not a decisive factor in the midrange and high end of the switching market, where products are more featureful and sell at higher prices. The price of those products is not based on the cost of materials, its based on what the market will bear. The market has traditionally borne a lot: Cisco's profit margins in these segments have been legendary for a decade.

In my experience, chip price was not a decisive factor in the wholesale move to merchant silicon.

Non Recurring Engineering (NRE cost)

Silicon chipSay it costs $10 million to design a chipset for a high end switch, and the resulting set of silicon costs $500 in the expected volumes. If that high end switch sells 10,000 units in its first year then the NRE cost for developing it amounts to $1,000 per unit, double the cost of buying the chip itself. The longer the model remains in production the more its cost can be amortized... but the company has to pay the complete cost to develop the silicon before the first unit is sold.

In the midrange and high end switch markets, the strongest pitch made by merchant semiconductor suppliers wasn't the per-chip cost. A stronger pitch for those segments was elimination of NRE. The networking company didn't have to bear the cost of chip development up front. The company did pay the cost of development, but it would be factored into the unit price and pay-as-you-go rather than upfront.

Yet even NRE savings wasn't usually enough to convince a networking company to give up its own ASIC designs. Most realized that to do so was to give up a substantial portion of their ability to differentiate their products. Several vendors adopted a hybrid approach. They used merchant silicon to provide the fabric and handle simple cases of packet forwarding, and configured flow rules to steer interesting packets out to a separate device for additional handling. Costs were reduced by only having to design that specialized chip for a subset of the total traffic through the box, but they retained an ability to differentiate features.

In my experience, eliminating the burden of NRE was not a decisive factor in the move to merchant silicon.


Gantt chartThe merchant silicon vendors of the world can dedicate more ASIC engineers to their projects. This isn't as big a win as it sounds: tripling the size of the design team does not result in a chip with 3x the features or in 1/3rd the time. As with software projects (see The Mythical Man Month), the increasing coordination overhead of a larger team results in steeply diminishing returns.

Instead, merchant silicon vendors have the luxury of working on multiple projects in parallel. They can have two teams leapfrogging each other, each working on a multiyear timeline and introducing their products in interleaving years. Alternately, they can target different chips at different market segments. They rely on their SDK to hide gratuitous differences which they happened to introduce, and only make their customers deal with the truly differentiating features of the different chips.

It is difficult to make a case to spend two years to develop custom silicon for a product when merchant silicon with sufficient features is expected to be available a year earlier. Merchant silicon suppliers share details of their roadmap very early, even before the feature set is finalized. This lets them incorporate feedback into the features for the final product, but they also do it to derail in-house silicon efforts.

Yet in my experience at least, though schedule is a decisive factor, this isn't the full story.

Misaligned Incentives

When leading a chip development effort, the biggest fear is not that the chip will have bugs. Many ASIC bugs can be worked around in software.

The biggest fear is not that the chip will be larger and more costly than planned. That is a negotiation issue with the silicon fab, or a business issue in positioning the product.

The biggest fear is that the chip will be late. Missing the market window is the worst kind of failure for an ASIC. The design team produces a chip which meets all requirements, but comes at a time when the market no longer cares. The tardy product will face significant pricing pressure on the day of introduction, more so the longer competitive products have been available.

The technical leadership of an internal ASIC project is therefore incented to plan a schedule which they are sure they can meet. They'll use realistic timelines for the different phases of the product, and include sufficient padding to handle unexpected problems. They will produce best case timelines as well, but those tend to be discounted by the project leadership as unrealistic.

The technical leadership inside merchant semiconductor companies face the same issues, and produce the same sort of schedule which they are confident they can meet. The difference is, that conservative schedule is not handed out to the decision makers at the customer networking vendors. A more optimistic schedule is maintained and presented to customers - not rosy best case, but certainly optimistic. Everybody knows that schedule will slip, even the customers themselves... but nonetheless it works. Customers work from the optimistic schedule because that is all they have. It increases the difference in schedule between in-house and merchant options by several quarters.

The Point of No Return

ASIC design requires some rather specialized skill sets. There is a great deal of similarity between chip design and software design, but not so much that one can switch freely back and forth. If there is not an active chip development effort underway, the ASIC team tends to run out of interesting things to work on.

When a company begins seriously contemplating building their high end products using merchant silicon, even if the management tries to keep it low key, it becomes pretty obvious internally. You have to pull in senior technical folks from the software, hardware, and ASIC teams to help with the evaluation. News spreads. Gossip spreads faster. If the ASIC team becomes convinced that there will be no further chip projects, they start to move on.

It can easily become a self-fulfilling prophecy: serious consideration of a move to merchant silicon leads to loss of the capability to develop custom ASICs.

Why it Matters

We talk a lot about Software Defined Networking. The term, consciously or not, tends to make people think the networking is all in software and the hardware is insignificant. That isn't actually true, as the SDN can only utilize actions which the hardware can actually do, but it illustrates how much less we value we put in hardware now.

In the context of SDN, reducing switch hardware diversity is actually a good thing. It results in a more uniform set of capabilities in networks, and a smaller set of cases for the SDN controller to have to handle. Networking used to be dominated by the hardware designs, but it has moved on now. I think that is a good thing.