Tuesday, June 28, 2011

Out of Business Before It Happens

[Figure: switch fabric showing the CRC checked at ingress and discarded]

Ethernet frames carry a CRC32 to ensure data integrity. It is often assumed that this CRC is carried all the way across the network, but in reality there are cases where the switch fabric modifies the frame and calculates a new CRC at egress. For example:

  • L3 routing, to decrement TTL and rewrite MAC DA
  • VLAN tag insertion/removal
  • QoS classification and overwrite of DSCP/802.1p bits

2011 market share of switch fabrics with at least one of these features: 100%

Even the ultra-cheap five port switches sold at retail contain a fabric capable of reasonably sophisticated packet handling. The functionality is not exposed in the software for market segmentation reasons, but the hardware can do it.

ASIC designers hate special cases; they add test burden and risk. Because some logic paths need to check the CRC at ingress and discard it, the chip will always check the CRC at ingress and discard it. Data inside the chip will be protected somehow, if only to counter soft errors, but there is no way to know what the designers have done. It might be ECC, it might be a weak parity scheme.

There are paths across the Internet where we assume a CRC protects us the whole way, when in reality it protects only the links, not the switches.

I've worked on a number of different designs for switch fabrics, NICs, forwarding engines, and packet processors. In bringing up these systems I have twice run into hardware bugs which corrupted packet data and then calculated a fresh CRC over the corrupted data. In both of those cases the bug was fixed. Yet if I know of two such examples, how many more products in the market slipped out with hidden data corruption bugs?

As software developers, what does this mean? It means the application needs to check the integrity of its own data. The network can't be trusted to do it for you.

This is not a theoretical problem. A flipped bit in a control packet caused an Amazon S3 outage in 2008. There were Ethernet CRCs on every link along the path; the corruption happened inside one of the switches or NICs. Amazon added MD5 checks to all control messages to prevent a repeat of the problem. Believing that this only happens at Amazon scale is another way of saying you expect to be out of business before it happens to you. Statistically, it's only a matter of time.

Applications need end-to-end integrity checking. CRC32 carried as part of the payload is fine for this, and is quite efficient on modern CPUs. MD5 and SHA1 are also good choices. MD5 is no longer suitable for security purposes, but for data integrity it is quite strong. Using SSL guarantees integrity of the data over the Internet, though the network within the datacenter must be protected somehow. Given the increasingly heroic measures disks must take for ECC, storing the checksums alongside the data is a good idea too.
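To make this concrete, here is a minimal sketch of an application-level integrity check in C, using zlib's crc32(). The framing (a four byte CRC appended to the payload) and the function names are my own invention for illustration, not any standard.

/* Sketch: application-level integrity check using zlib's crc32().
 * The 4-byte trailer convention here is illustrative, not a standard. */
#include <stddef.h>
#include <zlib.h>

/* Append a CRC32 of buf[0..len) at buf[len]. Caller reserves 4 extra bytes. */
void append_crc32(unsigned char *buf, size_t len) {
    uLong crc = crc32(0L, Z_NULL, 0);        /* initial CRC value */
    crc = crc32(crc, buf, (uInt)len);        /* checksum the payload */
    buf[len + 0] = (unsigned char)(crc >> 24);   /* store big-endian */
    buf[len + 1] = (unsigned char)(crc >> 16);
    buf[len + 2] = (unsigned char)(crc >> 8);
    buf[len + 3] = (unsigned char)(crc);
}

/* Returns 1 if the trailing CRC matches the payload, 0 if corrupted. */
int verify_crc32(const unsigned char *buf, size_t len_with_crc) {
    size_t len = len_with_crc - 4;
    uLong crc = crc32(crc32(0L, Z_NULL, 0), buf, (uInt)len);
    uLong stored = ((uLong)buf[len] << 24) | ((uLong)buf[len + 1] << 16) |
                   ((uLong)buf[len + 2] << 8) | (uLong)buf[len + 3];
    return crc == stored;
}

The sender calls append_crc32() before handing the message to the network; the receiver calls verify_crc32() before trusting the contents, regardless of how many link-level CRCs the packet passed along the way.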

The most important thing is to do something.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.


Wednesday, June 22, 2011

Holtzmann Shields in All the Walls

I am a huge fan of Dune, by Frank Herbert. It posits a star-spanning yet feudal society. It relies on a plot device, the Holtzmann shield, to explain why armies of a technologically advanced civilization would eschew ranged weapons and rely on knife fighting as a primary offensive technique.

Frank Herbert provides the following details:

A shield will permit entry only to objects moving at slow speeds (depending on setting, this speed ranges from six to nine centimeters per second)...

Hmm...


The molecular speed of a gas can be calculated by:

V = \sqrt{\frac{3RT}{M}}

where:
V = root mean square velocity
R = universal gas constant, 8.3145 Joules/(mole Kelvin)
T = temperature in Kelvin
M = Molar mass


Rearranging:

T = \frac{V^2 M}{3R}

We'll use the upper bound of the speed which can penetrate the Holtzmann shield, 9 cm/sec, and assume the molar mass of air on Earth per one International Standard Atmosphere, 28.9644 grams per mole. We'll also swap in the definition of a Joule as meters squared kilograms per second squared. Substituting:

T = \frac{(0.09\ \mathrm{m/s})^2 \times 0.0289644\ \mathrm{kg/mol}}{3 \times 8.3145\ \frac{\mathrm{m^2\,kg}}{\mathrm{s^2\,mol\,K}}}

Working it all out, a gas molecule moving less than 9 cm/sec has a temperature of no more than 0.000009405722533 Kelvin. From this we conclude that the Holtzmann shield is a nearly perfect thermal insulator. Air can only penetrate the shield if it is close to absolute zero.
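A few lines of C reproduce the arithmetic, assuming the International Standard Atmosphere molar mass of 0.0289644 kg/mol:

#include <stdio.h>

int main(void) {
    double V = 0.09;       /* m/s, upper bound to penetrate the shield */
    double M = 0.0289644;  /* kg/mol, molar mass of ISA air */
    double R = 8.3145;     /* J/(mol K), universal gas constant */
    double T = V * V * M / (3.0 * R);
    printf("T = %.15f K\n", T);   /* prints approximately 0.000009405722 */
    return 0;
}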

This leads to the interesting question of what would happen to the shield user first: suffocation, or heat stroke? However that would be taking the joke too far.


Of course, I've probably already taken this too far. The shield is a fictional device, and it serves its literary purpose quite well. If the shield so literally blocked any atomic particle, the vibration of atoms at the tip of the knife would prevent its passage.

Yet this article also served as a way to learn about the display of math formulae in web pages. I started with MathML, thinking that markup would be the One True Way to display information on the web. I had formatted the equations using LaTeX, and converted them to MathML. This turns out not to be ready for normal people to use. WebKit contains a chunk of an implementation, but neither Safari 5 nor Chrome 14 enable rendering of MathML.

I was about to give up and output LaTeX to PNG when Mandy Waite of Google Devrel suggested a far better solution: the Charts API. Though more frequently used to produce pie charts and line graphs, the Charts API can also render formulae. Even better, its input is specified via LaTeX. For example:

LaTeX:  V = \sqrt{\frac{3RT}{M}}
Charts API:  https://chart.googleapis.com/chart?cht=tx&chl=V%20%3D%20%5Csqrt{%5Cfrac{3RT}{M}}

The chl parameter for the Charts API is simply the LaTeX statement, URL-encoded. W00t!


If you'd like to correct my half-remembered Physics, the comments section is open.

Monday, June 20, 2011

Peerspective

Peerspective [peer-spek-tiv] (noun) : Suddenly remembering important feedback which should have gone into the peer review submitted last week.
"Oh no, I forgot about their contribution to the Big Project!"

Wednesday, June 15, 2011

Factorecursion

Factorecursion [fak-tuh-ree-kur-zhuhn] (noun) : Very clever, young man, very clever, but it's factories all the way down.

Foo  FooFactory();
FooFactory  FooFactoryFactory();
FooFactoryFactory  FooFactoryFactoryFactory();
FooFactoryFactoryFactory  FooFactoryFactoryFactoryFactory();

Sunday, June 12, 2011

No Exceptions

The human kind, not computing.

What you say: THERE WILL BE NO EXCEPTIONS TO THIS POLICY.
What I hear: We realize that you may find this policy unreasonable. We want to make it perfectly clear that we do not care what you think.
What you are thinking: We have more important things to spend our time on than this.
What I am thinking: You'll spend more and more of your time and energy enforcing your authority until you find that you have no authority left.

Thursday, June 9, 2011

Grand Unified Theories of Switching

This is the fourth and final article in a series discussing hash engines versus CAMs for network switch chips. The first, second, and third installments were published earlier this week.

Having spilled so much digital ink discussing hash engines in switch hardware, we finally come to the point of the exercise.


Elegance, Meet Implementation

We're currently witnessing a huge shift in networking, with technologies that let external entities take direct control of forwarding decisions in the switch. Current work in this area focuses on allowing flexible extraction of fields from the packet, and definition of flow rules to take action based on matches on any field.

The overwhelming temptation is to define a superset key containing every interesting field from the packet, and use masks to implement specific functions. For example, consider a 9-tuple:

[Figure: nine fields from the packet, with masks to perform routing, ACL, and QoS]

We then populate the table with entries, and use masking to implement the specific functions desired.

Each cell below is value | mask:

Field          Nexthop /24             L3 ACL                       QoS Assign
Ingress port   0 | 0x0                 0 | 0x0                      0 | 0x0
VLAN           0 | 0x0                 0 | 0x0                      0 | 0x0
Ethertype      0 | 0x0                 0 | 0x0                      0 | 0x0
L3 Src         0 | 0x0                 0 | 0x0                      0 | 0x0
L3 Dst         10.1.1.0 | 0xffffff00   199.59.148.82 | 0xffffffff   0 | 0x0
IP Proto       0 | 0x0                 6 | 0xff                     6 | 0xff
DSCP           0 | 0x0                 0 | 0x0                      1 | 0xff
L4 Src         0 | 0x0                 0 | 0x0                      0 | 0x0
L4 Dst         0 | 0x0                 80 | 0xffff                  22 | 0xffff

As an abstraction this seems wonderful: a Grand Unified Theory of Switching. Any packet handling can be expressed using the same mechanism and notation, and we can even use set theory where rules overlap.

As a basis for designing switch hardware, this has some drawbacks. Not only does the reliance on masking mandate a CAM, it places quite a burden on that CAM. The key pictured here is enormously wide: the IPv4 version would be 152 bits, the IPv6 version 344 bits, and there are still more fields we might consider adding. The key format is variable; the hardware wouldn't know in advance how big it will be. A common key width for CAM components is 144 to 147 bits. Matching extremely wide keys requires chaining, where the CAM burns another lookup cycle to handle more key bits. A protocol encouraging very wide keys will seriously tax CAM capacity and performance.
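To make the burden concrete, here is a rough C sketch of a superset key with a per-entry mask. The struct layout and field widths are illustrative only; the exact widths chosen determine the 152 bit figure.

#include <stdint.h>

/* One possible superset key: every interesting field from the packet. */
struct flow_key {
    uint16_t ingress_port;
    uint16_t vlan;        /* 12 bits used */
    uint16_t ethertype;
    uint32_t l3_src;
    uint32_t l3_dst;
    uint8_t  ip_proto;
    uint8_t  dscp;        /* 6 bits used */
    uint16_t l4_src;
    uint16_t l4_dst;
};

/* Each entry carries its own mask: set bits participate in the match. */
struct flow_entry {
    struct flow_key value;
    struct flow_key mask;
};

#define FIELD_MATCHES(f) ((pkt->f & e->mask.f) == (e->value.f & e->mask.f))

/* A packet matches when (packet & mask) == (value & mask) for every field. */
static int flow_match(const struct flow_key *pkt, const struct flow_entry *e) {
    return FIELD_MATCHES(ingress_port) && FIELD_MATCHES(vlan) &&
           FIELD_MATCHES(ethertype)    && FIELD_MATCHES(l3_src) &&
           FIELD_MATCHES(l3_dst)       && FIELD_MATCHES(ip_proto) &&
           FIELD_MATCHES(dscp)         && FIELD_MATCHES(l4_src) &&
           FIELD_MATCHES(l4_dst);
}

A CAM evaluates every entry's equivalent of flow_match() simultaneously; the trouble is that each entry's mask must be stored alongside its value, and the whole key must fit the CAM's width.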


Whither Hash Engines?

Hash engines store their patterns in relatively inexpensive memory like SRAM. At any given price point you can store a much larger number of entries in a hash table than a CAM. Consider some functions in network gear today which hash engines are suited for, as they require either no masking or a small set of fixed masks.

  • In provider networks we rely on the ingress router to handle all classification of a packet, determining the path it will follow and the QoS it will carry. It prepends MPLS labels for the core to rely on.
  • In corporate networks nowadays we use an extensive set of ACLs to protect the network from unauthorized access, as the industry has completely failed to deliver identity-based networks and NAC.
  • In very large L2 domains (using TRILL or similar solutions) we track a large number of MAC addresses. Masking of MAC addresses makes little sense; they are always straightforward lookups.

A Grand Unified Theory of Switching can handle all of these functions, but it does so by turning them into big keys with masks per entry. They become unsuitable for hash engine implementations.


Some Modest Suggestions

The crux of the disconnect between protocol specification and hardware implementation is that each entry carries a mask. At first glance this looks like it would map very well to CAMs. However it leads to very wide keys (bad for CAMs), because it's so easy to define one key and mask off the fields which aren't needed. Additionally, the ability to add numerous mask patterns makes hash engine use almost impossible.

Suggestions:

  1. Require that masks be defined separately from individual entries, and encourage them to be limited in number. CAM vendors are moving to an architecture with a pool of global masks, and hash engines require an even more limited number. The protocol should accommodate this.
    Giving the switch the ability to reject new mask definitions would let hash engines optimize by expanding entries, but this makes applications hard to write: what do they do if the switch rejects something they require? Letting the switch advertise the number of masks it can handle would be another solution, and makes the applications easier to write.
  2. Make it easy to define multiple key formats. It is straightforward for hardware to extract different keys from the packet for different lookups. Much of what a protocol might want to do with masking can be accomplished by defining its own key format instead. The switch would need to advertise the number of formats it can support (and most existing gear would support only 1).

This concludes this series of articles about hash engines in switch hardware. Summary: TL;DR, blah-blah-hash-blah.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Wednesday, June 8, 2011

Ternary Hashing

This is the third in a series of articles discussing hash engines for network switch chips. It is best to start with the first and second installments.


CAMs for networking applications support ternary masking, the ability to mark bits of the key as don't care. This is useful for a number of applications. Consider ACLs: the switch can drop packets which match a particular IP address, L4 port, ingress port, etc. Most of the possible match fields will be marked as don't care, and only those fields for the specific ACL will be populated.

Hash engines can support masking, with some limitations. You don't mask off bits of the hash value once it is computed; instead you mask bits of the original key and compute a new hash. The possible mask values must be known in advance, and each mask requires its own lookup cycle in the table. If a mask excludes the majority of bits in the key it can result in collision problems (too little variance in the keys), but masking off so many bits usually means there are few possible matches anyway.

Lookup   Mask             Notes
1        0xffffffff0000   L3 ACL
2        0x00000000ffff   L4 ACL
3        0xffffffffffff   L3+L4 ACL
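As a sketch, the engine cycles through its fixed masks and performs one hash lookup per mask. The hash() and table_lookup() functions below stand in for hardware pipeline stages; their names and signatures are invented for illustration.

#include <stdint.h>

#define NUM_MASKS 3
static const uint64_t acl_masks[NUM_MASKS] = {
    0xffffffff0000ULL,   /* lookup 1: L3 ACL */
    0x00000000ffffULL,   /* lookup 2: L4 ACL */
    0xffffffffffffULL,   /* lookup 3: L3+L4 ACL */
};

struct acl_entry;
extern uint32_t hash(uint64_t masked_key, int mask_id);
extern const struct acl_entry *table_lookup(uint32_t index, uint64_t masked_key);

const struct acl_entry *acl_lookup(uint64_t key) {
    for (int i = 0; i < NUM_MASKS; i++) {          /* one lookup cycle per mask */
        uint64_t masked = key & acl_masks[i];
        const struct acl_entry *e = table_lookup(hash(masked, i), masked);
        if (e != NULL)
            return e;
    }
    return NULL;                                    /* no ACL matched */
}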

This isn't the way CAMs work. The unmodified key is presented to the CAM, containing all fields, and each entry in the CAM performs its own masking of what bits to ignore. However even CAM vendors are rethinking the utility of having a unique mask for every entry in the CAM. That scheme consumes half of the bits in the CAM to store masks, and in practice this tremendous flexibility is mostly unused. Generally all CAM entries use one of a relatively small set of masks for various prefix lengths and ACL configurations. Some recent CAM designs instead supply a generous pool of global mask registers, and have each entry reference one of the global masks. This leaves more of the CAM bits available to store keys.


Longest Prefix Match

Routers frequently have multiple, overlapping routes for a destination. The router is supposed to choose the most specific route, meaning the one with the longest prefix. For example if 10.0.0.0/26 and 10.0.0.0/8 both match, it should choose the /26. CAMs handle this using ternary masking to ignore the low-order bits of the destination IP. The most specific routes are placed earlier in the CAM, which returns the first match.

It is possible to handle IP prefix matching in a hash engine by spending multiple lookup cycles. Each lookup masks off one more bit of the IP address before computing the hash function, and thus searches for progressively less specific routes until it finds one.

Lookup   Mask
1        255.255.255.255
2        255.255.255.254
3        255.255.255.252
4        255.255.255.248
5        255.255.255.240
6        255.255.255.224
...      etc, etc ...
31       128.0.0.0
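In software terms the lookup sequence resembles this sketch, where route_lookup() stands in for one hash table access per prefix length:

#include <stdint.h>

struct route;
extern const struct route *route_lookup(uint32_t masked_dst, int prefix_len);

const struct route *lpm(uint32_t dst_ip) {
    for (int len = 32; len >= 1; len--) {     /* most to least specific */
        uint32_t mask = (uint32_t)(0xffffffffu << (32 - len));
        const struct route *r = route_lookup(dst_ip & mask, len);
        if (r != NULL)
            return r;                          /* longest prefix wins */
    }
    return NULL;                               /* no route found */
}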

Unfortunately the switch is unlikely to have a cycle budget sufficient for so many lookups. Fortunately, there is a further optimization.


LPM: Make the Target Bigger

[Figure: a /22 route expands to four /24 routes]

We can reduce the number of lookup cycles by trading off an increased number of entries in the table. The hardware will be designed to only search for a limited number of prefix lengths (the specific masks would be programmable). Software expands all entries of prefix lengths between those the hardware will look up. In the sample diagram the hardware can search with a /24 mask, and it will find the /22 route because it has been expanded into four /24 entries. Any incoming IP address in the range which should match the /22 will find one of the four entries in the table.

[Figure: a more specific /23 route overwrites two of the entries added by the /22]

This technique also has to deal with overlapping routes. When a more specific /23 route is added, it should overwrite two of the entries from the less specific /22 route. Addition and deletion of entries in the hash table is complex in this scheme, particularly because it has to be done while packets continue to flow through the device. Nonetheless it's always preferable to tolerate complexity in software than to make the hardware handle it. A bug in production software can be fixed. A bug in production hardware turns into a software bug to work around it.

The prefix expansion doesn't need to be linear. It doesn't have to be every 2 bits, as implied above. Analyzing the distribution of prefix lengths in routing tables shows that they clump at certain lengths rather than being uniformly distributed. Hardware used in a campus switch would be set for the prefixes common in that environment. It might apply masks for prefix lengths 32 (for a local station), 24, 19, 12, and 8. The hardware only needs to budget for 5 lookups, and software expands any other prefixes to match.
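A sketch of the expansion step, assuming the route's prefix is already aligned and a hypothetical insert_entry() performs the actual table update (overwriting any entry already present, which is how a more specific route takes precedence):

#include <stdint.h>

struct route;
extern void insert_entry(uint32_t prefix, int hw_len, const struct route *r);

/* Expand a route of length route_len into entries at the next supported
 * hardware prefix length hw_len (route_len <= hw_len). A /22 expanded
 * to /24 produces four entries. */
void expand_route(uint32_t prefix, int route_len, int hw_len,
                  const struct route *r) {
    uint32_t count = 1u << (hw_len - route_len);  /* e.g. /22 -> /24: 4 */
    uint32_t step = 1u << (32 - hw_len);          /* address stride per entry */
    for (uint32_t i = 0; i < count; i++)
        insert_entry(prefix + i * step, hw_len, r);
}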


Why CAMs for Prefixes?

With the ability to do longest prefix matching in an inexpensive hash engine, why use CAMs at all? There are two main downsides to using hashing for this:

  1. A CAM with N entries will hold N routes. Due to prefix expansion, it's really hard to say how many routes a hash table with N entries will hold.
  2. IPv6. Only the upper 64 bits of an IPv6 address need to deal with prefix lengths, but that's a bunch more lookup cycles.

In the end, a switch which deals with a lot of prefix routing will use a CAM. CAMs are simply better at it. That same switch might still use hash engines for L2 lookups, for ACLs, and anything 5-tuple based. Edge switches which primarily deal with directly attached stations and only a few prefix routes can get by using their hash engine for everything.


The next article in this series will look to the future direction of networking.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Tuesday, June 7, 2011

The Care and Feeding of Hash Engines

This is the second of a series of articles discussing hash engines for network switch chips. The first installment was posted yesterday.

For all their usefulness and versatility, CAMs have several downsides: they consume considerable power, and they are expensive for their capacity. In networking we've become so enamored of CAMs that in dreaming of how to better manage a network, CAMs are all we think about. "How do we get our entries into the CAM?"

This article focuses on the implementation details, strengths, and weaknesses of hash engines in networking hardware.


Hashing in Silicon

A hash table in an ASIC works very much like a hash table in software: an index computed from the key is used to offset into a memory array. Yet there are some key differences in implementation and optimization:

[Figure: switch ASIC showing large blocks of memory and logic, with the hash function a tiny little sliver]

Hash complexity: In software hash implementations we trade off the time to compute the hash function versus the benefit we get from better distribution. Hardware hash functions are far less sensitive to the complexity of the computation. The concern is only about the amount of chip area the function takes up, and even very complex hash functions are an insignificant fraction of modern chips. ASICs generally use very complex hash functions with good distribution properties. CRCs of various polynomials are common, and MD2 is also used.

Rehashing on collision: In software hash tables we estimate the number of entries we'll have and collision rate we'll tolerate, and size the table accordingly. If too many collisions are noted, we can resize the table to improve it. In hardware implementations no such resize option exists. The table has to get an acceptable collision rate at its fixed size. Thus, hardware designs tend to be very concerned about rehashing behavior after a collision.

Ameliorating this concern about collisions are the differing performance requirements for hardware. In software the performance expectation is "as fast as possible": every CPU cycle spent on the hash is one less cycle available for something else. In hardware, the hash table needs to sustain the link rate with minimum sized packets, which translates to a fixed number of lookups per second. There is no value in making the hardware handle more lookups than that. Available lookup cycles will be used to implement additional rehashing to better handle collisions.


ASIC Hash Implementation Tradeoffs

There are two common patterns for hardware hash implementation that I know of, optimized for the type of memory present.

[Figure: hash table with multiple entries per bucket]

Deep buckets: DRAM has relatively slow random access, but fast reads from sequential locations. ASICs using an attached DRAM (or RLDRAM) optimize by using rather deep buckets. An incoming key is hashed to an index, and each entry in the bucket checked in turn to take advantage of the fast sequential reads. If entries are deleted from the bucket they are marked as invalid, but the remaining entries do not need to be packed at the front of the bucket. The hardware will read the entire bucket every time.
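A software model of one deep-bucket lookup might look like the following sketch; the eight entry depth is an arbitrary illustrative choice:

#include <stdint.h>

#define BUCKET_DEPTH 8

struct slot {
    uint8_t  valid;    /* deleted entries are simply marked invalid */
    uint64_t key;
    uint32_t result;
};

struct bucket {
    struct slot slots[BUCKET_DEPTH];
};

int bucket_lookup(const struct bucket *b, uint64_t key, uint32_t *result) {
    /* The DRAM burst fetches the whole bucket; scan it sequentially. */
    for (int i = 0; i < BUCKET_DEPTH; i++) {
        if (b->slots[i].valid && b->slots[i].key == key) {
            *result = b->slots[i].result;
            return 1;
        }
    }
    return 0;
}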


[Figure: wider hash table with one entry per bucket]

Many hash functions: The downside of large buckets is that they tend to be poorly occupied: an entire bucket must be allocated to hold one entry. If a smaller amount of very fast random access memory is available, like SRAM, large buckets are not a good match. Single entry buckets will be used instead.

In an ASIC it's reasonable to compute many hash functions in parallel, either varying CRC polynomials or radically different algorithms. You just plop down more logic blocks; they tend to be small enough not to worry about. A hardware hash implementation with N cycles available can compute N hash functions for offsets, and look for its entry at every one of them.

When we implement multiple hash functions in software, we ensure that entries are stored at their earliest available hash locations. We want the software to stop searching when it finds an empty location, so as to avoid the expense of computing additional functions. This complicates deletions: when an entry is deleted, we need to know if any other table entries would have hashed into that earlier location and move them there. For hardware, this kind of optimization is not done. Hardware has the cycle budget to check all N locations, and there is no value in moving entries.
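The hardware's N-way probe behaves like the loop below, with hash_n() standing in for N distinct hash functions. Note there is no early exit on an empty slot and no relocation of entries; all N candidate locations are simply checked.

#include <stdint.h>

#define N_FUNCS 4

struct slot { uint8_t valid; uint64_t key; uint32_t result; };

extern uint32_t hash_n(int which, uint64_t key);  /* N different functions */
extern struct slot table[];

int multi_hash_lookup(uint64_t key, uint32_t *result) {
    for (int i = 0; i < N_FUNCS; i++) {   /* hardware checks all N locations */
        const struct slot *s = &table[hash_n(i, key)];
        if (s->valid && s->key == key) {
            *result = s->result;
            return 1;
        }
    }
    return 0;
}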


Failsafe: even with strategies to handle hash collisions, it's still the kind of thing that keeps an ASIC designer up at night. What if every address in a customer network just happens to hit the same small set of buckets? What if the table thrashes itself to death and their network melts down? What if it happens to our most important customer? What then? I'll be fired and lose my house and ...

This kind of concern drives designers towards CAMs, which have no such issue with collisions.

[Figure: CAM function implemented with XNOR and NOR gates]

Therefore we look for ways to handle even ruinous collisions. A clever fallback option is to include a very small CAM as an adjunct to a hash-and-match design. A CAM on the order of a dozen entries can be synthesized in logic using register bits and gates. This is very inefficient compared with custom CAM silicon, but with such a small number of entries it is tolerable. If no match is found in the hash table, the ASIC will consult its on-chip CAM before giving up. Now you can take the unlikely case of collision in every candidate hash location, and handle twelve of them. Whatever the probability of such catastrophic collision, make it an order of magnitude less likely to cause a problem and it becomes less likely to keep an ASIC designer up at night.
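In C-flavored pseudocode the fallback path is simply a linear scan bolted onto the main lookup; hash_lookup() and the twelve entry depth are placeholders:

#include <stdint.h>

#define CAM_DEPTH 12

struct cam_entry { uint8_t valid; uint64_t key; uint32_t result; };
static struct cam_entry overflow_cam[CAM_DEPTH];

extern int hash_lookup(uint64_t key, uint32_t *result);  /* main hash table */

int lookup_with_fallback(uint64_t key, uint32_t *result) {
    if (hash_lookup(key, result))
        return 1;
    /* In silicon all CAM entries compare at once; a scan models it. */
    for (int i = 0; i < CAM_DEPTH; i++) {
        if (overflow_cam[i].valid && overflow_cam[i].key == key) {
            *result = overflow_cam[i].result;
            return 1;
        }
    }
    return 0;
}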


The next post will focus on one of the weak points of hash engines: prefix matching.


footnote: in several places I've said "no value in X." This isn't exactly true, as power consumption would be reduced by minimizing the number of lookups or amount of work done. In most network products the power consumption is dominated by other components like the optics and/or DSPs in the PHYs. Adding complexity in the table lookups to minimize power draw is generally not a good tradeoff to make, and I assert there is no value in doing it.

footnote 2: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Monday, June 6, 2011

Hash Tables Versus CAMs

Nowadays when discussing switching and routing, we talk about CAMs (Content Addressable Memory). Yet CAMs came into widespread use in networking gear less than 10 years ago, and the Internet is older than that. Before CAMs, switch ASICs used hash-and-match engines to route packets.

This is the first in a series of articles focusing on hash-and-match for networking applications. Why write this? I believe we rely too much on CAMs in networking today, resulting in equipment which is more expensive than it needs to be. Hash engines can fill many roles with aplomb, yet we plop down a CAM and move on. Furthermore, we're in danger of defining new protocols in a way that effectively mandates a CAM, blocking the ability to use hash tables at all. That would be unfortunate.

This first post provides definitions for the discussion to follow.


Hash tables have been used more or less forever in software. To find data in the table a function is computed over the key. The hash function reduces the key down to a narrow index, which offsets into the table. There is a good description of hash tables at Wikipedia. Hash tables implemented in hardware are very similar to software, differing mainly in the optimizations applicable (which will be covered in a future article in this series).

Hash table performance is very sensitive to the quality of the hash function. Having multiple keys hash to the same index is called a collision. A bad hash function leads to many items colliding in the same small set of entries while leaving most of the table unoccupied.

[Figure: MAC addresses hashing to various locations]

Content Addressable Memory is a completely different type of memory. It differs from RAM in how data is fetched: for RAM you supply an address, for a CAM you supply a key to look for. The CAM returns any data previously associated with that key.

CAMs are often described as being the hardware equivalent of Associative Arrays from software, but I reject this comparison as more misleading than it is helpful. Associative arrays are a high level abstraction implemented on top of underlying data structures; they are generally implemented using hash tables. Hardware CAMs are not implemented using hash tables. They are more akin to magic: a key is presented, and every entry in the CAM simultaneously compares itself to the input.


Ternary CAMs add the ability to selectively mask bits from the input key when searching, and are generally the type used in network devices. For example, consider a routing table with increasingly specific routes for a destination, and how this will be handled in the CAM:

[Figure: CAM entries for various prefix lengths]

The CAM returns the first match it finds, so the most specific routes need to appear earlier. Consider an example where we look for 10.1.1.2. The CAM will apply the 0xFFFFFFFF mask of the first entry and find that it does not match 10.1.1.1. It will apply the second mask 0xFFFFFFF0 and find a match, so this result is returned. The rest of the entries actually are compared (everything happens in parallel inside the CAM) but their results are suppressed by prioritization logic.
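The semantics can be modeled in a few lines of C. In real hardware every entry compares simultaneously and priority logic picks the lowest matching index, but the effect is the same as this sequential scan:

#include <stdint.h>

struct tcam_entry {
    uint32_t value;
    uint32_t mask;     /* 1 bits participate in the comparison */
    uint32_t result;
};

int tcam_lookup(const struct tcam_entry *t, int n, uint32_t key,
                uint32_t *result) {
    for (int i = 0; i < n; i++) {      /* models the priority encoder */
        if ((key & t[i].mask) == (t[i].value & t[i].mask)) {
            *result = t[i].result;
            return 1;                   /* first (most specific) match wins */
        }
    }
    return 0;
}

With the entries described above, a lookup of 10.1.1.2 fails the first entry's 0xFFFFFFFF comparison against 10.1.1.1, then matches under the second entry's 0xFFFFFFF0 mask.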


The next article in this series will focus on hardware hash table implementation issues.


footnote: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.

Thursday, June 2, 2011

Jim Gettys Bufferbloat talk

On April 24th Jim Gettys gave a talk on Bufferbloat. Bufferbloat refers to large variations in throughput over the Internet, caused by the large amounts of buffering in queues at each hop.

Two videocameras recorded the event. One captured a good video track, but with audio badly garbled by echo and noise from the audience. The other captured a clear (albeit low) audio track, but with video resolution insufficient to make out the slides. Though both cameras recorded at 30 frames per second, they disagree over the definitions of "thirty" and "second." Over the course of an hour recording, one camera drifted 14 seconds longer than the other. It took some effort to splice the two together.

Without further ado, here it is.