Tuesday, June 7, 2011

The Care and Feeding of Hash Engines

This is the second of a series of articles discussing hash engines for network switch chips. The first installment posted yesterday.

For all their usefulness and versatility, CAMs have several downsides: they consume considerable power, and they are expensive for their capacity. In networking we've become so enamored of CAMs that in dreaming of how to better manage a network, CAMs are all we think about. "How do we get our entries into the CAM?"

This article focuses on the implementation details, strengths, and weaknesses of hash engines in networking hardware.

Hashing in Silicon

A hash table in an ASIC works very much like a hash table in software: an index computed from the key is used to offset into a memory array. Yet there are some key differences in implementation and optimization:

Switch ASIC showing large blocks of memory and logic, with hash function a tiny little sliverHash complexity: In software hash implementations we trade off the time to compute the hash function versus the benefit we get from better distribution. Hardware hash functions are far less sensitive to the complexity of the computation. The concern is only about the amount of chip area the function takes up, and even very complex hash functions are an insignificant fraction of modern chips. ASICs generally use very complex hash functions with good distribution properties. CRCs of various polynomials are common, and MD2 is also used.

Rehashing on collision: In software hash tables we estimate the number of entries we'll have and collision rate we'll tolerate, and size the table accordingly. If too many collisions are noted, we can resize the table to improve it. In hardware implementations no such resize option exists. The table has to get an acceptable collision rate at its fixed size. Thus, hardware designs tend to be very concerned about rehashing behavior after a collision.

Ameliorating this concern about collisions is the differing performance requirements for hardware. In software the performance expectation is "as fast as possible." Every CPU cycle spent on the hash is one less cycle available for something else. In hardware, the hash table needs to sustain the link rate with minimum sized packets, which translates to a number of lookups per second to achieve the desired functionality. There is no value in making the hardware handle more lookups than that. Available lookup cycles will be used to implement additional rehashing to better handle collisions.

ASIC Hash Implementation Tradeoffs

There are two common patterns for hardware hash implementation that I know of, optimized for the type of memory present.

hash table with multiple entries per bucket

Deep buckets: DRAM has relatively slow random access, but fast reads from sequential locations. ASICs using an attached DRAM (or RLDRAM) optimize by using rather deep buckets. An incoming key is hashed to an index, and each entry in the bucket checked in turn to take advantage of the fast sequential reads. If entries are deleted from the bucket they are marked as invalid, but the remaining entries do not need to be packed at the front of the bucket. The hardware will read the entire bucket every time.

wider hash table with one entry per bucketMany hash functions: The downside of large buckets is that they tend to be poorly occupied: an entire bucket must be allocated to hold one entry. If a smaller amount of very fast random access memory is available, like SRAM, large buckets are not a good match. Single entry buckets will be used instead.

In an ASIC its reasonable to compute many hash functions in parallel, either varying CRC polynomials or radically different algorithms. You just plop down more logic blocks, they tend to be small enough not to worry about. A hardware hash implementation with N cycles available can compute N hash functions for offsets, and look for its entry at every one of them.

When we implement multiple hash functions in software, we ensure that entries are stored at their earliest available hash locations. We want the software to stop searching when it finds an empty location, so as to avoid the expense of computing additional functions. This complicates deletions: when an entry is deleted, we need to know if any other table entries would have hashed into that earlier location and move them there. For hardware, this kind of optimization is not done. Hardware has the cycle budget to check all N locations, and there is no value in moving entries.

Failsafe: even with strategies to handle hash collisions, its still the kind of thing that keeps an ASIC designer up at night. What if every address in a customer network just happens to hit the same small set of buckets? What if the table thrashes itself to death and their network melts down? What if it happens to our most important customer? What then? I'll be fired and lose my house and ...

This kind of concern drives designers towards CAMs, which have no such issue with collisions.

CAM function implemented with XNOR and NORTherefore we look for ways to handle even ruinous collisions. A clever fallback option is to include a very small CAM as an adjunct to a hash-and-match design. A CAM on the order of a dozen entries can be synthesized in logic using register bits and gates. This is very inefficient compared with custom CAM silicon, but with such a small number of entries it is tolerable. If no match is found in the hash table, the ASIC will consult its on-chip CAM before giving up. Now you can take the unlikely case of collision in every candidate hash location, and handle twelve of them. Whatever the probability of such catastrophic collision, make it an order of magnitude less likely to cause a problem and it becomes less likely to keep an ASIC designer up at night.

The next post will focus on one of the weak points of hash engines: prefix matching.

footnote: in several places I've said "no value in X." This isn't exactly true, as power consumption would be reduced by minimizing the number of lookups or amount of work done. In most network products the power consumption is dominated by other components like the optics and/or DSPs in the PHYs. Adding complexity in the table lookups to minimize power draw is generally not a good tradeoff to make, and I assert there is no value in doing it.

footnote 2: this blog contains articles on a range of topics. If you want more posts like this, I suggest the Ethernet label.