Thursday, April 8, 2010

Simple Checksums Considered Harmful

Lets talk about iSCSI for a moment, as a launching point for a discussion about data integrity. iSCSI relies on CRC32 to catch data corruption. CRC32 is a good fit for this purpose, but most previous uses of it had been confined to very low levels of the system and implemented in hardware. iSCSI uses CRC32 way up in the protocol header, where it is generally computed in software. The overhead of computing the CRC is one reason why so many hardware offload adaptors were developed for iSCSI.

Intel recently released a whitepaper describing how they achieved 1 million iSCSI operations per second. One fascinating tidbit is that the CRC32 is no longer a bottleneck. The Nehalem architecture includes an instruction to compute it directly, as part of SSE 4.2. The new instruction is described in the Intel64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M. It is on page 3-221 of the December 2009 edition; search for CRC32 in later editions.

CRC32 r32, r/m8Accumulate CRC32 on r/m8
CRC32 r32, r/m16Accumulate CRC32 on r/m16
CRC32 r32, r/m32Accumulate CRC32 on r/m32
CRC32 r64, r/m8Accumulate CRC32 on r/m8
CRC32 r64, r/m64Accumulate CRC32 on r/m64

Thats it. You load words from memory and hand them to the CRC32 instruction. If you were already making a pass over the data for any reason, the CRC calculation is free. Table-driven CRC generation implementations were already fast, but this is even faster.

What does this mean? I think it means weak checksums should no longer be used for anything. Applications which care about data integrity moved to MD5 or SHA1 years ago, but you still see specifications in other contexts written to use Adler-32 or even the venerable 16-bit TCP checksum. Its not appropriate to use these any more. Server CPUs can compute CRC32 for free, and embedded CPUs have long included CRC32 calculation in DMA engines.