Thursday, July 16, 2009

Courgette binary patch compression

Recently on the Chromium blog Google announced an improved binary compression algorithm called Courgette. In the example cited Courgette produced a patch that was only 11% of the size of that produced by bsdiff. The design overview has more details on its operation:

Courgette uses a primitive disassembler to find the internal pointers. The disassembler splits the program into three parts: a list of the internal pointer's target addresses, all the other bytes, and an 'instruction' sequence that determines how the plain bytes and the pointers need to be interleaved and adjusted to get back the original input. We call this an 'assembly language' because we can run an 'assembler' to process the instructions and emit a sequence of bytes to recover the original file.

The non-pointer part is about 80% of the size of the original program, and because it does not have any pointers mixed in, it tends to be well behaved, having a diff size that is in line with the changes in the source code. Simply converting the program into the assembly language form makes the diff produced by bsdiff about 30% smaller.

I haven't checked, but I suspect its disassembler supports x86 only. Chromium runs on Windows, MacOS X, and Linux, which all run primarily on x86 systems.

Courgette is of course aimed at updates to Google's Chrome browser, which is installed in very large numbers and frequently updated. Reducing the size of the updates results in a better user experience. Nifty.

Incremental Patching and Embedded Systems

When first posted, this article launched directly from Courgette into a discussion of incremental patching in embedded systems. In the comments Wayne Scott pointed out that this really wasn't fair: Courgette is purely a way to make binary patches smaller. In fact because Courgette requires the complete original binary in order to generate its diffs, it cannot be used to generate independent incremental patches at all. After clarifying my thinking, I've updated the post and added a bit of segue text.

Let us now turn to the more general subject of patching of embedded systems. Whenever there is a problem in the field, there is a strong temptation to push out a fix as rapidly as possible. Whether called a point patch or a hotfix, the basic idea is to patch just the portion of the software causing issues for that customer. Larger, periodic maintenance releases collect all existing hotfixes (plus additional ongoing maintenance work) into a single release suitable for all deployments. For embedded systems, on general principles I don't favor the use of hotfixes. Though it reduces the bandwidth required for updates, I feel the disadvantages outweigh the advantages:

  1. perhaps obviously, you need management software to apply, remove, and list installed patches
  2. tech support has a larger number of variations to deal with
  3. test load increases rather dramatically. If you have 5 independent patches you may need to test the combinations, up to 2^5=32 variations to test, not just 5.
  4. Frequent updates are not a good thing for most embedded systems. Customers want the gear to fade into the background and just work, making them update and reboot too often becomes a distinct negative.
  5. As described in an earlier article I favor storing the boot images in a raw flash partition, not any sort of filesystem, which would make installation of an incremental patch more complex.

I recommend not trying to maintain the most recent maintenance release plus an ever-growing collection of hotfixes. I suggest instead to revise the maintenance release whenever there is a cutomer problem. If other customers are not experiencing the problem then they need not deploy the new release right away. The main benefit is to avoid having 2^N possible combinations of patches in the field, instead having only N minor maintenance releases. Revving the maintenance release also tends to be treated with more care than a simple hotfix is; rushing the process is rarely beneficial.