Saturday, October 21, 2017

On CPE Release Processes

Datacenter software is deployed frequently. Push daily! Push hourly! Push on green whenever the tests pass! This works even at extremely large scale: some of the biggest sites deploy new versions multiple times each day, even with much of the site functionality packaged in a single deployable unit.

CPE device software tends not to be deployed nearly so often. There are several reasons for this:

  • Test practices are different.

    Embedded systems development is one of the oldest niches in software, and it does not have a strong tradition even of unit testing, let alone the level of automated testing that makes concepts like push-on-green possible. One can definitely get good unit test coverage of code which the team developed, but the typical system includes a much larger amount of open source code which rarely has unit tests and which is daunting for the team to try to add tests to. Much of the code in the system is only ever going to be tested at the system level. With effort and effective incentives one can develop a level of automated system test coverage... but it still won’t be close to 95%. System-level testing never is; the combinatorial complexity is too high.

    Additionally, with datacenter software, the build system creating the release is often somewhat similar to the production system which will run the release. It may even be the same, if the development team uses prod systems to farm out builds. A reasonable fraction of the system functionality can be run in tests on the builder.

    With CPE devices, the build system is almost always not a CPE being tasked to compile everything. The build system is an x86 server with a cross-compiler. The build system will likely lack much of the hardware which is key to the CPE device functionality, like network interfaces or DRM keystores or video decoders. Large portions of the system may not be testable on the builder.

  • The scale is different.

    Having a million servers in datacenters is a lot: that is one or more very large computing facilities, capable of serving hundreds of millions of customers.

    Having a million CPE devices is not a lot. There are typically multiple devices within the home (modem, router, maybe some set top boxes), so a million devices might represent only a couple hundred thousand customers.

    It can simply take longer to push that amount of software to the much larger number of systems whose network connections will generally be slower than those within the datacenter. Multiple days is typical.

  • The impact of a problem in deployment is different.

    If you have a serious latent bug which is noticed at the 3% point of a rollout within a datacenter, that is probably a survivable event. Customers may be impacted and notice, but you can generally quarantine those 3% of servers from further traffic to end the problem. The servers can be rolled back and restored to service later, even if remediation steps are required, without further impacting customers.

    If you have a serious latent bug which is noticed at the 3% point of a rollout within a CPE fleet, you now have a crisis. 3% of the customer base is impacted by a serious bug, and will feel the impact until you finish all of the remediation steps.

    If the remediation steps in 3% of a datacenter rollout require manual intervention, that will be a significant cost. If the remediation steps in 3% of a CPE fleet deployment require manual intervention, it will have a material impact on the business.
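The scale and bandwidth arithmetic above can be made concrete with a back-of-the-envelope sketch. Every number here (fleet size, image size, per-device bandwidth, concurrency, maintenance window) is an illustrative assumption, not a figure from any real deployment:

```python
# Rough estimate of how long a full-fleet firmware push takes.
# All of the numbers below are illustrative assumptions, not measurements.

def push_days(fleet_size: int, image_mb: float, device_mbps: float,
              concurrent_downloads: int,
              window_hours_per_day: float = 24) -> float:
    """Days to deliver one image to every device, assuming the download
    infrastructure serves a fixed number of devices at a time and devices
    only fetch updates during a daily maintenance window."""
    seconds_per_device = image_mb * 8 / device_mbps   # transfer time for one device
    waves = fleet_size / concurrent_downloads         # devices are served in batches
    total_hours = waves * seconds_per_device / 3600
    return total_hours / window_hours_per_day

# 1M devices, 128 MB image, 5 Mbit/s effective per-device rate,
# 2000 simultaneous downloads, updates only in a 6-hour overnight window:
print(f"{push_days(1_000_000, 128, 5, 2000, 6):.1f} days")  # ≈ 4.7 days
```

Even with fairly generous assumptions the answer lands in the multiple-day range, which is part of why CPE rollouts are staged over days rather than hours.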

We’ll jump straight to the punchline: How often should one deploy software updates to a CPE fleet?

In my opinion: exactly as often as it takes to not feel terrified at the prospect of the next release, no more and no less often than that.

  • Releasing infrequently allows requirements and new development to build up, making the release heavier and giving it more opportunities for accidental breakage. It also results in developer displeasure at having to wait so long for their work to reach customers, and a corresponding rush to get not-quite-baked features in to avoid missing the release.
  • Releasing too frequently can leave too little time to fully test a release. Though frequent releases have the advantage of carrying a much smaller set of changes each, there still needs to be reasonable confidence in the testing.

In the last CPE fleet I was involved in, we tried a number of different cadences: every six weeks, then weekly, then quarterly. I believe the six week cadence worked best. The weekly cadence resulted in a number of bugs being pushed to the fleet, and subsequent rollbacks, simply due to the lack of time to test. The quarterly cadence led to developers engaging in bad behavior to avoid missing a release train, submitting their feature even in terrible shape. The release cadence became even slower, and the quality of the product noticeably lower. I think six weeks was a reasonable compromise, and it left enough headroom to do minor releases at the halfway point as needed, where a very small number of changes which were already tested for the next release could be delivered to customers early.

One other bit of advice: no matter what the release cadence is, once it has been going on long enough, developers will begin griping about it and the leadership may begin to question it (Maxim #4). Leadership interference is what led to the widely varying release processes in the last CPE fleet I was involved in. My only advice there is to manage upwards: announce every release, and copy your management, to keep it fresh in their minds that the process works and delivers updates regularly.