Sunday, March 23, 2008

Embedded + CPU != x86

I work on embedded systems, generally specialty networking gear like switches, routers, and security products. We use a commodity processor to handle administration tasks like web-based management, a command line interface, SNMP, etc. This processor runs an actual OS, which up to about six years ago would have been a commercial realtime package like vxWorks or QNX. In the last few years everything I've worked on has used Linux.

One question which comes up with reasonable frequency is why embedded systems generally use RISC CPUs like PowerPC, MIPS, and ARM. Usually the question is phrased the other way around: "Why don't you just use x86?" To be clear: the processors used in the embedded systems I'm talking about are not the ultra-cheap 4, 8, or 16 bit CPUs you might think of when someone says "embedded". They are 32 or 64 bit CPUs, with memory protection and other features common to desktop CPUs. Asking why we're not using x86 is a legitimate question.

I'll cover four areas which push embedded systems towards non-x86 CPUs.

  • Volume Discounts
  • System on Chip (SoC)
  • Production lifetime
  • Power and heat

 
Volume Discounts

Let's not kid ourselves: pricing is by far the most important factor in component selection. If the pricing worked out, any other issues with using an x86 processor would either find a resolution or just be accepted as is. Common embedded processors in the market segment I'm talking about range from about $20 at the low end (various ARM9 chips, Freescale PowerPC 82xx) to $500 or more at the high end (Broadcom SB-1 MIPS, PA Semi PowerPC). More importantly, these prices are available at relatively low volumes: in the hundreds to thousands of units, rather than the 50,000+ unit pricing thresholds common in the x86 market. [Actually the pricing in the x86 market is far more complex than volume thresholds, but that quickly gets us into antitrust territory where my head starts to hurt.]

This leads us to one of the vagaries of the embedded market: it is enormous, but very fragmented. For example, Gartner estimates that 264 million PCs were sold in 2007, nearly all of which contain an x86 processor. In the same year almost 3 billion ARM processors were sold. To be sure, the average selling price of an ARM CPU would have been considerably less than that of an x86 CPU, but it is still an enormous market for chip makers.

Yet while the top 6 PC manufacturers consume 50% of the x86 CPUs sold, the embedded market is considerably more fragmented. For every product like the Blackberry or Linksys/Netgear/etc routers selling in the millions of units, there are hundreds of designs selling in far lower volumes. In the embedded market the pricing curve at the tail end is considerably flatter: you can get reasonable prices even at moderate volumes. As a practical matter this means the chip maker's pricing also has to factor in several layers of distributors between themselves and the end customer.


 
System on Chip

Once we get past the pricing and volume discount models, the biggest differences between an embedded CPU and an x86 CPU designed for PCs are in the peripheral support. Consider the architecture of a typical PC: the CPU attaches to its Northbridge via a Front-Side Bus (*note1). The Northbridge handles the high bandwidth interfaces like PCIe and main memory. Lower speed interfaces like integrated ethernet and USB are handled by the Southbridge. Legacy ports like RS-232 have mostly been removed from current Southbridge designs; an additional SuperIO component would be required to support them. Modern PCs have dispensed with these ports, but a serial port for the console is an expected part of most embedded networking gear.

I'll hasten to add that this arrangement makes a great deal of sense for the PC market, given its design cycles. CPU designs change relatively frequently: there will be at least a clock speed increase every few months, and a new core design every couple of years. Memory technologies change somewhat more slowly, requiring a new Northbridge design every few years. The lower-speed interfaces handled by the Southbridge change very slowly; a Southbridge chip can stay on the market for years without modification. Segregating functionality into different chips makes it easier to iterate the design of some chips without having to open up all of the unrelated functionality.

Requiring a Northbridge, Southbridge, and additional supporting chips adds cost in two ways: the cost of the additional chips, and the additional board space they require. The primary functionality of these products is not the CPU: the CPU is really just there for housekeeping, while the primary functionality is provided by other components. Taking up more space for the CPU complex leaves less space for everything else, or requires a more expensive board with denser wiring to fit everything.

(*note1): AMD incorporates the memory controller into the CPU, so DRAM connects directly to it. Future Intel CPUs will do the same thing. They do this to reduce latency to DRAM and therefore enhance performance. A Northbridge chip is still required to provide I/O and connect to the Southbridge.

By contrast, CPUs aimed at the embedded market are System on Chip designs. They contain direct interfaces for the I/O ports and buses typically required for their application, and require minimal support chips. The CPUs I work with include a MIPS or PowerPC CPU core running at a modest frequency of 1GHz or below. The DRAM controller, PCI bus, and all needed ports are part of the CPU chip. We strap Flash and memory to it, and off we go.
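To make "all needed ports are part of the CPU chip" concrete, here is a minimal sketch of what driving an integrated peripheral looks like from C. The register addresses, names, and layout below are invented for illustration; the real values come out of the SoC's reference manual, and under Linux you would normally go through the kernel's serial driver rather than poking the hardware directly.

#include <stdint.h>

/* Hypothetical memory-mapped UART integrated on the SoC. The base
 * address and register offsets are made up for illustration. */
#define UART_BASE    0xE0004500UL
#define UART_TX      (*(volatile uint32_t *)(UART_BASE + 0x0)) /* transmit data */
#define UART_STATUS  (*(volatile uint32_t *)(UART_BASE + 0x8)) /* status bits   */
#define UART_TX_RDY  0x1                                       /* transmitter ready */

static void uart_putc(char c)
{
    while (!(UART_STATUS & UART_TX_RDY))
        ;                    /* spin until the transmitter can take a byte */
    UART_TX = (uint32_t)c;   /* the write goes straight to the on-chip UART */
}

The same pattern, a base address plus a block of device registers, covers the integrated ethernet MACs, the PCI host bridge, and the DRAM controller's configuration registers.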

There are a vast number of embedded CPUs tailored for specific industries, each with a different set of integrated peripherals. For example, automotive designs make heavy use of an interface called the Controller Area Network (CAN). Freescale Semiconductor makes the MPC5xx line of PowerPC CPUs with integrated CAN controllers, a healthy amount of Flash memory directly on the CPU, and other features tailored specifically for automotive designs. Other markets get similar treatment: the massive mobile phone market has a large number of CPUs tailored for its needs.


 
Production lifetime

It's relatively easy for a processor vendor to produce versions of a given CPU at different frequencies: they simply "speed bin" CPUs coming off the line according to the maximum frequency each can handle. CPUs which run faster are sold at higher prices, with reduced prices for lower frequencies. At some point the pricing curve becomes completely artificial, for example when essentially all CPUs coming off the line will run at 1.5 GHz but a 1.0 GHz speed grade is sold because there is sufficient demand for it.

It becomes more difficult for the processor vendor when they switch to a new core design. Maintaining a production line for the older design means one fewer production line for the newer design, which undoubtedly sells at a better profit margin. The vendor will pressure their customers to move off of the oldest designs, allowing them to be End of Lifed (EOL).

PC vendors typically revise their system designs relatively frequently, simply to remain competitive. You might have bought an iMac in 2005 and another iMac in 2007, but if you open them up they won't be the same. By contrast embedded system designs are often not revised at all: once introduced to the market, the exact same product will remain on the price list for years. Newer products with an improved feature set will be added, but if a customer has qualified one particular product they can continue buying that exact same product year after year.

In the embedded world a minimum 5 year production lifetime is standard. The design lifetime to EOL varies widely in the x86 world. An industrial version of Intel's 486 remained in production until 2006, but the Core Solo was only produced for about 18 months.


 
Power and heat

The type of systems I work on are relatively compact, using a 1U rackmount chassis. There are a number of other power-hungry chips in the system in addition to the CPU, and we have to reserve sufficient power to handle a number of PoE ports. All of this power consumption demands airflow to keep it all cool, so we'll have a line of fans blowing a sheet of air over the entire board.

This means we don't have much room in the power supply or cooling capacity for a hot CPU. The power-hungry Pentium 4 was simply unthinkable, and the typical 65 watt Thermal Design Power (TDP) of current x86 CPUs is still too much. We look for something closer to 8 watts. Indeed, none of the systems I've worked on has ever had a CPU fan, just a heat sink.
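To put rough numbers on it: a 24 port switch offering 802.3af PoE has to be able to deliver up to 15.4 watts per port, roughly 370 watts for PoE alone before the switching silicon, PHYs, and fans are counted. Against that budget a 65 watt CPU plus its companion chipset is not just a board space problem, it is a noticeable slice of the power supply and a concentrated hot spot the shared airflow has to be designed around, while an 8 watt SoC under a passive heatsink simply disappears into the thermal budget.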


 
Future Developments

AMD and Intel have both dabbled with embedded derivatives of x86 chips over the years. Intel's 80186 was an 8086 CPU with additional peripherals and logic, making it a System on Chip for the embedded market of its time (I bet you thought they skipped from the 8086 directly to the 80286). Intel also produced various industrial versions of the 386, 486, Pentium, and Core CPUs. AMD produced embedded versions of the 486 and K5, branding them "Elan." These CPUs have experienced various degrees of success, but mostly the chip vendors have seemed to treat the embedded market as a place to dump the hand-me-downs from last year's PCs. If your design really does call for something which looks a lot like a PC you might use these chips, but otherwise the various ARM/MIPS/PowerPC SoC products would be a better fit.

Starting in 2008 Intel began a renewed focus on the embedded market. The Tolopai design is the most interesting for the kind of products I work on: a highly integrated SoC with multiple gigabit ethernet MACs and no integrated graphics. I can definitely see a Tolopai-like processor being used in the sort of products I work on, if the mid-volume pricing is competitive. We'll have some endianness issues to work through: we write a lot of code in C, and the processors we use now are mostly big-endian. Though we try to handle endianness correctly, there are undoubtedly places where we've slipped up.
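As an illustration of the kind of slip-up I mean (the helper names here are invented): code which copies a multi-byte field straight into a packet works by accident on a big-endian CPU, because host order and network order happen to match, and silently produces garbage on little-endian x86. The portable version goes through the byte-order conversion macros.

#include <arpa/inet.h>   /* htonl() */
#include <stdint.h>
#include <string.h>

/* Non-portable: relies on the host being big-endian, which happens to
 * match network byte order on our MIPS and PowerPC targets. */
static void put_length_buggy(uint8_t *pkt, uint32_t len)
{
    memcpy(pkt, &len, sizeof(len));   /* host order straight onto the wire */
}

/* Portable: convert to network (big-endian) order explicitly.
 * htonl() is a no-op on big-endian CPUs and a byte swap on x86. */
static void put_length(uint8_t *pkt, uint32_t len)
{
    uint32_t wire = htonl(len);
    memcpy(pkt, &wire, sizeof(wire));
}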

Sunday, February 24, 2008

Premature Optimization for Fun and Profit

A google search for "premature optimization" turns up tens of thousands of hits, most of them quoting Hoare's maxim that premature optimization is the root of all evil.

However I would argue there is one optimization which, if it is to be done at all, should be enabled at the very start of the project. It is the simplest optimization imaginable: invoke the C compiler with optimization enabled. It might seem obvious to do so, but I've worked on more than one project which made the deliberate decision to start with -O0 to ease debugging and speed development. The intention is always to switch on optimization when the project is approaching release, but it rarely works out that way.

To be sure, there are sound reasons to use -O0 when developing the system. Optimized code is harder to debug: the line numbers in the debugger will not always be correct, variables you want to examine may have been optimized away, etc. However, it turns out to be exceedingly difficult to turn on compiler optimization for a software project which has been under development for a while. Code compiled -O2 differs from -O0 in a number of ways, and there are several classes of bugs which will never be noticed until the day compiler optimization is enabled. I'll detail several of the more common problems I've encountered.


Optimization Tidbit #1: Stack space
char buf[16];
int a, b, c;

a = 0;
b = a + 1;
c = b + 2;

... a and b never used again...

 
Stack frames will be larger when compiled -O0. All declared variables will be allocated stack space in which to store them. When compiled -O2 the compiler looks for opportunities to keep intermediate values in registers, and if a value never needs to be stored to the stack then no space will be allocated for it.

What impact does this have? Because a good portion of the stack frame consists of "useless" space in the unoptimized build, a stack-related bug in the code is far less likely to be noticed as it will not cause any visible failure. For example:


 
char buf[16];
int a, b, c;

strncpy(buf, some_string, 20); /* writes 20 bytes into a 16 byte buffer */

Because the optimized stack frame only contains space for things which are actually used, corruption of the stack is almost guaranteed to cause a failure. I'm not arguing this is a bad thing: walking off the end of a buffer on the stack is an extraordinarily serious coding problem which should be fixed immediately. I'm arguing a somewhat more subtle point: enabling optimization late in the development process will suddenly expose bugs which have been present all along, unnoticed because they caused no visible failures.


 
Optimization Tidbit #2: Stack initialization

When compiled -O0, all automatic variables are automatically initialized to zero. This does not come for free: the compiler emits instructions at the start of each function to laboriously zero it out. When compiled -O2 the stack is left uninitialized for performance reasons, and will contain whatever garbage happens to be there.

I wrote this article based on notes from debugging a large software product at a previous employer, where we transitioned to -O2 after shipping -O0 for several years. One of the routines I disassembled in that debugging effort contained a series of store word instructions, and I jumped to the conclusion that -O0 was deliberately zeroing the stack. I've had that in my head for several years now. However as has been pointed out in the comments of this article, my conclusion was incorrect. The instructions I was poring over at that time must have been for something else; I'll never know exactly what.

Part of the reason for starting this blog was to try to learn more about my craft. I freely admit that the things I don't know can (and do) fill many college textbooks. I'll thank Wei Hu for gently setting me straight on this topic, in the comments for this article. I've deleted the rest of the incorrect details of this Tidbit #2; I'll leave the first paragraph as a reminder.


Example

I worked on a product which had fallen into this trap, and had shipped to customers for a year and a half (through three point releases) still compiled -O0. It became increasingly ridiculous to try to improve performance weaknesses when we weren't even using compiler optimization, so we drew the proverbial line in the sand to make it happen. The most difficult problem to track down was in a module which, sometimes, would simply fail all tests.

int start_module() {
    int enable;    /* note: never explicitly initialized */

    ... other init code ...
    if (enable) activate_module();
}

It kept a variable on the stack indicating whether to enable itself. This variable was not explicitly initialized, but when compiled -O0 one of the routines the main function called was walking off the end of its stack and scribbling a constant, non-zero value over enable. Thus the module would enable itself, quite reliably.

When compiled -O2, that other routine ended up scribbling on something else and leaving enable alone. Thus, whether the module enabled itself depended entirely on the sequence of operations leading up to its activation and what precise garbage was on the stack. Most of the time the garbage would be non-zero, but occasionally, just due to traffic patterns and random chance, we'd be left with enable=0 and the module would fail all tests.
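Independent of fixing the routine that overran its stack frame, the module itself should never have depended on whatever garbage was on the stack. A one-line change makes the default explicit, assuming the module is meant to be enabled by default (sketched against the snippet above, with the same elision):

int start_module() {
    int enable = 1;    /* explicit default instead of stack garbage */

    ... other init code ...
    if (enable) activate_module();
}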

The real problem in this example is the routine which scribbled off its stack frame, but the effect of that bug was to make new regressions appear when -O2 was enabled. The longer a code base is developed, the more difficult it is to switch on optimization.


What does it mean?

Planning to do most of the development with -O0 and then switch on optimization before release means suddenly exposing bugs hidden throughout the code base, even in portions of the code which were considered stable and fully debugged. This will result in significant regressions, and in the crunch time approaching a release the most likely response will be to turn optimization back off and plan to deal with it "in the next release." Unfortunately the next release will also include new feature development, fixes for customer escalations, etc., and going back to turn on optimization may have to wait even longer. The more code is written, the harder it will be.

My recommendation is to do exactly the opposite: enable -O2 (or -Os) from the very start of the project. If you have trouble debugging a particular problem, see if it can be reproduced on a test build compiled -O0, and debug it there. Unoptimized builds should be the exception, not the norm.

Introduction

I've developed embedded systems software for a number of years, mostly networking products like switches and routers. This is not a common development environment among the software blogs I've seen, so I'm writing this to show how the other half lives.

Years ago these types of systems would run a realtime OS like vxWorks, if they used an OS at all. Nowadays they frequently run Linux. Programming for an embedded target, even one running Linux, is a bit different from writing a web service or desktop application:
  • The target is likely using an embedded processor like PowerPC, MIPS, or ARM. Everything is cross-compiled.
  • The target may have a reasonable amount of RAM, but it will probably not have any swap space. Memory footprint is an issue.
  • The target is likely using flash for its filesystem. Image size is an issue.
This leads to a different set of trade-offs for development. One has to ask questions before pulling in a library or framework: can it be cross-compiled? Is the binary small enough to fit?

In this blog I don't expect to say much about Ruby on Rails, web services, or functional languages of any description. This is a blog about the dirty business of building systems. I hope you enjoy it.