Coding Relic: CPU


Monday, August 15, 2011

An Awkward Segue to CPU Caching

Last week Andy Firth published The Demise of the Low Level Programmer, expressing dismay over the lack of low level systems knowledge displayed by younger engineers in the console game programming field. Andy's particular concerns deal with proper use of floating versus fixed point numbers, CPU cache behavior and branch prediction, bit manipulation, etc.

I have to admit a certain sympathy for this position. I've focussed on low level issues for much of my career. As I'm not in the games space, the specific topics I would offer differ somewhat: cache coherency with I/O, and page coloring, for example. Nonetheless, I feel a certain solidarity.

Yet I don't recall those topics being taught in school. I had classes which covered operating systems and virtual memory, but distinctly remember being shocked at the complications the first time I encountered a system which mandated page coloring. Similarly, though I had a class on assembly programming, by the time I actually needed to work at that level I had to learn new instruction sets and many techniques.

In my experience at least, schools never did teach such topics. This stuff is learned by doing, as part of a project or on the job. The difference now is that fewer programmers are learning it. It's not because programmers are getting worse. I interview a lot of young engineers, and their caliber is as high as I have ever experienced. It is simply that computing has grown a great deal in 20 years, there are a lot more topics available to learn, and frankly the cutting edge stuff has moved on. Even in the gaming space which spurred Andy's original article, big chunks of the market have been completely transformed. Twenty years ago casual gaming meant Game Boy, an environment so constrained that heroic optimization efforts were required. Now casual gaming means web based games on social networks. The relevant skill set has changed.

I'm sure Andy Firth is aware of the changes in the industry. It's simply that we have a tendency to assume that markets where there is a lot of money being made will inevitably attract new engineers, and so there should be a steady supply of new low level programmers for consoles. Unfortunately I don't believe that is true. Markets which are making lots of money don't attract young engineers. Markets which are perceived to be growing do, and other parts of the gaming market are perceived to be growing faster.

Page Coloring

Because I brought it up earlier, we'll conclude with a discussion of page coloring. I am not satisfied with the Wikipedia page, which seems to have paraphrased a FreeBSD writeup describing page coloring as a performance issue. In some CPUs, albeit not current mainstream CPUs, coloring isn't just a performance issue. It is essential for correct operation.

Cache Index

[Figure: least significant bits as cache line offset, next few bits as cache index]

Before fetching a value from memory the CPU consults its cache. The least significant bits of the desired address are an offset into the cache line, generally 4, 5, or 6 bits for a 16/32/64 byte cache line.

The next few bits of the address are an index to select the cache line. If the cache has 1024 entries, then ten bits would be used as the index. Things get a bit more complicated here due to set associativity, which lets entries occupy several different locations to improve utilization. A two way set associative cache of 1024 entries would take 9 bits from the address and then check two possible locations. A four way set associative cache would use 8 bits. Etc.
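The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming power-of-two sizes; the function names `cache_address_bits` and `split_address` are made up for this example and do not come from any real tool.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, rejecting anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two, got %r" % (n,))
    return n.bit_length() - 1


def cache_address_bits(num_entries, line_bytes, ways=1):
    """Return (offset_bits, index_bits) for a cache.

    num_entries: total number of cache lines
    line_bytes:  bytes per cache line
    ways:        set associativity (1 means direct mapped)
    """
    offset_bits = log2_exact(line_bytes)
    # Each set holds `ways` lines, so the index only has to select a set.
    index_bits = log2_exact(num_entries // ways)
    return offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (cache index, offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    return index, offset
```

With 1024 entries and 64 byte lines, `cache_address_bits` gives 10 index bits when direct mapped, 9 bits for two way, and 8 bits for four way, matching the numbers in the text.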

Page tag

[Figure: least significant bits as page offset, upper bits as page tag]

Separately, the CPU defines a page size for the virtual memory system. 4 and 8 Kilobytes are common. The least significant bits of the address are the offset within the page, 12 or 13 bits for 4 or 8 K respectively. The most significant bits are a page number, used by the CPU cache as a tag. The hardware fetches the tag of the selected cache lines to check against the upper bits of the desired address. If they match, it is a cache hit and no access to DRAM is needed.

To reiterate: the tag is not the remaining bits of the address above the index and offset. The bits to be used for the tag are determined by the page size, and not directly tied to the details of the CPU cache indexing.

Virtual versus Physical

In the initial few stages of processing the load instruction the CPU has only the virtual address of the desired memory location. It will look up the virtual address in its TLB to get the physical address, but using the virtual address to access the cache is a performance win: the cache lookup can start earlier in the CPU pipeline. It's especially advantageous to use the virtual address for the cache index, as that processing happens earlier.

The tag is almost always taken from the physical address. Virtual tagging complicates shared memory across processes: the same physical page would have to be mapped at the same virtual address in all processes. That is an essentially impossible requirement to put on a VM system. Tag comparison happens later in the CPU pipeline, when the physical address will likely be available anyway, so it is (almost) universally taken from the physical address.

This is where page coloring comes into the picture.

Virtually Indexed, Physically Tagged

From everything described above, the size of the page tag is independent of the size of the cache index and offset. They are separate decisions, and frankly the page size is generally mandated. It is kept the same for all CPUs in a given architectural family even as they vary their cache implementations.

Consider then, the impact of a series of design choices:

32 bit CPU architecture
64 byte cache line: 6 bits of cache line offset
8K page size: 19 bits of page tag, 13 bits of page offset
512 entries in the L1 cache, direct mapped. 9 bits of cache index.
virtual indexing, for a shorter CPU pipeline. Physical tagging.
write back

[Figure: virtually indexed, physically tagged, with 2 bits of page color]

What does this mean? It means the lower 15 bits of the virtual address and the upper 19 bits of the physical address are referenced while looking up items in the cache. Two of the bits overlap between the virtual and physical addresses. Those two bits are the page color. For proper operation, this CPU requires that all processes which map in a particular page do so at the same color. Though in theory the page could be any color so long as all mappings are the same, in practice the virtual color bits are set the same as the underlying physical page.
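To make the overlap concrete, here is a small Python sketch using the design choices listed above (64 byte lines, 512 direct mapped entries, 8K pages). The addresses and the helper names `page_color` and `cache_index` are hypothetical, chosen only to illustrate the bit arithmetic.

```python
LINE_OFFSET_BITS = 6    # 64 byte cache line
INDEX_BITS = 9          # 512 entries, direct mapped
PAGE_OFFSET_BITS = 13   # 8K pages

# Index bits which extend above the page offset: 6 + 9 - 13 = 2
COLOR_BITS = LINE_OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS


def page_color(addr):
    """Return the color: the low bits of the page number used by the index."""
    return (addr >> PAGE_OFFSET_BITS) & ((1 << COLOR_BITS) - 1)


def cache_index(vaddr):
    """Return the cache line selected by a virtual address."""
    return (vaddr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)


# Hypothetical virtual addresses at which processes map the same physical
# page, each looking at byte 0x100 within that page.
vaddr_a = 0x4000 + 0x100   # virtual page 2, color 2
vaddr_b = 0x6000 + 0x100   # virtual page 3, color 3 (mismatched)
vaddr_c = 0xC000 + 0x100   # virtual page 6, color 2 (matches vaddr_a)

same_line_ab = cache_index(vaddr_a) == cache_index(vaddr_b)   # False
same_line_ac = cache_index(vaddr_a) == cache_index(vaddr_c)   # True
```

The mismatched mapping lands on a different cache line even though it refers to the same physical byte, while the mapping with the matching color selects the same line.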

The impact of not enforcing page coloring is dire. A write in one process will be stored in one cache line, a read from another process will access a different cache line.

Page coloring like this places quite a burden on the VM system, and one which would be difficult to retrofit into an existing VM implementation. OS developers would push back against new CPUs which proposed to require coloring, and you used to see CPU designs making fairly odd tradeoffs in their L1 cache because of it. HP PA-RISC used a very small (but extremely fast) L1 cache. I think they did this to use direct mapped virtual indexing without needing page coloring. There were CPUs with really insane levels of set associativity in the L1 cache, 8 way or even 16 way. This reduced the number of index bits to the point where a virtual index wouldn't require coloring.
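The tradeoff between associativity and coloring can be sketched as follows. A virtual index needs no coloring when the index and line offset bits fit entirely within the page offset, which is the same as saying the cache size divided by the number of ways is no larger than a page. The function name is invented for this example, and it assumes power-of-two sizes.

```python
def min_ways_without_coloring(cache_bytes, page_bytes):
    """Smallest power-of-two associativity at which a virtually indexed
    cache uses only page offset bits for its index.

    Assumes cache_bytes and page_bytes are powers of two.
    """
    ways = 1
    # Bytes covered by one way must not exceed one page.
    while cache_bytes // ways > page_bytes:
        ways *= 2
    return ways
```

For the 32K direct mapped cache with 8K pages described earlier, this gives 4 ways. A 64K cache with 4K pages would need 16 ways, which is the kind of number that led to the designs mentioned above.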

Thursday, May 5, 2011

Intel 22nm and Mobile Computing

Yesterday Intel announced details of their 22nm silicon process. There were a number of fascinating details on how the transistors are made, but one graph from the presentation presages what will happen in the mobile market over the next several years.

As the voltage goes to zero, the consumed current goes to zero. It sounds obvious, but really isn't. Even when nominally "off," transistors have always leaked current into the substrate. As silicon features have gotten smaller the power they consume while active has declined rapidly, but the leakage current less rapidly.

Other graphs in the presentation show a tradeoff between operating voltage and leakage current, which means power consumed while active versus power consumed at all times. Intel's production chips will likely tolerate a little leakage in exchange for a lower operating voltage, but the leakage should still be very low.

In 32nm silicon processes leakage current may already be the primary factor in power consumption. It is difficult to estimate how serious the effect is, but this article from March 2008 shows leakage current as relatively insignificant in 180 nm silicon but growing to nearly 40% of total power consumption in a 50 nm process. We're at 32nm now.

Except Intel just changed the game.

ARM has several advantages in the mobile space. Their products are available from many manufacturers and their support in software toolchains is nearly universal, but their biggest advantage has been low power consumption compared to x86 or other architectures. ARM did a great job designing chips which are very sparing of the power they consume while operating.

Except Intel just changed the game.

Intel now has a silicon process with radically lower leakage current. x86 consumes more power while actively operating, but leakage current is more significant. ARM's competitive advantage has shrunk substantially. Expect to see a lot more x86 CPUs in mobile devices, starting in late 2011.

Thursday, January 13, 2011

Microsoft and ARM

In July 2010 Microsoft signed an architecture license with ARM. In January 2011 Microsoft announced that Windows 8 will run on ARM CPUs. So the license was purchased to support the Windows development effort, right?

I really don't think so.

Porting Windows to a new processor architecture is a massive undertaking. To its credit, Microsoft has maintained the discipline to keep the OS from becoming too entangled with the underlying platform. At various points in its history Windows NT has run on Alpha, MIPS, PowerPC, and Itanium. All but Itanium are long discontinued, and given the enormous codebase and inertia adding a new instruction set would take quite a bit of effort. If Microsoft needed an architectural license to proceed, they needed it more than 6 months before a public demonstration of the result.

Additionally, an architectural license is not required to port software to ARM, not even for software as extensive as NT. ARM will provide the necessary support via other, less spectacular arrangements. An architectural license allows the licensee to develop their own implementation of the instruction set, either completely independently or as a substantial modification to a core supplied by ARM. Most producers of ARM chips don't have an architectural license; they don't need one to add peripherals and coprocessors around an unmodified ARM core.

Microsoft is engaging with ARM on multiple fronts, and as it involves Windows (and Office) it would be at the CxO level. I think the ARM license is for Xbox, not for Windows Mobile and not for the NT port. In other products Microsoft relies on hardware partners, which would seriously complicate an effort to introduce a custom CPU. Xbox is the one place where Microsoft produces its own platform in volumes large enough to warrant custom ASIC development; they rely on contract manufacturers to build it, but the design and finished product are unique to Microsoft.

Developing a customized ARM processor isn't easy, but it isn't unapproachably difficult either. The current Xbox relies on a PowerPC processor from IBM, but PPC is increasingly being relegated to the very low and very high ends of the market. Embedded controllers don't have the needed processing power, while supercomputer CPUs are too expensive and too hot. Xbox has already changed CPUs once, from an x86 in the original to the current PPC. Microsoft has to be weighing the alternatives of completely funding a suitable PowerPC core design, or switching to a different architecture with more presence in the midrange. Nowadays that means x86 or ARM. I think they've chosen ARM, and I previously speculated what a CPU designed specifically for Xbox might look like.

Monday, November 1, 2010

Intel and Achronix Get Engaged

[Figure: fake Intel x86 with FPGAs]

In January JP Morgan predicted that Intel would acquire an FPGA vendor in 2010. Speculation immediately focussed on Altera and Xilinx, which are large enough to have a material impact on Intel's sales. I wrote about it then, speculating that Intel would use the technology to get into various embedded market segments without needing a zillion SoC variants. Choose a die with appropriate I/O pins, load the logic into FPGA blocks alongside the CPU, and voila!

Yesterday the Wall Street Journal reported that Intel is opening their fabs to Achronix Semiconductor, a startup with interesting FPGA technology. The Achronix home page highlights what is presumably the immediate benefit to Intel, in unlocking additional sales to US military and intelligence agencies.

"The Achronix Speedster22i FPGA Platform uniquely enables applications that require an end-to-end supply chain within the United States. Being built at an onshore location offers significant advantages to programmable logic users who demand the highest level of security."

Presumably the agencies interested in using these parts want to embed optimized hardware to offload algorithms from software. This can be necessary for some applications, if the customer has the resources to implement it. The desire for an on-shore supply chain which can be audited is in reaction to the inadvertent use of counterfeit chips in previous military systems.

Achronix is using branding for the product line which looks remarkably like Intel's, and it seems certain the deal has provisions for cancellation or modification upon change of control to another party. This announcement also amounts to Intel marking their territory for an acquisition.

I/Os Considered Important

DoD requirements notwithstanding, there are relatively few applications where embedding algorithms in FPGAs makes sense. The drawback has never been a technological one, in requiring closer cooperation between CPU and FPGA. It is a business issue: once you commit to a specialized hardware design, the clock starts ticking. There will come a day when a software implementation could meet the requirements, and at that point the FPGA becomes an expensive liability in the BOM cost. You have to make enough profit from the hardware offload product to pay for its own design, plus a redesign in software, or the whole exercise turns out to be a waste of money.

There is another quote on the Achronix technology page which is quite relevant:

"Speedster FPGAs include four embedded DDR1/2/3 controllers, each offering up to 72 bits of data at 1066 Mbps. ... The DDR controllers are fully by-passable so the pins can be used as general I/O if the DDR controllers are not needed." (emphasis added)

Being able to select various I/O drivers for a pin in an FPGA is relatively common, but generally quite limited. Very high speed SERDES pins often cannot be reassigned or are restricted in what else they can be used for, because the high speed interface is sensitive to layout and loading. If Achronix has developed robust I/O muxing with more flexibility, this would be very interesting to Intel. It gets them closer to having a small selection of silicon dies, with different IP loads to target specific markets.

Using FPGAs as a way to tailor chips for specific markets makes a lot more sense than algorithm offload, IMHO. This provides products which could not otherwise exist, as it would be difficult to justify the incremental cost of each different chip. Amortizing the cost of silicon development over a much larger number of different applications makes more sense.

Wednesday, August 18, 2010

x86 vs ARM Mobile CPUs

The ARM architecture dominates mobile computing. It is used in all popular mobile phones and in a huge percentage of battery powered devices generally. This is due partly to its good overall performance, but especially due to its performance per watt expended. ARM chips consume very little power when compared to x86, and ARM's power consumption still excels even when compared to other RISC chips. At one time even Intel manufactured ARM chips, the result of its purchase of the DEC semiconductor business and its excellent StrongARM design. In 2006 Intel sold its ARM products to Marvell Semiconductor, committing to x86 for every segment of the computing market.

It's easy to assume that this state of affairs will continue, and that Intel will never successfully compete in the mobile market. I suspect that is too simplistic an assumption. There are two main sources of power dissipation in modern microprocessors: the power consumed by transistors actively switching, and the power lost to leakage current.

[Figure: active current, leakage current into substrate]

x86 vs ARM: Active Power

It requires power to switch a CMOS transistor 0->1 or 1->0, so one way to reduce power consumption is to have fewer transistors and to switch them at a lower frequency. x86 is at a disadvantage here compared to ARM, which Intel and AMD's design teams have to cover with extra work and cleverness. The vagaries of the x86 instruction set burden it with hardware logic which ARM does not require.

Since the Pentium Pro, Intel has decoded complex x86 instructions down to simpler micro-ops for execution. AMD uses a similar technique. This instruction decode logic is active whenever new opcodes are fetched from RAM. ARM has no need for this logic, as even its alternate Thumb encoding is a relatively straightforward mapping to regular ARM instructions.
x86_32 exposes only a few registers to the compiler. To achieve good performance, x86 CPUs implement a much larger number of hardware registers which are dynamically renamed as needed. ARM does not require such extensive register renaming logic.
Every ARM instruction is conditional, and simple if-then-else constructs can be handled without branches. x86 relies much more heavily on branches, but frequent branches can stall the pipeline on a processor. Good performance in x86 requires extensive branch prediction hardware, where ARM is served with a far simpler implementation.

x86 vs ARM: Leakage Current

[Figure: Intel Nehalem processor die]

Leakage current became a significant contributor to power consumption in 2003 with the move from 0.18 to 0.13 micron feature sizes, and has become more significant in each subsequent generation. The industry is now moving into 0.032 micron technologies.

A capacitor is formed when two conductive materials are separated by an insulator, called the dielectric. The capacitance is determined by the quality of the insulating material, quantified by the dielectric constant k. Higher k means more capacitance. "Leakage" is current which is able to flow out of the ASIC transistors and into the silicon substrate. To reduce the current leaking out, one needs to make a better dielectric between the transistor and the bulk of the silicon. This is generically referred to as high-k silicon technology.

As we're now talking about silicon fabrication techniques, we have to start talking about Intel specifically rather than the x86 architecture in general. Intel began using a high-k dielectric in production in 2007, during the 45 nm generation of parts. The rest of the industry has been experimenting with such materials, but is only now rolling them into the 32 nm generation. Intel hasn't stopped working on the technique, and their 32 nm process benefits from the last several years of experience.

x86 vs ARM: Predicting The Future

Leakage current becomes more significant with each generation of process technology. The power consumed by actively switching transistors has been radically reduced over the last few years, leaving leakage as the more significant source of current consumption. It is difficult to estimate how serious the effect is, but this article from March 2008 shows leakage current starting out relatively insignificant in 180 nm silicon but growing to nearly 40% of total power consumption in a 50 nm process.

So far as I can see, this trend will continue. Leakage current will soon become the dominant factor in CPU power consumption. In fact, in 32 nm processes it might already be the primary factor. This is where the game changes: the advantage for total power consumption shifts away from the efficiency of the CPU architecture and design, and to the process technology of the fab. Presumably, this trend informed Intel's decision to sell their ARM assets to Marvell: there is little reason to enrich a competitor if the advantages of doing so will diminish over time.
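A toy calculation shows why a growing leakage share erodes an architectural advantage. Everything here is an assumption made for illustration: the power figures are invented units rather than measurements of any real chip, and the model simply assumes two designs built on the same process leak the same amount.

```python
def total_power(active, leakage):
    """Total power is the sum of active (switching) and leakage power."""
    return active + leakage


def power_ratio(active_a, active_b, leakage):
    """Ratio of design B's total power to design A's, assuming both are
    built on the same process and therefore leak the same amount."""
    return total_power(active_b, leakage) / total_power(active_a, leakage)


# Hypothetical designs: B burns twice the active power of A.
ACTIVE_A = 1.0
ACTIVE_B = 2.0

ratio_no_leakage = power_ratio(ACTIVE_A, ACTIVE_B, leakage=0.0)
ratio_small_leakage = power_ratio(ACTIVE_A, ACTIVE_B, leakage=0.1)
ratio_large_leakage = power_ratio(ACTIVE_A, ACTIVE_B, leakage=2.0)
```

With no leakage, design B consumes exactly twice the power of design A. As the shared leakage term grows, the ratio moves toward 1, and whoever can lower the leakage term through process technology gains the larger share of the benefit.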

There is still room for clever design, of course. To reduce active power consumption, processor designs have long stopped the clock to unused portions of the CPU. To reduce leakage current, AMD is taking the next step to actually remove the power supply to those portions of the CPU. For ARM, that design choice makes even more sense. ARM has no control over the fab, so their designs have to minimize assumptions about the underlying silicon technology.

Right now ARM reigns supreme in the mobile space, but the strengths which gave it an advantage over x86 are rapidly becoming less compelling. Having to compete directly on silicon process sophistication moves the game onto Intel's turf, which Intel is happy to capitalize on with its Medfield platform. It's a great time to be in the mobile space.

Saturday, July 24, 2010

WWMD?

When Apple announced its A4 ARM CPU, I speculated about what would be in it. This speculation turned out to be completely wrong, but it was fun to write and engendered some good conversations about the possibilities. Now that Microsoft has signed an architecture license for ARM, I'm going to do it again. This article is complete speculation, and therefore rubbish. I am reliant on the same public sources of information as everyone else. Here we go.

What Will Microsoft Do?

Many companies license core designs from ARM, building them into chips with peripherals to add functionality. The various levels of ARM license offer both synthesized gate-level netlists and encrypted synthesizable RTL. An architectural license is much more extensive, allowing development of entirely new implementations of the ARM instruction set. Only the architectural license conveys the unencrypted, modifiable source code for the processor design. According to news reports, three other companies currently have an architectural license: Qualcomm, Marvell Semiconductor and Infineon Technologies. Qualcomm develops the Snapdragon, a line of ARM CPUs with integrated DSP and various mobile-related features. Marvell now owns the license originally used by DEC for development of the StrongARM, and under which Intel later produced the XScale. Infineon entered into the licensing agreement in late 2009, and will focus on security applications. Infineon is one of the few companies which embeds DRAM cells into the same die as the logic, and presumably will use this capability for HD SIM cards and other applications.

More notable than the list of architecture licensees is the list of companies which do not license the architecture. Samsung, which produces vast numbers of ARM CPUs, is content to work with ARM cores. Apple uses a Cortex A8 core in the A4, and does not have an architectural license either. Neither do TI, Cirrus Logic, or Atmel. Most companies use ARM core designs, and spend their efforts on the logic surrounding the CPU.

I tend to agree with The Register's take on it: Microsoft's ARM license is all about the XBox. Windows Mobile phones and the Zune use ARM, but there is little justification in producing their own chip for these markets. The XBox is the one hardware product which Microsoft produces itself, which can gain a competitive advantage via a unique CPU design, and which sells in large enough volume to be worth it. Microsoft already employs a great deal of custom silicon from suppliers in the product, such as the XCGPU used in the XBox 360-S. This chip combines the main PowerPC CPU and the ATI GPU onto a single die, with a second DRAM die incorporated into the package.

It's the Architecture, Stupid

So what might Microsoft do? I'll speculate that they won't design their own entirely new pipeline; the return on investment seems slim compared to other things they could spend time on. It's more likely they'd start from an existing ARM core and begin making changes. Microsoft will certainly integrate a powerful GPU onto the processor die; not doing so would be a step backwards from the existing XBox 360-S. I'll speculate they will tightly couple the GPU, allowing very low latency access to it as an ARM coprocessor in addition to the more straightforward memory mapped device. This is not unique: some of the on-chip XScale functional units can be accessed both as coprocessors for low latency and as memory mapped registers to get to the complete functionality of the unit. Having very low latency access to the GPU would allow efficient offloading of even small chunks of processing to GPU threads.

Yet even ARM coprocessors can be designed without needing an architectural license. TI implements its DaVinci DSP as a coprocessor, and Cirrus Logic had its own Maverick Crunch FPU. Neither company is an architectural licensee. So why would Microsoft feel it needs one?

One possibility is to let the GPU directly access the ARM processor cache and registers. This would allow GPU offloading to work almost exactly like a function call, putting arguments into registers or onto the stack with a coprocessor instruction to dispatch the GPU. When the GPU finishes, the ARM returns from the function call. For operations where the GPU is dramatically better suited, the ARM CPU would spend less time stalled than it would take to compute the result itself. If the ARM CPU supported hardware threads, it could switch to a different register file and run some other task while the GPU is crunching.

Part of the success of the XBox is due to its straightforward programming model compared to the Sony PS3. XBox has a fast SMP CPU paired with a GPU, where PS3 has an unruly gaggle of Cell processors to be managed explicitly. XBox cannot rely on the individual cores getting faster, as single core performance has leveled off due to power dissipation constraints. XBox has to make it easy for game developers to take advantage of more cores. Tightly coupling the GPU threads so they can function more like one big SMP system is one avenue to do this.

Wrapping up

I'll say it again: I made this all up. I have no insight into the specifics of Microsoft's intentions, just speculation. In the unlikely event that anyone reads this, don't copy it into Wikipedia as though it were verified information.

Thursday, July 8, 2010

Virtual Instruction Sets: Opcode Arguments

[Figure: It's a virtual CPU fan, get it?]

Virtual Machine architectures are a fascinating topic, and one that I plan to occasionally explore in this space. Not virtual machines in the sense of VMware or Xen, rather the runtime environment for a programming language like Java or Python. This time we'll focus on the structure of the instruction set, in particular on how operands are passed and stored. Why are these low level details important?

Traditional compilers emit instruction sequences without knowing anything about the specific CPU model, system configuration, or input data to be processed.
The compiler can optimize for a specific CPU pipeline, and maybe even produce multiple binaries for different CPUs. As a practical matter you cannot produce a large number of variations due to the sheer size of the final binary image.
Profile-driven compilation can optimize for representative data you supply during the build phase, but representative data is always a guess and a compromise. Also as a practical matter, it's difficult to use profile-driven optimization for many applications, such as GUIs.
Only a JIT for a virtual machine has the luxury of knowing the specific CPU and system configuration, and of having profiling information from the current input data.

The hardware CPU architectures we use now have evolved in lockstep with compiler technology, and mostly C/C++ compilers at that. They have enormous I$ and D$ because the compiler cannot predict very much about what it will execute next. The hardware has extensive branch prediction logic and history tracking because compilers emit an average of one branch every 7 instructions.

Virtual Machines change everything: by profiling the running code they can produce instructions for this specific workload, resulting in long sequences of very predictable opcodes without branches or conditionals. It has the potential to change hardware architectures, once we pass a tipping point where most of the workload runs within a VM. I suspect this tipping point will be reached in mobile devices well before it impacts workstations, laptops, or servers.

This rosy prediction is by no means certain. The JIT for most current VMs will compile a function the first time it is used. They can optimize for the CPU and possibly even take memory size into account, but they don't use any profiling information. Thus the JIT can potentially get the benefit of compiling for the specific CPU pipeline on which it runs, though in practice even this isn't typically done. So far as I know, of the VMs discussed here only Mozilla's SpiderMonkey makes use of tracing to produce specifically optimized routines according to the input data being processed.

We're going to examine seven virtual machines, focussing on how operands are passed: the JVM, CLR, Spidermonkey, LLVM, Parrot, V8, and Dalvik.

JVM & CLR

[Figure: JVM argument stack]

The Java Virtual Machine and the Common Language Runtime used by .Net are certainly very different, but as virtual machines go they have a lot in common. Both are stack based: operands to an instruction are popped from the stack, and the result is pushed.

Stack based virtual machines are relatively common, because they are conceptually very simple. Indeed many early microprocessors and microcontrollers were stack based, because the silicon technology of the day wouldn't allow a CPU with a generous number of registers on the die. In that sense virtual CPUs are following the same evolutionary path as hardware CPUs did several decades ago, starting with stack based machines and adding registers later.

Stack-based instruction sets tend to have a very high code density, because their opcodes don't need to encode source and destination register numbers. When the JVM was developed in the early 1990s, processor caches were measured in the tens of kilobytes. A densely packed bytecode was a big advantage: far more bytecode could be stored in the hardware CPU's data cache.
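A toy interpreter makes the operand handling easier to see. This is a minimal sketch with made-up opcodes ("push", "add", "mul"), not real JVM or CLR bytecode. Note that the arithmetic instructions carry no operand fields at all, which is where the code density comes from.

```python
def run_stack_program(program):
    """Interpret a tiny, hypothetical stack-based bytecode.

    program is a list of tuples: ("push", value), ("add",) or ("mul",).
    Returns the value left on top of the stack.
    """
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(instr[1])
        elif op == "add":
            # Operands are popped from the stack, the result is pushed.
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return stack.pop()


# (2 + 3) * 4
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
result = run_stack_program(program)   # 20
```

A register based encoding of the same expression would need each arithmetic instruction to name two sources and a destination.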

Spidermonkey (Firefox)

SpiderMonkey is the JavaScript engine in Firefox, and is a stack-based machine like the JVM and CLR. What I find most interesting about SpiderMonkey is that it tackled profile-driven JIT optimization first, via TraceMonkey in the latter part of 2008. A more conventional method-compiling JIT came later, via JaegerMonkey in early 2010. The virtue of doing things in this order is pretty compelling: tracing, when it works, can deliver spectacular gains. However, tracing really only helps with loops, leaving lots of low hanging fruit for a method-based JIT. Doing the method-based JIT first makes it more difficult to get the profiling information which tracing needs. By doing TraceMonkey first, its instrumentation needs became part of the requirements for JaegerMonkey.

LLVM

The primary design point of the LLVM project is a compiler toolchain, and the LLVM instruction set was designed to be the intermediate representation between the language-specific frontend and more generic backend. The LLVM instruction set defines a register based virtual machine with an interesting twist: it has an infinite number of registers. In keeping with its design point as a compiler intermediate representation, LLVM registers enable static single assignment form. A register is used for exactly one value and never reassigned, making it easy for subsequent processing to determine whether values are live or can be eliminated.
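The single assignment idea can be illustrated with a small renaming pass. This sketch handles only straight-line code, so it never needs the phi nodes a real SSA construction uses at control flow joins, and its tuple format and `name.N` naming are invented for the example rather than taken from LLVM.

```python
def to_ssa(statements):
    """Rename straight-line assignments so each register is written once.

    statements is a list of (dest, op, src1, src2) tuples. Sources are
    either variable names (str) or integer constants.
    """
    version = {}   # variable name -> most recent version number
    out = []

    def use(operand):
        # Constants pass through; variables refer to their latest version.
        # Version 0 stands for a value defined before this code runs.
        if isinstance(operand, str):
            return "%s.%d" % (operand, version.get(operand, 0))
        return operand

    for dest, op, a, b in statements:
        a_ssa, b_ssa = use(a), use(b)          # read sources first
        version[dest] = version.get(dest, 0) + 1
        out.append(("%s.%d" % (dest, version[dest]), op, a_ssa, b_ssa))
    return out


code = [
    ("x", "add", 1, 2),
    ("x", "mul", "x", 3),     # reassigns x
    ("y", "add", "x", "x"),
]
ssa = to_ssa(code)
```

After renaming, the second write to `x` becomes a fresh register `x.2`, and it is immediately visible that `x.1` has exactly one use.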

Parrot

Parrot is also a register based virtual machine. It defines four types of registers:

Integers
Numbers (i.e. floating point)
Strings
Polymorphic Containers (PMCs), which reference complex types and structures

Like LLVM, Parrot does not define a maximum number of registers: each function uses as many registers as it needs. Functions do not re-use registers for different purposes by storing their values to memory; they specify a new register number instead. The Parrot runtime will handle assignment of virtual machine registers to CPU registers.

So far as I can tell, integer registers are the width of the host CPU on which the VM is running. A Parrot bytecode might find itself using either 32 or 64 bit integer registers, determined at runtime and not compile time. This is fascinating if correct, though it seems like BigNum handling would be somewhat complicated by this.

V8 (Chrome)

V8 is the JavaScript engine in the Chrome browser from Google. It's a bit of a misnomer to call V8 a virtual machine: it compiles the JavaScript source for a method directly to machine code the first time it is executed. There is no intermediate bytecode, and no interpreter. This is an interesting design choice, but for the purposes of this article there isn't much to say about V8.

Dalvik (Android)

Dalvik is the virtual machine for Android application code. The Dalvik instruction set implements an interesting compromise: it is register based, but there is a finite number of them as opposed to the theoretically infinite registers of LLVM or Parrot. Dalvik supports 65,536 registers, a vast number compared to hardware CPUs and presumably sufficient to implement SSA (if desired) in reasonably large functions.

Even more interestingly, not all Dalvik instructions can access all registers. Many Dalvik instructions dedicate 4 bits to the register number, requiring their operands to be stored in the first 16 registers. A few more instructions have an 8 bit register number, to access the first 256. There are also instructions to copy the value to or from any of the 65,536 registers to a low register, for a subsequent instruction to access.

It took a while to understand the rationale for this choice, and I'm still not confident I fully get it. Clearly the Dalvik designers believe that keeping data in one of the high registers will be faster than explicitly storing it to memory, even if the vast number of registers end up mostly residing in RAM. Addressing data as register numbers instead of memory addresses should make it easier for the VM to dynamically remap Dalvik registers to the real hardware registers. For example, if it can predict that virtual register 257 will likely be used in the near future it can be kept in a CPU register instead of being immediately stored to memory.
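
The register-field constraint can be sketched as follows. The helper names and the `move` pseudo-mnemonic are hypothetical, chosen for this illustration; only the 4, 8, and 16 bit field widths come from the description above.

```python
def smallest_field_bits(reg):
    """Return the narrowest register field (4, 8, or 16 bits) able to name reg."""
    if not 0 <= reg < 65536:
        raise ValueError("register number must fit in 16 bits")
    if reg < 16:
        return 4
    if reg < 256:
        return 8
    return 16


def lower_operand(reg, field_bits, scratch):
    """Make reg usable by an instruction whose register field is field_bits wide.

    Returns (extra_pseudo_instructions, register_to_encode). If reg does not
    fit in the field, it is first copied into the low register `scratch`.
    """
    if reg < (1 << field_bits):
        return [], reg
    return [f"move v{scratch}, v{reg}"], scratch


# Virtual register 257 cannot be named by a 4 bit field, so a copy is needed.
extra, encoded = lower_operand(257, field_bits=4, scratch=0)
```

The copy costs an extra instruction, but it names a register rather than a memory address, which is the property the paragraph above speculates the VM can exploit.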

Other VMs

There are many, many more virtual machine implementations beyond the ones described here. The Python, Smalltalk, and Lua programming languages each have their own VM instruction set and implementation. Erlang started with a VM called JAM, and later reimplemented the underpinnings in a new virtual machine called BEAM. Adobe Flash has a VM which has been open sourced and donated to the Mozilla project as Tamarin. Wikipedia lists brief descriptions of a number of current VMs.

Thursday, July 1, 2010

Voyager 2 Soft Error

Voyager 2 is currently traversing the heliosphere, the shockwave where the solar wind and particles of the interstellar medium meet. Decades after launch the Voyager probes are still transmitting data back to Earth, at 160 bits per second. Starting April 22, 2010, the checksums in the Voyager 2 datastream indicated problems in every frame. Something bad had happened to the spacecraft.

After several weeks of testing, JPL determined the root cause to be a flipped bit in the computer's memory. A bit had spontaneously changed from zero to one. JPL duplicated the incorrect checksum symptom by deliberately flipping that bit on a terrestrial copy of the Voyager computer system. On May 19, 2010 a command was sent to reset the bit to zero, and the next day (after a 13 hour propagation delay each way) Voyager 2's datastream returned to normal.

It appears to have been a soft error. What is most interesting about this is the technology involved: the Voyager computer systems use magnetic core memory. On Earth, soft errors in core memory of this vintage are essentially impossible. The amount of energy required to flip the bit is so large that any particle with sufficient charge would have been deflected by the Earth's magnetic field. Out at the heliosphere, particles carrying sufficient energy to affect the core memory are apparently present.

Software engineers at JPL are currently working on a software patch to stop using that bit in the core. It is possible that the problem wasn't purely random, that instead this particular bit of hardware is degrading and no longer holding its state reliably.

Thursday, April 8, 2010

Simple Checksums Considered Harmful

Let's talk about iSCSI for a moment, as a launching point for a discussion about data integrity. iSCSI relies on CRC32 to catch data corruption. CRC32 is a good fit for this purpose, but most previous uses of it had been confined to very low levels of the system and implemented in hardware. iSCSI uses CRC32 way up in the protocol header, where it is generally computed in software. The overhead of computing the CRC is one reason why so many hardware offload adaptors were developed for iSCSI.

Intel recently released a whitepaper describing how they achieved 1 million iSCSI operations per second. One fascinating tidbit is that the CRC32 is no longer a bottleneck. The Nehalem architecture includes an instruction to compute it directly, as part of SSE 4.2. The new instruction is described in the Intel64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M. It is on page 3-221 of the December 2009 edition; search for CRC32 in later editions.

CRC32 r32, r/m8	Accumulate CRC32 on r/m8
CRC32 r32, r/m16	Accumulate CRC32 on r/m16
CRC32 r32, r/m32	Accumulate CRC32 on r/m32
CRC32 r64, r/m8	Accumulate CRC32 on r/m8
CRC32 r64, r/m64	Accumulate CRC32 on r/m64

That's it. You load words from memory and hand them to the CRC32 instruction. If you were already making a pass over the data for any reason, the CRC calculation is free. Table-driven CRC generation implementations were already fast, but this is even faster.
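
For comparison, here is a minimal sketch of the table-driven software approach. It uses the Castagnoli polynomial (CRC32C), which is the variant iSCSI specifies and the one the SSE 4.2 instruction computes. The function names are my own, and the byte-at-a-time loop shows the technique rather than a tuned implementation.

```python
# Reflected form of the Castagnoli polynomial used by CRC32C.
CRC32C_POLY_REFLECTED = 0x82F63B78


def _make_table():
    """Precompute the CRC of every possible byte value, one bit at a time."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_update(crc, data):
    """Fold more bytes into a running CRC, one table lookup per byte."""
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc


def crc32c(data):
    """CRC32C of a whole buffer, with the usual initial value and final inversion."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


# The running value can be carried across chunks, much as the hardware
# instruction accumulates into its destination register.
running = crc32c_update(0xFFFFFFFF, b"1234")
running = crc32c_update(running, b"56789")
chunked_result = running ^ 0xFFFFFFFF
```

The standard check value for CRC32C over the ASCII string "123456789" is 0xE3069283, which this sketch reproduces whether the data is fed whole or in chunks.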

What does this mean? I think it means weak checksums should no longer be used for anything. Applications which care about data integrity moved to MD5 or SHA1 years ago, but you still see specifications in other contexts written to use Adler-32 or even the venerable 16-bit TCP checksum. It's not appropriate to use these any more. Server CPUs can compute CRC32 for free, and embedded CPUs have long included CRC32 calculation in DMA engines.
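
One way to see the weakness of the 16-bit TCP checksum: it is a one's complement sum of 16-bit words, so it cannot notice words arriving in the wrong order. The sketch below follows the RFC 1071 style of computation and compares it with `zlib.crc32` (the IEEE polynomial, not CRC32C, but the point is the same). The sample bytes are arbitrary.

```python
import zlib


def internet_checksum(data):
    """16-bit one's complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = bytes.fromhex("12345678")
swapped = bytes.fromhex("56781234")  # same two 16-bit words, reordered

same_sum = internet_checksum(original) == internet_checksum(swapped)
same_crc = zlib.crc32(original) == zlib.crc32(swapped)
```

Addition is commutative, so `same_sum` is True and the corruption goes unnoticed. The CRC depends on the position of every bit, so `same_crc` is False.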

Friday, February 5, 2010

Apple == A, Plus Four More Letters

What the web needs right now is another blog post about the iPad.

No, don't run away! This will be different, I promise. We'll focus on Apple's A4, a custom CPU first used in the iPad. It has been widely assumed that A4 uses a licensed ARM Cortex A8 core. I have no reason to dispute this assertion; it seems like a fine choice. It has also been asserted that because the same core is used in parts from Samsung, TI, and Qualcomm, Apple should not have bothered making its own chip. Today, Gentle Reader, we'll explore that notion a bit.

Apple ASIC Expertise

Apple has a long history of ASIC design. Apple produced custom silicon for various Macintosh models since at least the late 1980s, when they designed the audio chips used in the Quadra 700 and 900 (a chip called "Batman"). Later, Apple designed entire chipsets to interface with the PowerPC 60x bus. Apple licensed a gigabit Ethernet MAC design from Sun, and used it plus IDE controller and other peripherals in chipsets for several Powermac models. With the switch to x86, Apple's efforts became much more constrained. The x86 bus interface is difficult to license, and Intel's own chipsets are quite reasonable. So far as I know, x86 Macintoshes no longer use custom Apple ASICs.

Custom chip design isn't a radical departure for Apple.

With that as background, what might Apple have done in the A4 chip? I have absolutely no inside information about the iPad or A4 processor; I'm going to make stuff up because it's fun to speculate.

Graphics & OpenCL

Apple holds a nearly 10% stake in Imagination Technologies Group, which designs the PowerVR graphics accelerator and other IP relating to massively threaded processing. Apple uses their PowerVR SGX 535 in iPhone 3GS, and used various PowerVR graphics in earlier iPhone models. The A4 chip will certainly integrate a graphics core from PowerVR. As with essentially all GPU designs today, the PowerVR makes use of multiple, specialized CPU cores. There is relatively little information about its instruction set on the web, ~~only that it is called META MTX and uses 16 bit RISC-ish instruction words~~. Update: PowerVR SGX does not use the META architecture, it has a distinct architecture of its own. Additional information can be downloaded after registration.

Apple has also invested heavily in two relevant technologies: OpenCL and LLVM.

OpenCL allows processing to be distributed across multiple CPUs in the system, even if they have different instruction sets. OpenCL algorithms are written in a language with syntax very similar to C99, and the framework handles the rest.
LLVM is a compiler toolkit, one aspect of which is a machine independent instruction set. Source code can be compiled to the LLVM virtual machine, and from there be translated into the equivalent opcodes for the target CPU. The compilation can be done statically before running it, or by a Just-In-Time compiler while interpreting the LLVM bytecodes.

iPhone applications are compiled to ARM instructions, but it is not much of a stretch to imagine support for sections of LLVM bytecodes as well. If the hardware has sufficient GPU power, the bytecode could be translated to the GPU instruction set and offloaded. Devices with less sophisticated GPUs would use the ARM instead. Apple does not allow iPhone apps to include their own virtual machine in this way, but would be free to provide the VM function as part of the OS.

I suspect this is the most compelling reason for Apple to build its own chip as opposed to buying off the shelf. The rest of the mobile industry is satisfied to offload 3D graphics and video decoding to the GPU. Apple has greater ambitions, and could make use of significantly more GPU pipelines. By controlling the complete platform from CPU to software, Apple can make tradeoffs which are not practical for the rest of the market. For example: a very large GPU plus very fast ARM would generate more heat than can be dissipated in a small form factor like a phone. Apple has the option to dynamically throttle the ARM clock speed in order to open up more thermal envelope for the GPUs, if sufficient OpenCL workload is ready to run. When the GPUs are less busy, the ARM clock speed can be brought back up.

Multi Package Modules

The CPU in the iPhone 3GS is a Samsung S5PC100. This is a multi package module with CPU, I/O chip, and SDRAM sandwiched tightly together. Multi chip modules have been around for a long time, where multiple dies are wired together in one big package. The amount of testing which can be done on a raw die is rather limited, so MCM yields suffer as one bad die ruins the whole assembly.

Multi package modules are relatively new: each chip is in its own package, but the packages use very tight pin spacings and do not have a heat spreader. They are soldered together on a small PCB, which in turn has a Ball Grid Array on the bottom with normal pin spacings. Because each chip is packaged separately, a full suite of test vectors can be run before the final assembly is put together, improving the yield considerably and lowering the cost of the final product.

If we examine the main board of an iPhone 3GS, the largest component is not the Samsung processor - it is the Flash memory, an MPM containing a number of flash chips. The Samsung CPU in the iPhone 3GS comes in a close second in size, and is also a multi package module with CPU, I/O chip, and SDRAM.

With the A4, Apple will probably have one die containing both CPU and I/O. Samsung uses different I/O chips to tailor their offering to many market segments, which is not a goal for Apple. By arranging the pinout carefully, Apple might be able to make an MPM containing CPU, SDRAM, and Flash, reducing the total board area. Different Flash capacities could be offered by not stuffing portions of the MPM. The iPad itself might not need such an MPM as it is a much larger device, but future iPhones would benefit more.

To be clear: assembling an MPM is not something you can easily do when buying merchant silicon. The pins on one package have to be arranged so as to be easy to route to the pins on the other packages within the MPM.

Wrapping up

I'll say it again: I made this all up. I have no information on the specifics of the A4, just speculation. In the unlikely event that anyone reads this (instead of running away from yet another iPad blog post), don't copy it into Wikipedia as though it were verified information.

What about future iterations? It's tempting to consider a single chip containing the entire iPhone feature set, including radio and wireless networking. The A4 itself clearly doesn't do this, as GSM support is optional in the iPad. I suspect that even in future chips, Apple won't pull in the baseband radio. The front end portions of that chip are rather sensitive to noise, and generally don't work well when integrated in the corner of a gigantic ASIC. Also, integrating the radio functionality would make it that much harder to keep up with advancements in wireless networks.

Another future possibility is to use this chip in Apple's other small form factor products, like the Airport Base Stations, Time Capsule, and AppleTV. This is certainly possible, but aside from obvious additional peripherals like SATA I'm not sure it adds many requirements to the chip.

Other articles about A4 you might find interesting:

The New York Times writes about the history of the A4 design team.
Louis Gray writes about Apple's heavy recruiting push to staff up their ASIC team.
Wikipedia already has an article, which will improve over time as more details emerge.

P.S.: While we're at it, the title of this post is a guess about the origins of the "A4" nomenclature: "Apple" is a capital A followed by four more letters.

Friday, January 15, 2010

Intel Acquiring FPGA Vendor?

EE Times reports on a JP Morgan Analyst prediction that Intel will acquire an FPGA vendor. The purported reason: to expand its competitiveness in embedded systems and system-on-chip. The two obvious market leaders in that category are Altera and Xilinx, though there are several smaller vendors like Actel and Lattice as well.

The SoC angle is interesting, in terms of the disruptive change it might allow. Freescale carries a huge variety of part numbers with various combinations of PowerPC core plus networking, USB, CAN-BUS, encryption, etc. Some of the functionality is implemented via an independent communications processor (a 68k descendant) alongside the PowerPC, to try to make each chip more flexible and able to serve different markets. Nonetheless, it's still a very large collection of chips. Intel could be aiming for just a few different parts, with embedded FPGA blocks of various sizes and I/O pinouts. Need CAN-BUS? Buy a part with the right type of pins bonded out, with a license for soft logic IP to load into it. More sophisticated customers could load their own design logic into the FPGA blocks.

At one time Xilinx offered parts with hard logic PowerPC cores, but the CPU performance was modest and did not remain competitive over time. Xilinx and Altera both now emphasize soft logic CPU cores instead. These certainly work... but implementing a CPU in FPGA gates is an awfully expensive way to get your software to run. If Intel were to enter this space it would come from the opposite direction: a modern CPU core paired with a modest amount of FPGA logic.

EE Times published a subsequent rebuttal of the acquisition rumor, throwing around big numbers about the premium to be paid for Altera or Xilinx. The numbers make my head hurt, but it's worth a read if you're interested in the topic.

Friday, October 23, 2009

ARM Cortex A5

On October 21 ARM announced a new CPU design at the low end of their range, the Cortex A5. It is intended to replace the earlier ARM7, ARM9, and ARM11 parts. The A5 can have up to 4 cores, but given its positioning as the smallest ARM the single core variant will likely dominate.

To me the most interesting aspect of this announcement is that when the Cortex A5 ships (in 2011!) all ARM processors, regardless of price range, will have an MMU. The older ARM7TDMI relied on purely physical addressing, necessitating the use of either uClinux, vxWorks, or a similar RTOS. With the ARM Cortex processor family any application requiring a 32 bit CPU will have the standard Linux kernel as an option.

Story first noted at Ars Technica

Update

In the comments Dave Cason and Brooks Moses point out that I missed the ARM Cortex-M range of processors, which are considerably smaller than the Cortex-A. The Cortex-Ms do not have a full MMU, though some parts in the range have a memory protection unit. So it is not the case that all ARM processors will now have MMUs. Mea culpa.

Wednesday, October 14, 2009

AMD IOMMU: Missed Opportunity?

In 2007 AMD implemented an I/O MMU in their system architecture, which translates DMA addresses from peripheral devices to a different address on the system bus. There were several motivations for doing this:

Virtualization: DMA can be restricted to memory belonging to a single VM and to use the addresses from that VM, making it safe for a driver in that VM to take direct control of the device. This appears to be the largest motivation for adding the IOMMU.
High Memory support: For I/O buses using 32 bit addressing, system memory above the 4GB mark is inaccessible. This has typically been handled using bounce buffers, where the hardware DMAs into low memory which the software will then copy to its destination. An IOMMU allows devices to directly access any memory in the system, avoiding copies. There are a large number of PCI and PCI-X devices limited to 32 bit DMA addresses. Amazingly, a fair number of PCI Express devices are also limited to 32 bit addressing, probably because they repackage an older PCI design with a new interface.
Enable user space drivers: A user space application has no knowledge of physical addresses, making it impossible to program a DMA device directly. The I/O MMU can remap the DMA addresses to be the same as the user process, allowing direct control of the device. Only interrupts would still require kernel involvement.

I/O Device Latency

PCIe has a very high link bandwidth, making it easy to forget that its position in the system imposes several levels of bridging with correspondingly long latency to get to memory. The PCIe transaction first traverses the Northbridge and any internal switching or bus bridging it contains, on its way to the processor interconnect. The interconnect is HyperTransport for AMD CPUs, and QuickPath for Intel. Depending on the platform, the transaction might have to travel through multiple CPUs before it reaches its destination memory controller, where it can finally access its data. A PCIe Read transaction must then wend its way back through the same path to return the requested data.

Much lower latency comes from sitting directly on the processor bus, and there have been systems where I/O devices sit directly beside CPUs. However CPU architectures rev that bus more often than it is practical to redesign a horde of peripherals. Attempts to place I/O devices on the CPU bus generally result in a requirement to maintain the "old" CPU bus as an I/O interface on the side of the next system chipset, to retain the expensive peripherals of the previous generation.

The Missed Opportunity: DMA Read pipelining

An IOMMU is not a new concept. Sun SPARC, some SGI MIPS systems, and Intel's Itanium all employ them. Once you have taken the plunge to impose an address lookup between a DMA device and the rest of the system, there are other interesting things you can do in addition to remapping. For example, you can allow the mapping to specify additional attributes for the memory region. Knowing whether a device is likely to issue long, contiguous bursts or short, concise updates allows optimizations: reading ahead to reduce latency, and pipelining to transfer data faster.

[Figure: DMA reads without prefetch (CONSISTENT) versus with prefetch (STREAMING)]

AMD's IOMMU includes nothing like this. Presumably they wanted to confine the software changes to the Hypervisor alone, whilst choosing STREAMING versus CONSISTENT requires support in the driver of the device initiating DMA, but they could have ensured software compatibility by making CONSISTENT be the default with STREAMING only used by drivers which choose to implement it.
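
A back-of-the-envelope sketch of why read-ahead matters for a STREAMING style mapping. The model is deliberately simplistic: it assumes ideal pipelining, where the round trip latency is paid once and later cache lines arrive back to back. The function name and the time units are hypothetical.

```python
def dma_read_time(lines, latency, transfer, prefetch):
    """Time to read `lines` contiguous cache lines, in arbitrary time units.

    latency  - round trip from the device to the memory controller and back
    transfer - time to move one cache line once its data is available
    prefetch - True models read-ahead (STREAMING), False models one
               request at a time (CONSISTENT)
    """
    if lines == 0:
        return 0
    if prefetch:
        # Fetches overlap: the latency is paid once, then data streams.
        return latency + lines * transfer
    # Each line waits for the previous one to complete before being requested.
    return lines * (latency + transfer)


# Assumed numbers: latency ten times the per-line transfer time.
without_prefetch = dma_read_time(8, latency=100, transfer=10, prefetch=False)
with_prefetch = dma_read_time(8, latency=100, transfer=10, prefetch=True)
```

With these made-up numbers the eight line burst takes 880 units without read-ahead and 180 with it. The longer the path to memory, the larger the gap.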

What About Writes?

The IOMMU in SPARC systems implemented additional support for DMA write operations. Writing less than a cache line is inefficient, as the I/O controller has to fetch the entire line from memory and merge the changes before writing it back. This was a problem for Sun, which had a largish number of existing SBus devices issuing 16 or 32 byte writes while the SPARC cache line had grown to 64 bytes. A STREAMING mapping relaxed the requirement for instantaneous consistency: if a burst wrote the first part of a cache line, the I/O controller was allowed to buffer it in hopes that subsequent DMA operations would fill in the rest of the line. This is an idea whose time has come... and gone. The PCI spec takes great care to emphasize cache line sized writes using Memory Write and Invalidate (MWI), an emphasis which carries over to PCIe as well. There is little reason now to design coalescing hardware to optimize sub-cacheline writes.

[Figure: DMA writes without buffering (CONSISTENT) versus with buffering (STREAMING)]
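
The write coalescing idea can be sketched by counting memory operations. This is a toy model with assumed costs: a full line write costs one operation, a partial line write costs two (fetch the line, then write the merged result). The function is hypothetical and does not describe any particular I/O controller.

```python
def memory_ops(writes, line_size=64, coalesce=False):
    """Count memory operations needed to service a list of DMA writes.

    writes is a list of (address, length) pairs, each contained within a
    single cache line.
    """
    ops = 0
    pending = {}  # line index -> set of byte offsets buffered so far
    for addr, length in writes:
        line, offset = divmod(addr, line_size)
        if offset + length > line_size:
            raise ValueError("write crosses a cache line boundary")
        if not coalesce:
            # CONSISTENT: every write is pushed to memory immediately.
            ops += 1 if length == line_size else 2
            continue
        # STREAMING: buffer the bytes, hoping the rest of the line arrives.
        covered = pending.setdefault(line, set())
        covered.update(range(offset, offset + length))
        if len(covered) == line_size:
            ops += 1  # whole line assembled, one write and no fetch
            del pending[line]
    # Lines still partially filled at the end need a read-modify-write.
    ops += 2 * len(pending)
    return ops


# Four 16 byte writes which together fill one 64 byte line.
burst = [(0, 16), (16, 16), (32, 16), (48, 16)]
ops_consistent = memory_ops(burst)
ops_streaming = memory_ops(burst, coalesce=True)
```

Under these assumptions the burst costs eight memory operations when each write is handled on its own, and one when the controller is allowed to buffer.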

Closing Disclaimer

Maybe I'm way off base in lamenting the lack of DMA read pipelining. Maybe all relevant PCIe devices always issue Memory Read Multiple requests for huge chunks of data, and the chipset already pipelines data fetch during such large transactions. Maybe. I doubt it, but maybe...

Thursday, October 1, 2009

Yield to Your Multicore Overlords

ASIC design is all about juggling multiple competing requirements. You want to make the chip competitive by increasing its capabilities, by reducing its price, or both. Today we'll focus on the second half of that tradeoff, reducing the price.

Chip fabrication is a statistical game: of the parts coming off the fab, some percentage simply do not work. The vendor runs test vectors against the chips, and throws away the ones which fail. The fraction which pass is called the yield, and it is the primary factor determining the cost of the chip. If the yield is bad, you have to fab a whole bunch of chips to get one that actually works, and you have to charge more for that one working chip.

To illustrate why most chips coming out of the fab do not work, I'd like to walk through part of manufacturing a chip. This information is from 1995 or so, when I was last seriously involved in a chip design, and describes a 0.8 micron process. So it is completely old and busted, but is sufficient for our purposes here.

Begin by placing the silicon wafer in a nitrogen atmosphere. You deposit a photo-resist on the wafer, basically a goo which hardens when exposed to ultraviolet light. You place a shadow mask in front of a light source; the regions exposed to light will harden while those under the shadow mask remain soft. You then chemically etch off the soft regions of the photo-resist, leaving exposed silicon where they were. The hardened regions of photo-resist stay put.

Next you heat the wafer to 400 degrees and pipe phosphorus into the nitrogen atmosphere. As the P atoms heat they begin moving faster and bouncing off the walls of the chamber. Some of them move fast enough that when they strike the surface of the wafer they break the Si crystal lattice and embed themselves in the silicon. If they strike the hardened photo-resist, they embed themselves in the resist; very, very few are moving fast enough to crash all the way through the photo-resist into the silicon underneath.

[Image: electron microscopy image of a silicon die]

Next you use a different chemical process to strip off the hardened photo-resist. You are left with a wafer which has phosphorus embedded in the places you wanted. Now you heat the wafer even higher, hot enough that the silicon atoms can move around more freely; they move back into position and reform the crystal lattice, burying the phosphorus atoms embedded within. This is called annealing. Phosphorus is an electron donor, so now you have the n-type regions of the transistors.

You repeat this process with aluminum ions, an electron acceptor, to get the p-type regions. Now you have transistors. Next you connect the transistors together with traces of aluminum (which I won't go into here). You cut the wafer to separate the die, and place each die in a package. You connect bonding wires from the edge of the die to the pins of the chip. And voila, you're done.

It should be apparent that this is a probabilistic process. Sometimes, based purely on random chance, not enough phosphorus atoms embed themselves into the silicon and your transistors don't turn on. Sometimes too much phosphorus embeds and your transistors won't turn off. Sometimes the Si lattice is too badly damaged and the annealing is ineffective. Sometimes the metal doesn't line up with the vias. Sometimes a dust particle lands on the chip and you deposit metal on top of the dust mote. Sometimes the bonding doesn't line up with the pads. Etc etc.

This is why the larger a chip grows, the more expensive it becomes. It's not because raw silicon wafers are particularly costly, it's that the probability of there being a defect somewhere grows ever greater as the die becomes larger. The bigger the chip, the lower the yield of functional parts.
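
The relationship between die size and yield is often approximated with a simple Poisson model: if defects land independently at some average density, the chance a die has none falls off exponentially with its area. The sketch below uses that textbook approximation. The defect density is a made-up number, not data from any real process.

```python
import math


def poisson_yield(area_cm2, defects_per_cm2):
    """Fraction of dies expected to have zero defects under a Poisson model."""
    return math.exp(-area_cm2 * defects_per_cm2)


def relative_cost_per_good_die(area_cm2, defects_per_cm2):
    """Silicon area spent per working die: die area divided by yield."""
    return area_cm2 / poisson_yield(area_cm2, defects_per_cm2)


# Assumed defect density of 0.5 per square centimeter.
small_die_yield = poisson_yield(1.0, 0.5)
large_die_yield = poisson_yield(2.0, 0.5)
cost_ratio = relative_cost_per_good_die(2.0, 0.5) / relative_cost_per_good_die(1.0, 0.5)
```

In this model doubling the die area more than doubles the cost of a good die, because the larger die both uses more silicon and fails more often.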

For at least 15 years that I know of, chip designs have improved their yield using redundancy. The earliest such efforts were done in on-chip memory: if your chip is supposed to include N banks of SRAM, put N+1 banks on the die connected with wires in the top most layer which can be cut using a laser. The SRAM occupies a large percentage of the total chip area, so statistically it is likely that defects will be within the SRAM. You can then cut out the defective bank, and turn a defective chip into one that can be used. More recent silicon processes use fuses blown by the test fixture instead of lasers.

Massively Multicore Processors

Yesterday NVIDIA announced Fermi, a beast of a chip with 512 CUDA GPU cores. They are arranged in 16 blocks of 32 cores each. At this kind of scale, I suspect it makes sense to include extra cores in the design to improve the yield. For example, perhaps each block actually has 33 cores in the silicon so that a defective core can be tolerated.

In order to avoid having weird performance variations in the product, the extra resources are generally inaccessible if not used for yield improvement. That is, even though the chip might have 528 cores physically present, no more than 512 could ever be used.
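
To get a feel for how much one spare core per block could help, here is a sketch using a binomial model. It assumes each core fails independently with the same probability and ignores defects outside the cores. The 99% per-core figure is invented for illustration, and the 33 core block is the speculation from above, not a known detail of Fermi.

```python
from math import comb


def block_yield(cores_needed, cores_built, p_core_good):
    """Probability that at least cores_needed of cores_built cores work."""
    return sum(
        comb(cores_built, k)
        * p_core_good ** k
        * (1 - p_core_good) ** (cores_built - k)
        for k in range(cores_needed, cores_built + 1)
    )


def chip_yield(blocks, cores_needed, cores_built, p_core_good):
    """Every block must independently deliver its required core count."""
    return block_yield(cores_needed, cores_built, p_core_good) ** blocks


# 16 blocks of 32 usable cores, with and without one spare per block.
no_spares = chip_yield(16, 32, 32, p_core_good=0.99)
one_spare = chip_yield(16, 32, 33, p_core_good=0.99)
```

Under these assumptions the spare core lifts the yield of a single block from roughly 72% to roughly 96%, and the effect compounds across 16 blocks.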

Thursday, September 17, 2009

Jasper Forest x86

Intel has a long but uneven history in the embedded market. In the early days of the personal computer Intel released the 80286 as a follow-on to the original 8086. There actually was an 80186: it was a more integrated version of the 8086 aimed at embedded applications. Intel's interest in embedded markets has waxed and waned over the years, but it is an area where Intel still has room for significant growth.

I wrote about x86 for embedded use about a year and a half ago, with four main points:

Volume Discounts: PC pricing thresholds at 50,000 units have to be rethought for a less homogenous market
System on Chip (SoC): Board space is at a premium, we need fewer components in the system
Production lifetime: These systems are not redesigned every few months, chips have to remain in production longer
Power and heat: Airflow is more constrained, and the system has other heat generating components besides the CPU complex

At the Intel Developer Forum next week Intel is expected to focus on embedded applications for its products. In advance of IDF Intel announced the Jasper Forest CPU, a System on Chip version of Nehalem. It is based on a 1, 2, or 4 core CPU plus an integrated PCIe controller, so it does not need a separate northbridge chip. Intel also committed to a 7 year production lifetime, allowing the part to be designed into products which will remain on the market for a while. I'd speculate that Intel will offer industrial temperature grade parts as well, perhaps at lower frequencies.

Jasper Forest is particularly suited for and aimed at storage applications. It has additional hardware for RAID support (presumably XOR & ECC generation), and a feature to use main memory as a nonvolatile buffer cache. When loss of power is detected the chip will flush any pending writes out to RAM and then set the DRAM to self-refresh before shutting down. By including a battery sufficient to power the DRAM, the system can avoid the need for a separate nonvolatile data buffer like SRAM.

This is a good approach for Intel: target silicon at specific high margin, growing application areas. Go for markets with moderate power consumption requirements, as x86 is clearly not ready for small battery powered applications like phones. Ars Technica discusses Intel's upcoming weapon for getting into mobile and other battery powered markets, a version of their 32nm process which reduces leakage current to almost nothing. An idle x86 would consume essentially no power, which would be huge.

Tuesday, September 15, 2009

Soft Errors Are Hard Problems

"Soft Error" is a euphemism in the semiconductor industry for "the silicon did the wrong thing." Soft errors can occur when a circuit is infused with a sudden burst of energy from an external source, for example when it is hit by a high energy subatomic particle or by radiation.

Alpha particle strike - two protons plus two neutrons, emitted when a heavy radioactive element decays into a lighter element. Alpha particles are so large that the chip packaging will normally block them, so they are only a problem when something inside the package undergoes radioactive decay.

Cosmic ray strike - a high energy neutron (or other particle) emitted by the Sun. These particles are gradually absorbed by the Earth's atmosphere, so they are more of a problem in orbit and at high altitude. Cosmic rays can directly impact the silicon, or can hit a nearby atom and throw off neutrons which in turn cause a soft error.

Beta particle strike - an electron, emitted when a neutron is converted into a proton + electron + antineutrino. Beta particles rarely hold enough energy to affect current silicon technology; alpha and cosmic ray strikes are more of a problem.

Soft errors are usually discussed in the context of DRAM, where the problem was initially noticed. DRAM consists of a capacitor to store the bit, with a transistor to keep the capacitor charge stable. A capacitor is an energy storage circuit: it stores charge. A particle strike on the capacitor will impart a large amount of energy, which can spontaneously change it to a 1 or, more rarely, overload its capacity such that the energy quickly escapes into the substrate and leaves the bit as a 0.

Some quick searching will turn up a few facts about soft errors:

Soft errors were first noted in the 1970s.
The primary cause was the use of slightly radioactive isotopes in the chip packaging, such as lead (Pb-212).
Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
Soft errors are now very rare and mostly caused by cosmic rays.

We'll come back to these later.

Beyond DRAM

[Figure: A 6 transistor SRAM bit]

Soft errors are not confined to DRAM alone. Any circuit will glitch if hit by a sufficiently energetic particle, whether DRAM, SRAM, or a logic element. DRAM began to be affected by soft errors when the energy stored in the capacitor shrank to be on the order of the energy induced by an alpha particle. SRAM was not initially affected because its cells are actively driven via 6 transistors, whose energy level is considerably higher. Nonetheless, as silicon feature sizes have shrunk, it is now quite possible for SRAM to suffer a soft error. Soft errors in logic elements are somewhat less noticeable in that they will correct themselves on the next clock cycle, while an error in a storage element will persist until it is rewritten.

[Figure: Intel Nehalem with SRAM highlighted]

Modern CPUs include a great deal of SRAM on the die, comprising the caches, TLBs, reorder buffers, and numerous other uses. The image shown here is Intel's quad core Nehalem die, with the SRAM areas highlighted and logic deemphasized (both based on my best guesses). SRAM is a significant fraction of the die. Many, many other ASIC and CPU designs contain similar or higher fractions of SRAM.

What impact can a bitflip in the SRAM have? Consider that there is just one bit difference between the following two instruction opcodes. A bitflip can make the software come up with results which should be impossible.

ADD R1,1

ADD R1,32769
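To make the single bit difference concrete, here is a minimal sketch in Python. It assumes the immediate operand sits in a 16 bit field, which is an illustrative assumption and not the encoding of any particular instruction set. The helper names flip_bit and differing_bits are made up for this example.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted, as a particle strike might."""
    return value ^ (1 << bit)


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Assumed 16 bit immediate field: flipping the top bit turns 1 into 32769.
original = 1                         # immediate in ADD R1,1
corrupted = flip_bit(original, 15)   # immediate in ADD R1,32769

print(f"{original:016b} -> {corrupted:016b}")
print("bits changed:", differing_bits(original, corrupted))
```

The two operands look wildly different in decimal, yet the XOR of the two values has exactly one bit set.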

Even more bizarrely a bitflip could change some random instruction into a memory reference, such as a load or store. As the register being dereferenced would likely not contain a valid pointer, the process would segfault for inexplicable reasons.

To prevent this problem Intel and all major CPU vendors protect their caches and other on-chip memories using Error Correcting Codes, but many ASIC designs do not. They might implement parity, but on-chip ASIC memories commonly have no error checking at all. A soft error will simply corrupt whatever was in the SRAM, which will only be noticed if it causes the ASIC to misbehave in some perceptible way.
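The difference between parity and an error correcting code can be sketched with a textbook Hamming(7,4) code. This is only an illustration of the principle: real caches use wider codes over larger words, and nothing here is meant to describe Intel's or any other vendor's actual scheme. All function names are invented for the example.

```python
def parity_bit(data_bits):
    """Even parity: the extra bit makes the total number of 1s even."""
    return sum(data_bits) % 2


def parity_ok(data_bits, stored_parity):
    """Parity can report that something flipped, but not which bit."""
    return parity_bit(data_bits) == stored_parity


def hamming74_encode(data_bits):
    """Encode 4 data bits as 7 bits, laid out at positions 1..7 as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome names the flipped position
    if position:
        c[position - 1] ^= 1          # correct the single bit error
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1                        # simulate a soft error at position 5
recovered, where = hamming74_decode(stored)
print(recovered == data, where)
```

With parity alone the hardware can only notice the flip and raise an error. With the three check bits above, a single flipped bit can be located and silently repaired, at the cost of extra storage.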

What happens if a particle strike causes the hardware to misbehave, or to get the wrong answer? Usually, we blame the software. It must be some weird bug.

Back To The Future

Returning to the earlier list of common facts about soft errors, let's focus on the last two.

Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
Soft errors are now very rare and mostly caused by cosmic rays.

There are many different materials used in a finished chip, beyond the silicon die and the gold wires connecting to its pins. The chip package is plastic or ceramic, which is composed of a host of different elements including boron. Solder bonds the wires to the pins. Solder used to be mostly lead, later tin, and might now be a polymer. There are heat spreading compounds and shock absorption goo, which are often organic polymers. Some of these materials are naturally slightly radioactive. For example, in nature lead (Pb-208) contains traces of polonium (Po-212), which will emit alpha particles and decay into Pb-208. Similarly boron-10 is more prone to fission than boron-11 - or so they tell me, I've no idea why.

Modern chip packaging uses strained versions of these materials, to reduce the level of undesirable isotopes and leave purified inert material behind. This is an expensive process. Each new generation of silicon imposes more stringent requirements, making the materials even more costly. It is crucial that the correct materials be used... and this is where human error can creep in.

Using the wrong packaging materials increases alpha emissions to the point where the silicon will experience an unacceptable soft error rate. Yet to control costs it is also important to not overshoot the alpha emission requirements by using a more expensive material than necessary. So each chip design may use a different mix of materials depending on its process technology, die size, and the amount of SRAM it contains. The manufacturer will have checks in place to ensure the correct materials are used, but mistakes can occur. Sometimes a batch of chips is produced which suffers an unusually high soft error rate. When this happens the manufacturer is generally loath to admit it, and will quietly replace the chips with a corrected batch. One recent case where the problem was too widespread to cover up was with the cache of the UltraSPARC-II CPU from Sun Microsystems; see point #5 of an Actel FAQ on the topic for more details.

The Moral of the Story

If you are working on low level software for a chip and run into bizarre errors, you should suspect a software problem first. Soft errors really are rare, and the manufacturing screwups described above are very uncommon. If the problem is repeatable, even if it has only happened twice, it is not a soft error. Particle strikes are too random for that. However, if you keep running into different symptoms where you think "but that is impossible..." you should consider the possibility that the problem really is impossible and was introduced by a bitflip. You should start checking whether the problems are confined to a particular batch of parts, or to parts produced in a certain range of dates. If so, it could be that batch of parts has a problem with the packaging materials.
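One way to do that batch check is to tally failures by the date code stamped on the part. The sketch below is only an illustration: the record layout, the date codes, the counts, and the 2x threshold are all hypothetical, invented for the example rather than taken from any real failure data.

```python
from collections import Counter


def failures_by_date_code(reports):
    """Count failure reports per manufacturing date code."""
    return Counter(r["date_code"] for r in reports)


def suspicious_date_codes(reports, shipped_per_code, ratio=2.0):
    """Return date codes whose failure rate is at least `ratio` times the overall rate."""
    failures = failures_by_date_code(reports)
    total_shipped = sum(shipped_per_code.values())
    total_failures = sum(failures.values())
    if total_shipped == 0 or total_failures == 0:
        return []
    overall_rate = total_failures / total_shipped
    flagged = []
    for code, shipped in sorted(shipped_per_code.items()):
        if shipped and failures[code] / shipped >= ratio * overall_rate:
            flagged.append(code)
    return flagged


# Hypothetical data: three date codes, 1000 units each, ten "impossible" failures.
shipped = {"0931": 1000, "0932": 1000, "0933": 1000}
reports = [{"date_code": "0931"}] + [{"date_code": "0932"}] * 9
print(suspicious_date_codes(reports, shipped))
```

A lopsided tally like this proves nothing by itself, but it is a reasonable prompt to ask the manufacturer about that batch.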

Other resources

While researching this post I came across a few additional sources of information which I found fascinating, but did not have a good place to link to them. They are presented here for your edification and bemusement.

An analysis of the high soft error rate of the ASC Q supercomputer at LANL. These errors were cosmic ray induced, not due to a badly packaged batch of chips. The part of the system in question did not implement ECC to correct errors, only parity to detect them and crash.
Cypress Semiconductor published a book to explain soft errors to their customers.
Fujitsu has a simulator to predict soft error rates, called NISES. The linked PDF is mostly in Japanese, but the images are fascinating and very illustrative.

This post was many years in the making.