Coding Relic: September 2009

Monday, September 28, 2009

49.710269618056 Days

Western Digital recently corrected a firmware issue in certain models of VelociRaptor where the drive would erroneously report an error to the host after 49 days of operation. Somewhat inconveniently for RAID arrays, if all drives powered on at the same time they would all report an error at the same time.

Informed speculation: the drive reports an error after exactly 49 days, 17 hours, 2 minutes, 47 seconds, and 294.999 milliseconds of operation. That is the moment where a millisecond timer overflows an unsigned 32 bit integer.

Tuesday, September 22, 2009

A Pudgier Tux

At LinuxCon 2009 a discussion arose about the Linux kernel becoming gradually slower with each new release. "Yes, it's a problem," said Linus Torvalds. "The kernel is huge and bloated, and our icache footprint is scary. I mean, there is no question about that. And whenever we add a new feature, it only gets worse."

I think the addition of new features is a red herring, and the real problem is in letting Tux eat the herring. Just hide the jars, maybe get a treadmill, and everything will go back to the way it was.

Story originally noted in The Register.

Thursday, September 17, 2009

Jasper Forest x86

Intel has a long but uneven history in the embedded market. In the early days of the personal computer Intel released the 80286 as a followon to the original 8086. There actually was an 80186: it was a more integrated version of the 8086 aimed at embedded applications. Intel's interest in embedded markets has waxed and waned over the years, but it is an area where Intel still has room for significant growth.

I wrote about x86 for embedded use about a year and a half ago, with four main points:

Volume Discounts
PC pricing thresholds at 50,000 units have to be rethought for a less homogenous market
System on Chip (SoC)
Board space is at a premium, we need fewer components in the system
Production lifetime
These systems are not redesigned every few months, chips have to remain in production longer
Power and heat
Airflow is more constrained, and the system has other heat generating components besides the CPU complex

At the Intel Developer Forum next week Intel is expected to focus on embedded applications for its products. In advance of IDF Intel announced the Jasper Forest CPU, a System on Chip version of Nehalem. It is based on a 1, 2, or 4 core CPU plus an integrated PCI-e controller, so it does not need a separate northbridge chip. Intel also committed to a 7 year production lifetime, allowing the part to be designed into products which will remain on the market for a while. I'd speculate that Intel will offer industrial temperature grade parts as well, perhaps at lower frequencies.

Jasper Forest is particularly suited for and aimed at storage applications. It has additional hardware for RAID support (presumably XOR & ECC generation), and a feature to use main memory as a nonvolatile buffer cache. When loss of power is detected the chip will flush any pending writes out to RAM and then set the DRAM to self-refresh before shutting down. By including a battery sufficient to power the DRAM, the system can avoid the need for a separate nonvolatile data buffer like SRAM.

This is a good approach for Intel: target silicon at specific high margin, growing application areas. Go for markets with moderate power consumption requirements, as x86 is clearly not ready for small battery powered applications like phones. Ars Technica discusses Intel's upcoming weapon for getting into mobile and other battery powered markets, a version of their 32nm process which reduces leakage current to almost nothing. An idle x86 would consume essentially no power, which would be huge.

Tuesday, September 15, 2009

Soft Errors Are Hard Problems

"Soft Error" is a euphemism in the semiconductor industry for "the silicon did the wrong thing." Soft errors can occur when a circuit is infused with a sudden burst of energy from an external source, for example when it is hit by a high energy subatomic particle or by radiation.

Alpha particle strike - two protons plus two electrons, emitted when a heavy radioactive element decays into a lighter element. Alpha particles are so large that the chip packaging will normally block them, they are only a problem when something inside the package undergoes radioactive decay.

Cosmic ray strike - a high energy neutron (or other particle) emitted by the Sun. These particles are gradually absorbed by the Earth's atmosphere, so they are more of a problem in orbit and at high altitude. Cosmic rays can directly impact the silicon, or can hit a nearby atom and throw off neutrons which in turn cause a soft error.

Beta particle strike - an electron, emitted when a neutron is converted into a proton + electron + antineutrino. Beta particles rarely hold enough energy to affect current silicon technology, alpha and cosmic ray strikes are more of a problem.

A DRAM bit Soft errors are usually discussed in the context of DRAM, where the problem was initially noticed. DRAM consists of a capacitor to store the bit, with a transistor to keep the capacitor charge stable. A capacitor is an energy storage circuit: it stores voltage. A particle strike on the capacitor will impart a large amount of energy, which can spontaneously change it to a 1 or, more rarely, overload its capacity such that the energy quickly escapes into the substrate and leaves the bit as a 0.

Some quick searching will turn up a few facts about soft errors:

Soft errors were first noted in the 1970s.
The primary cause was the use of slightly radioactive isotopes in the chip packaging, such as lead (Pb-212).
Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
Soft errors are now very rare and mostly caused by cosmic rays.

We'll come back to these later.

Beyond DRAM A 6 transistor SRAM bit

Soft errors are not confined to DRAM alone. Any circuit will glitch if hit by a sufficiently energetic particle - whether DRAM, SRAM, or a logic element. DRAM began to be affected by soft errors when the energy stored in the capacitor shrunk to be on the order of the energy induced by an alpha particle. SRAM was not initially affected because its cells are actively driven via 6 transistors, whose energy level is considerably higher. Nonetheless as silicon feature sizes have shrunk it is now quite possible for SRAM to suffer a soft error. Soft errors in logic elements are somewhat less noticeable in that they will correct themselves on the next clock cycle, while an error in a storage element will persist until it is rewritten.

Intel Nehalem with SRAM highlighted Modern CPUs include a great deal of SRAM on the die, comprising the caches, TLBs, reorder buffers, and numerous other uses. The image shown here is Intel's quad core Nehalem die, with the SRAM areas highlighted and logic deemphasized (both based on my best guesses). SRAM is a significant fraction of the die. Many, many other ASIC and CPU designs contain similar or higher fractions of SRAM.

What impact can a bitflip in the SRAM have? Consider that there is just one bit difference between the following two instruction opcodes. A bitflip can make the software come up with results which should be impossible.

ADD R1,1

ADD R1,32769

Even more bizarrely a bitflip could change some random instruction into a memory reference, such as a load or store. As the register being dereferenced would likely not contain a valid pointer, the process would segfault for inexplicable reasons.

To prevent this problem Intel and all major CPU vendors protect their caches and other on-chip memories using Error Correcting Codes, but many ASIC designs do not. They might implement parity, but on-chip ASIC memories commonly have no error checking at all. A soft error will simply corrupt whatever was in the SRAM, which will only be noticed if it causes the ASIC to misbehave in some perceptible way.

What happens if a particle strike causes the hardware to misbehave, or to get the wrong answer? Usually, we blame the software. It must be some weird bug.

Back To The Future

Returning to the earlier list of common facts about soft errors, lets focus on the last two.

Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
Soft errors are now very rare and mostly caused by cosmic rays.

There are many different materials used in a finished chip, beyond the silicon die and the gold wires connecting to its pins. The chip package is plastic or ceramic, which is composed of a host of different elements including boron. Solder bonds the wires to the pins. Solder used to be mostly lead, later tin, and might now be a polymer. There are heat spreading compounds and shock absorption goo, which are often organic polymers. Some of these materials are naturally slightly radioactive. For example, in nature Lead (Pb-208) contains traces of Pollonium (Po212) which will emit alpha particles and decay into Pb-208. Similarly boron-10 is more prone to fission than boron-11 - or so they tell me, I've no idea why.

Modern chip packaging uses strained versions of these materials, to reduce the level of undesirable isotopes and leave purified inert material behind. This is an expensive process. Each new generation of silicon imposes more stringent requirements, making them even more costly. It is crucial that the correct materials be used... and this is where human error can creep in.

Using the wrong packaging materials increases alpha emissions to the point where the silicon will experience an unacceptable soft error rate. Yet to control costs it is also important to not overshoot the alpha emission requirements by using a more expensive material than necessary. So each chip design may use a different mix of materials depending on its process technology, die size, and the amount of SRAM it contains. The manufacturer will have checks in place to ensure the correct materials are used, but mistakes can occur. Sometimes a batch of chips is produced which suffers an unusually high soft error rate. When this happens the manufacturer is generally loathe to admit it, and will quietly replace the chips with a corrected batch. One recent case where the problem was too widespread to cover up was with the cache of the Ultrasparc-II CPU from Sun Microsystems, see point #5 of an Actel FAQ on the topic for more details.

The Moral of the Story

If you are working on low level software for a chip and run into bizarre errors, you should suspect a software problem first. Soft errors really are rare, and the manufacturing screwups described above are very uncommon. If the problem is repeatable, even if it has only happened twice, it is not a soft error. Particle strikes are too random for that. However if you keep running into different symptoms where you think "but that is impossible..." you should consider the possibility that the problem really is impossible and was introduced by a bitflip. You should start checking whether the problems are confined to a particular batch of parts, or only produced in a certain range of dates. If so, it could be that batch of parts has a problem with the packaging materials.

Other resources

While researching this post I came across a few additional sources of information which I found fascinating, but did not have a good place to link to them. They are presented here for your edification and bemusement.

An analysis of the high soft error rate of the ASC Q supercomputer at LLNL. These errors were cosmic ray induced, not due to a badly packaged batch of chips. The part of the system in question did not implement ECC to correct errors, only parity to detect them and crash.
Cypress Semiconductor published a book to explain soft errors to their customers.
Fujitsu has a simulator to predict soft error rates, called NISES. The linked PDF is mostly in Japanese, but the images are fascinating and very illustrative.

This post was many years in the making.

Friday, September 11, 2009

Motorola Mobile Meanderings

On September 10th Motorola announced the CLIQ, its first Android phone and a product which a good friend of mine has labored over for quite some time.

For many years I have used my trusty Sony-Ericsson T616, but I admit it might be getting a bit dated. I decided to compare the two devices to see whether its time for me to take up a new phone.

	T616	CLIQ
Cracked LCD screen?	Y	N
Broken down arrow key?	Y	N
Wallpaper picture of my Daughter?	Y	N

As you can see, clearly I cannot replace my T616 until Motorola rectifies these glaring omissions in the CLIQ. The cracked screen and non-functional arrow key could both be achieved via sufficiently rough handling. Frankly I don't know how Motorola would obtain pictures of my daughter to use as wallpaper, though I might be willing to accept this little flaw and add my own wallpaper later.

This chart shows that my T616 is the perfect phone, but I'm in a generous mood. The first person to offer me enough money for me to buy a CLIQ, takes it.

Tuesday, September 8, 2009

Infinite Arrays of Tweeples

Twitter followers list Twitter is somewhat out of the range of topics I normally cover here, but I promise we'll come around to a software development angle by the end of this post.

When you follow someone on twitter, you appear in their followers list and they appear in your following list. New entries appear at the top, so the newest follower will be the first entry in your list. Recently I noticed an exception to this sort ordering, when someone who had been following a long time ago and later unfollowed decided to follow again. I received the notification email from twitter, yet he wasn't at the top of the list of followers. Instead he appeared much further down in the listing, off the first page. That is where he was when he followed me the first time, many months ago. This entry disappeared when he unfollowed, and when he re-followed he ended up back in that same place. Why would that be?

Possibility #1: Timestamp

The first possibility, and more likely the correct one, is that twitter tracks the timestamp of every new follow and chooses not to update it on a subsequent refollow. No matter how many times you have followed/unfollowed, you retain the timestamp of the very first time and will show up in the followers list at that position.

If an unfollow+refollow was sufficient to move you back up to the top of the list of followers, the bots would do it all the time to make it more likely you'd follow them back. Yet this is a boring root cause. So lets consider a second possibility which is more illustrative to software development.

Possibility #2: Array Deletion

Twitter operates at a scale where performance optimization is essential. If they are not cognizant about performance the wheels fly off and users start to see the Fail Whale. An area of particular importance is the list of followers, as the backend infrastructure has to traverse it for every tweet. It is possible that twitter implemented the followers list as an array in memory instead of a linked list, presumably to get better locality. The classic drawback of an array is deletion: you cannot delete an element from the middle without moving all subsequent elements into the hole thus created. To avoid this compaction a "deleted" or "active" bit is commonly kept for each element, allowing deleted entries to be left in place but skipped without processing.

When Scobleizer unfollowed everyone it would have resulted in holes in the followers list of 106,000 different accounts, entries with the deleted bit set.

I suspect that Twitter does not immediately compact these arrays, so long as the ratio of holes/filled entries is tolerable. When Scobleizer decided to re-follow me the twitter backend located the earlier, deleted entry and flipped the bit back to active.

Thus the newly restored entry will re-appear in the followers list, but not as the top most entry. It will re-appear at its existing position within the array. This, or a similar implementation choice of retaining deleted entries in some way, could be why re-follows do not appear at the top of the list.

The Moral of the Story

Optimization is fine, and absolutely crucial to function at Twitter scale, but one must to be careful when an optimization changes user-visible behavior. This is particularly true for social media, where we're explicitly conversing with other humans and ascribe human motivations to their actions. Twitter's handling of deleted and re-added follows can cause considerable consternation, because to the casual observer it appears the person followed but then immediately unfollowed. It can seem judgmental.

Of course, I am most likely completely wrong about Twitter's implementation using arrays. It wouldn't be the first time I've made a complete fool out of myself in a blog post. Its cathartic, in a way. Perhaps I'll do it more often.