Monday, December 28, 2009

Crowdsourcing Backup

Broken hard drive

Jeff Atwood recently suffered a catastrophic data loss on his long-running blog Coding Horror. The site ran on a virtual machine, and apparently VM backups at the hosting provider had been failing routinely for years without anybody noticing. Jeff maintained his own backups... within the VM itself, and they were lost along with the VM. Jeff's story has a happy ending: one of his readers, Carmine Paolino, had a complete archive.

Obviously the happenstance of somebody on the Internet having a complete copy of data important to us does not constitute a practical backup strategy, but it got me to thinking about the idea of crowdsourcing backups. Everybody should have offsite backups, but practically nobody does it. Could a system be designed where each participant wanting to back up their most important data would in return offer a chunk of local disk space to use for storing data for other people?

With terabyte drives becoming common, it seems like many systems have an abundance of disk space which could be better taken advantage of. Perhaps the data you want to be backed up can be broken into chunks and stored in the free space of a number of other backup users, while your drive simultaneously stores their data.

  • Your data would have to be encrypted, as it will be stored on media controlled by random and potentially untrustworthy people.
  • A large amount of redundancy would have to be baked in, as people could drop out of the system at any time and take a chunk of stored information with them. Each chunk would be stored in multiple places.
  • Forward Error Correction would also be good, to further improve survivability in the face of missing data. Recovering most of the chunks would be sufficient to reconstruct the rest.
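A sketch of how the chunking and redundancy might fit together, using a toy XOR parity chunk as a stand-in for real forward error correction such as Reed-Solomon (encrypting the data before chunking is assumed but not shown):

```python
def make_chunks(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks plus one XOR parity chunk -- a toy
    stand-in for real FEC. Any single lost chunk can be rebuilt."""
    size = -(-len(data) // k)                     # ceiling division
    padded = data.ljust(k * size, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytearray(size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return chunks + [bytes(parity)]

def recover(chunks: list, missing: int) -> bytes:
    """Rebuild the chunk at index `missing` by XORing all survivors
    (the data chunks and the parity chunk XOR together to zero)."""
    size = len(next(c for c in chunks if c is not None))
    out = bytearray(size)
    for i, chunk in enumerate(chunks):
        if i != missing:
            for j, byte in enumerate(chunk):
                out[j] ^= byte
    return bytes(out)
```

A real system would use proper erasure codes so that losing any m of n chunks is survivable, not just one, but the shape of the idea is the same.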

The practicality of the details aside, with Amazon, Rackspace, and others offering cloud storage options, would it even be worthwhile to construct such a crowdsourced system? In 2010, I'm not sure that it is. I suspect this is an idea whose time has come... and gone.

Wednesday, December 23, 2009

Satellites Should Respond to My Whims

There was a pretty spectacular accident in Jamaica last night, where a 737 skidded off the runway and broke into pieces. Check the picture in the linked story, I'll wait. Amazingly only two passengers were injured.

Surely I'm not the only person who immediately checked satellite imagery, on the off chance that maybe, just maybe the periodic flyover happened to be this morning. Alas, no.

View Larger Map

Monday, December 21, 2009

North Pole Compression Algorithm

Santa floppy disk ornament

Note the lack of the "HD" logo on the dust cover? Santa must have remarkable compression technology to fit the entire 1998 naughty/nice list on an 800k disk. The prevalence of popular baby names from year to year probably helps; there is a lot of duplication.
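For what it's worth, the duplication hypothesis holds up. A quick experiment with Python's zlib, using an invented roster of popular 1998 names:

```python
import zlib

# A made-up roster: popular 1998 names repeat over and over.
names = ["Michael", "Emily", "Jacob", "Hannah", "Matthew", "Ashley"]
roster = "\n".join(f"{names[i % len(names)]},nice" for i in range(100_000))

raw = roster.encode()
packed = zlib.compress(raw, level=9)
print(len(raw), len(packed))   # the compressed roster is a tiny fraction of the original
```

Santa's elves presumably have something even better than DEFLATE, but repetitive data really is that compressible.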

Wednesday, December 16, 2009


Slashdenfreude [slash-den-froi-duh] (noun): joy taken in the slashdotting of others.

Monday, December 14, 2009

A Coroutine, Thread, and Semaphore Walk into a Bar...

This article about multicore programming techniques is pure comedic gold.

In particular, threads suffer badly from 'race conditions'. The race of despised worker threads is made to do boring, low status, 'background' tasks. Meanwhile, the high privilege 'system' threads get to party with the hardware. It's the same the whole world over.

It is a great read with that peculiar British humour which The Register is so good at. It is also a good technical overview of techniques for taking advantage of multiple cores.

Thursday, December 10, 2009

Untouchable Code

Behold: the Blaupunkt CD50.

Perhaps "behold" is too pretentious for a basic car stereo, but it is the topic of today's screed so I feel a dramatic introduction is called for. Let me call your attention to four buttons on the left side: RDS, AM, FM, and CD-C. They do what you might expect:

  • RDS - enable decode of station and song identification from an FM signal. I'm not sure why you'd ever disable this.
  • FM - switch to FM radio.
  • AM - switch to AM radio.
  • CD-C - switch to CD Changer mode. Once in CD-C mode, the RDS/FM/AM buttons have no effect until you push CD-C to get back to Radio mode.

This makes perfect sense, right? I mean really, once I'm in the CD player mode I wouldn't expect the buttons from Radio mode to do anything, would I? Yes, this is sarcasm. On the web. Dangerous, I know.

I'd speculate that the fine engineers at Blaupunkt did not actually want the user interface to be this way, and that they would have preferred the FM button to always switch to FM radio. I suspect they were presented with an existing AM/FM radio design which, for whatever reason, they could not modify. Perhaps the CD50 project had a very tight budget, or a narrow market window which didn't allow time to tweak the radio components. Less charitably, perhaps the radio design had degenerated into an unmaintainable mess and any change risked breaking the whole thing. The path of least resistance to get the product out is a mux: you're either in our new mode where we add all the shiny new goodness, or the crufty old mode where we haven't touched anything from the existing design.
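If my guess is right, the resulting control flow looks something like this sketch (entirely hypothetical, not Blaupunkt's actual firmware):

```python
class CD50:
    """Hypothetical sketch of the mode mux in the CD50's firmware."""
    def __init__(self):
        self.mode = "radio"
        self.band = "FM"

    def press(self, button):
        if button == "CD-C":
            # the one button the new code always handles: toggle the mux
            self.mode = "cdc" if self.mode == "radio" else "radio"
        elif self.mode == "radio":
            if button in ("AM", "FM"):
                self.band = button  # handed to the untouched legacy radio code
            # RDS and presets likewise route into the legacy design
        # in CD-C mode, every other button is silently swallowed
```

The untouchable radio code never needs to learn about the CD changer; the price is that AM/FM/RDS do nothing until you leave CD-C mode.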

The situation of an unmaintainable portion of a system should be familiar to any software developer tasked with working on a large codebase. I suspect the natural entropic state of software is unmaintainability, requiring constant infusions of energy to stave it off a while longer.

So what can we do to ensure systems remain maintainable? Unit testing is frequently suggested as an answer, though I've never been a fan of extensive unit testing. If the target platform is very different from the build system, structuring the code to be able to run unit tests is a non-trivial amount of extra work. However, I've recently started working in an environment where development testing is strongly encouraged, and I have to admit it does help in keeping code maintainable as developers come and go.

The lowest level unit tests are not terribly useful in this regard; even code in a complete mess will have unit tests. On the other hand, a functional testbench for a module, where the interfaces to the rest of the system are mocked out, is very helpful. You have a much firmer grasp of how changes you make in the module are going to impact the rest of the system. You also have more hope of being able to reimplement the module, as its interfaces are described by the mock framework. If other portions of the system reach into the internals of the module without using the interfaces... then the cancer has already metastasized and you're probably doomed.
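As a sketch of what such a testbench looks like, here is a Python example using unittest.mock; the RadioTuner module and its hw interface are invented for illustration:

```python
import unittest
from unittest import mock

class RadioTuner:
    """Module under test: talks to the rest of the system only through `hw`."""
    def __init__(self, hw):
        self.hw = hw

    def tune(self, mhz):
        if not 87.5 <= mhz <= 108.0:
            raise ValueError("outside the FM band")
        self.hw.set_frequency(mhz)
        return self.hw.read_signal_strength()

class TunerTestbench(unittest.TestCase):
    def test_tune_programs_hardware(self):
        hw = mock.Mock()                          # the rest of the system, mocked out
        hw.read_signal_strength.return_value = 42
        self.assertEqual(RadioTuner(hw).tune(101.5), 42)
        hw.set_frequency.assert_called_once_with(101.5)

    def test_out_of_band_never_touches_hardware(self):
        hw = mock.Mock()
        with self.assertRaises(ValueError):
            RadioTuner(hw).tune(200.0)
        hw.set_frequency.assert_not_called()
```

The mock is the documented interface: anyone reimplementing the module knows exactly which calls it must make on the rest of the system.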

People say that refactoring early and often will keep code maintainable, but that requires agreement from management to spend development time paying off technical debt without an increase in marketable features. In a product environment I rarely win that argument. However, I've now worked on a complete re-implementation of two different systems where the bug load had simply become impossible due to indecipherable engineering. I suppose that is an extreme form of refactoring: extract the best bits of the old system, and throw out the rest.

Do you have tips for keeping code maintainable across multiple generations of products?

Monday, December 7, 2009

Ruminations on Nickels and Dimes

Nickels & Dimes

In 1st grade I could not see how a dime could possibly be worth more than a nickel. The nickel was bigger, after all; it should obviously be worth more.

By 5th grade I realized the dime was more valuable because it was made of a more valuable metal. So even though it was smaller, its total worth was greater than the nickel.

It took until high school to figure out that both the dime and nickel are made of completely worthless metals. The dime is worth more because the US Treasury says it is worth more.

Thursday, December 3, 2009

Memory Matters

PowerPC

I once worked on a system where one module was developed externally on Linux/x86 systems, brought in-house, and compiled for Linux/PowerPC. We thought we had been careful in the specifications: avoid endianness assumptions, limit memory footprint, and assume a hefty derating for the slower PowerPC used in the real system. Things looked good in initial testing, but when we started internal dogfooding the PowerPC performance dropped off the proverbial cliff. An operation that took 100 msec on the x86 development system and 300 msec during initial PowerPC testing regressed to an astonishing 45 seconds in the dogfood deployment.

The cause of this disparity was the data cache. For reasons unclear, this code iterated through its configuration many, many times. On x86 the various levels of D$ comprise several megabytes, but the PowerPC had only 16K. As the dogfooding progressed and the config grew, the result was unbelievable cache thrashing and a 2.5 orders of magnitude performance drop.

Several years ago Ulrich Drepper wrote an excellent paper about all things related to memory in modern system architectures, especially x86 but relevant everywhere. It is a long read, but very worthwhile. The complete paper is available as a PDF from his site, and it was also serialized in articles on LWN.

  1. Introduction
  2. CPU caches
  3. Virtual memory
  4. NUMA systems - local versus remote references
  5. What programmers can do - cache optimization
  6. What programmers can do - multi-threaded optimizations
  7. Memory performance tools
  8. Future technologies
  9. Appendices and bibliography

I downloaded the PDF and read it over the course of a few weeks. I strongly recommend this paper; the information content is very high.

Tuesday, December 1, 2009

USPS and Red Tape

I recently mailed a package using a Pitney Bowes postage meter. After calculating the postage there is a screen listing restrictions on items which cannot be mailed through the US postal system. I suspect most people just click through without reading it, which is a shame. It is a fascinating read, and is reproduced here for your edification and bemusement.

Harmful matter includes, but is not limited to:
a. All types and classes of poisons, including controlled substances.
b. All poisonous animals except scorpions mailed for medical research purposes or for the manufacture of antivenom; all poisonous insects; all poisonous reptiles; and all types of snakes, turtles, and spiders.
c. All disease germs or scabs.
d. All explosives, flammable material, infernal machines, and mechanical, chemical, or other devices or compositions that may ignite or explode.
Hazardous items includes materials such as caustic poisons (acids and alkalies), oxidizers, or highly flammable liquids, gases, or solids; or materials that are likely, under conditions incident to transportation, to cause fires through friction, absorption of moisture, or spontaneous chemical changes or from retained heat from manufacturing or processing, including explosives or containers previously used for shipping high explosives with a liquid ingredient (such as dynamite), ammunition, fireworks, radioactive materials, matches, or articles emitting obnoxious odors.

This is great stuff. I have several observations.

Cube from the movie Hellraiser

Note the "infernal machines" phrase in section (d). What is an infernal machine? The Pitney Bowes restrictions appear to come directly from the US Postal Service Domestic Mail Manual C021, but the phrase is not subsequently defined there. Is it something like the puzzle boxes from Hellraiser? I can see why we might not want those to be spread around...

I had no idea the scorpion industry lobby was so powerful. In fact, I had no idea that there was a scorpion industry nor that they had a lobbyist, but they scored their own exception in section (b). Poisonous snakes need not apply, only scorpions can be mailed for medical research purposes.

In the final paragraph it explains that you're not allowed to ship explosives under any circumstances. Also, you need to ensure that containers which you previously used to ship dynamite have been cleaned of any residue. I'm not sure how you could possibly have containers which were previously used to ship dynamite, if you're never allowed to ship dynamite.

Finally, in addition to the rules against shipping materials which could kill or maim note that you're not allowed to ship anything smelly or stinky. That would be gross.

Wednesday, November 25, 2009

The Kindle Firmware Hero

Kindle

Amazon's Kindle software version 2.3 increased battery life from 4 days to 7 days; quite an improvement. Only the Kindle 2(*) model using HSDPA saw this improvement; the Kindle DX uses an EVDO radio and still lists a 4 day battery life.

It seems likely that the Kindle 2 shipped with incomplete radio power management to meet its shipment deadline, and this update represents the completed work. Nonetheless it's fun to instead contemplate the moment when some firmware engineer poring over register settings utters a prodigious "WTF!?!" upon finding something completely bogus. A few keystrokes later and voilà, huge battery life improvements...

(*) Amazon Associates link

Tuesday, November 24, 2009

Mayor For Life on Foursquare

foursquare.com

foursquare is one of the early entrants in a coming wave of location-based web services. Foursquare catalogs a huge list of venues in 100 cities around the globe: restaurants, movie theaters, museums, bars, etc. You check in with the service as you visit these places, and the system shows you tips that other foursquare users have left about that location. It also (optionally) broadcasts your checkin to your friends, so you can arrange meetups or just learn about new spots by watching their activities. Currently you set up your friend lists on the foursquare web site, though it does provide a way to check whether any of your Twitter, Facebook, or Gmail contacts are using foursquare.

foursquare badges

An interesting aspect of foursquare is the gaming angle. Badges are awarded for a huge range of activities, for example four checkins in one day earns the "crunked" badge. It looks like a drunk happy face, though in my case no alcohol was involved: Children's Discovery Museum, a local park, Fry's Electronics, and a restaurant. As with stackoverflow, badges provide a way for the developers to reward proper use of the site which doesn't cost them any money.

Finally, there is Mayorship. The person who has checked in to a venue the most in the last 60 days is declared to be the Mayor. You can steal the Mayorship away from its current holder by visiting more often, which gives the site a competitive feeling. Apparently the competition for Mayorship of hot nightspots is intense, complete with accusations of cheating. An old saying about academia springs to mind: "On foursquare, tempers run high because the stakes are so small." Nonetheless, the Children's Discovery Museum Mayorship is mine. Don't even think about trying to take it.

foursquare mayor of the Childrens Discovery Museum

A small number of business owners offer rewards to their foursquare mayor, typically on the order of a free drink. This hints at a route foursquare can take to monetize the site, by allowing businesses to reach out to patrons. The challenge will be to do this in a way that isn't creepy: a leaderboard to see how close I am to becoming Mayor would be fine, actively bugging me to visit more often would not be.

About SMS...

SonyEricsson T616

The best experience using the service is with a GPS-enabled smartphone. There are free apps available for iPhone and Android, and there is a mobile-optimized website for phones with a reasonable browser. Finally, there is SMS. As I still use an ancient DumbPhone, I use SMS. One of these years, I'll buy a new phone.

foursquare is clearly aimed at people with better phones. You have to type the venue name exactly, there is no fuzzy matching. If your checkin is not recognized, there is no way to correct it after the fact on the foursquare website. This can be very frustrating. Fred Wilson wrote about the importance of including SMS support in mobile apps, both to allow someone to try the service without having to install an app and to have an answer for the entire market. Certainly in my case, I wouldn't otherwise be able to use it.

Monday, November 23, 2009


Scientists today announced the creation of a new isotope in the "island of stability" beyond Bismuth in the periodic table. It has been christened Quackulum, owing to the somewhat odd arrangement of its nucleus.

Rubber duck surrounded by electron paths

Tuesday, November 17, 2009

24.855134809027 Days

There have been issues with the autofocus on the Motorola Droid phone, which suddenly resolved themselves this morning and led to speculation of a stealth update. There is a fascinating comment in the Engadget forums by Dan Morrill (and noted in a tweet from Matt Cutts):

There's a rounding-error bug in the camera driver's autofocus routine (which uses a timestamp) that causes autofocus to behave poorly on a 24.5-day cycle. That is, it'll work for 24.5 days, then have poor performance for 24.5 days, then work again.

I suspect it is exactly 24 days, 20 hours, 31 minutes, 23 seconds, and 647 milliseconds, the amount of time for a millisecond quantity to overflow a signed 32 bit integer. This is a relatively common programming error, and one which can slip through a compressed QA schedule. In the case of the Droid, the camera was working fine while the QA team tested it and then stopped working slightly after the product shipped.
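A quick check of the arithmetic (Python integers don't overflow, so the 32-bit wraparound is simulated by masking):

```python
def to_int32(n: int) -> int:
    """Interpret n modulo 2**32 as a signed 32-bit value, the way C does."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

MAX_MS = 2**31 - 1                  # the last millisecond count that stays positive
print(MAX_MS / 86_400_000)          # about 24.8551348 days

# Break the limit down into units:
ms, milli = divmod(MAX_MS, 1000)
ms, sec = divmod(ms, 60)
ms, minute = divmod(ms, 60)
days, hours = divmod(ms, 24)
print(days, hours, minute, sec, milli)   # 24 20 31 23 647

print(to_int32(MAX_MS + 1))         # -2147483648: one tick later, the clock goes negative
```

Any driver comparing such a timestamp with signed arithmetic misbehaves the moment it wraps, and then recovers when it wraps back, which matches the reported cycle.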

Motorola Droid

Monday, November 16, 2009

The Point of the Exercise

Spam email with no attachment

Setting up a phishing site: $25
Hiring a botnet to deliver spam: $0.0008/recipient
Forgetting to attach the malware: priceless

Thursday, November 12, 2009

Cavium Buys Montavista

A bit of news got buried by other massive acquisitions this week: Cavium Networks acquired MontaVista Software for $50 million. The offer consisted of $16 million in cash plus $34 million in stock. It has been reported that MontaVista raised somewhere between $90 million and over $100 million from investors, but browsing the SEC Edgar database shows $68 million. As I have no idea what I'm doing, it's possible I simply missed another $20-30 million in fundraising which isn't so easily discoverable via Edgar. In particular a $3 million C round is awfully small, but that is what the paperwork shows.

  • A round: $31 million from USVP, Alloy Ventures, and James Ready (the founder) closed 5/2002. From the amendments it looks like Alloy put in $5 million of that.
  • B round: $9 million from existing investors, closed 4/2004
  • C round: $3 million from existing investors, closed 1/2005
  • D round: $21 million closed 12/2006, with Siemens Venture Capital joining as a new investor
  • also $2.7 million in 8/2009 and another $1 million in 10/2009, presumably lifeline funding leading up to the Cavium acquisition.

Fistful of Dollars

Why would investors agree to sell the company for $50 million? Presumably, they're just accepting reality. Software support businesses rarely attract venture capital, but Linux was a major buzzword for investors earlier in the decade. The trouble with support as a business model is that expenses grow linearly with revenue: as you add customers, you have to grow headcount to handle them. Expenses for a product company grow at a far slower rate, one can increase sales by 2x while increasing expenses by less than 2x.

So far as I can tell, adoption of Linux in the embedded space is still growing robustly, displacing commercial RTOSes. The economic benefit of avoiding a per-unit software royalty is compelling. The expertise to bring up Linux on a new board is quite common now; companies can beef up their own teams rather than pay for support from MontaVista or Wind River.

Update: In the comments teich points out Business Review Online shows a somewhat different funding schedule:

MontaVista   9.0   Series A
MontaVista  23.0   Series B
MontaVista  28.0   Series C
MontaVista  12.0   Series D
MontaVista   3.0   individual investment
MontaVista  21.0   Series E

After the $21 million round, MontaVista appears to have taken in another $3.7 million. Altogether this matches the $100 million quoted elsewhere, though I've no idea why some of these funding events are not in Edgar.

Tuesday, November 10, 2009

Ethernet Integrity, or the Lack Thereof

Have you heard any variation of this claim?

We don't need our own integrity check. The TCP checksum is pretty weak, but Ethernet uses a ludicrously strong CRC. Even if you don't trust the TCP checksum, Ethernet will detect any errors.

Let's dig into this a bit, shall we?

Ethernet switch diagram

A modern switch fabric chip is designed for both L2 ethernet switching and L3 IP routing. The additional logic for IP routing adds relatively little area in modern silicon technologies, while not having a routing capability would put a chip at a competitive disadvantage. Essentially all ethernet fabric chips, even those inside relatively cheap L2 switches, have the design features to route IPv4 traffic to at least a basic degree.

When a packet arrives at the input port (A) its CRC will be checked and the packet discarded if corrupt. If the packet is destined to the router's MAC address, its destination IP address will be looked up for L3 routing (C). An L3 router modifies the packet as part of its function, by decrementing the IP TTL and replacing the L2 destination with that of the next hop. Therefore a fresh CRC has to be regenerated at egress.

Even if the packet is to be switched at L2 (B), there are cases where the packet is modified. For example server machines and switch uplinks often handle multiple vlans, so their ports will be configured for tagging (D). Addition of the vlan tag requires the packet CRC to be recalculated on egress (E).

Vlan tagging

The point of this description? There are numerous cases at both L2 and L3 where a packet CRC cannot be preserved through the switch and will need to be regenerated at egress. ASIC designers hate special cases, as they add logic and test cases to the design. Because there are cases where the CRC must be regenerated, modern switch fabrics always regenerate the CRC at egress. Even if the packet has not been modified, even if the ingress CRC could have been preserved, it is discarded at ingress and regenerated at egress.

It bears repeating that this is a function of the chip, not the specific product. Even the tiny ethernet switches sold for practically nothing at retail use chips which contain basic vlan tagging and IP routing features (even if that product doesn't use them), and regenerate the CRC on every packet. The fabric chip they use wasn't specifically designed for such low cost switches; there is not enough profit to justify the effort. In addition to simple L2 switches those chips can be used to build NAT appliances, as the ethernet fan-out for small wireless access points, in DSL and cable routers, for low end WAN routers, etc. When only basic L2 switching is desired these fabrics can function completely standalone without a management CPU, reducing BOM cost to the bare minimum. Addition of a CPU allows the basic L3 functions to be used in the more featureful (but still low end) products.


What does this mean? The internal memories and logic paths within the switch are not covered by the ethernet CRC; it does not provide end-to-end protection. The switch might implement ECC over the whole path, but this is not common. The packet buffers are generally large enough to justify ECC, but miscellaneous FIFOs are more likely to have simple parity, and logic elements often have no protection at all. It only takes one soft error to corrupt the packet contents, and then a fresh CRC will be calculated over the corrupted data.

CRC protects the wire
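The failure mode described above is easy to demonstrate. A sketch using Python's zlib.crc32, which uses the same CRC-32 polynomial as the Ethernet FCS:

```python
import zlib

def with_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 trailer, as the MAC does on transmit."""
    return frame + zlib.crc32(frame).to_bytes(4, "little")

def fcs_ok(frame: bytes) -> bool:
    """Check the trailer, as the MAC does on receive."""
    return frame[-4:] == zlib.crc32(frame[:-4]).to_bytes(4, "little")

frame = with_fcs(b"original payload")
assert fcs_ok(frame)                  # passes the wire-level check at ingress

# Inside the switch the FCS has been stripped; a soft error flips one bit
# in a buffer, then a fresh FCS is computed at egress...
payload = bytearray(frame[:-4])
payload[3] ^= 0x01
egress = with_fcs(bytes(payload))

assert fcs_ok(egress)                 # ...and the corrupted frame verifies perfectly
```

The downstream receiver has no way to tell that anything went wrong: the CRC is valid, it just covers the wrong bytes.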

If you care about the data you send over the network, you should include your own integrity check at the application level. This is another good argument for using SSL: not only do you protect privacy by encrypting the data, you also get a strong end-to-end integrity check.
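A minimal sketch of such an application-level check, carrying a SHA-256 digest alongside the payload (this guards against accidental corruption; guarding against deliberate tampering needs a keyed MAC, which SSL also provides):

```python
import hashlib

DIGEST_LEN = 32   # bytes in a SHA-256 digest

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before sending."""
    return hashlib.sha256(payload).digest() + payload

def unseal(message: bytes) -> bytes:
    """Verify and strip the digest on receipt."""
    digest, payload = message[:DIGEST_LEN], message[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload corrupted in transit")
    return payload
```

Because the digest is computed by the sending application and checked by the receiving application, no amount of CRC regeneration in the middle can hide a flipped bit.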

Monday, November 9, 2009

Knight Rider GPS

Knight Rider GPS $149 at Frys

You mean my GPS unit can sound like the authentic voice of K.I.T.T.? Sign me up!

Tuesday, November 3, 2009

Delivering Apple TV Around the Planet

I am a long time Macintosh user, and though I don't generally write about Apple products I'm going to branch out a bit this time. These observations are rather obvious, but I'm going to do it anyway. Pppffftt.

iMac 27 inch
  1. Apple is reportedly pitching a $30/month subscription for TV via iTunes to content producers, to be offered sometime in early 2010.
  2. According to the iFixit teardown, the LCD in the recently announced 27" iMac is an In Plane Switching (IPS) design by LG. This provides a very wide viewing angle compared to the Twisted Nematic (TN) LCDs typically used in personal computers. IPS LCDs are generally used in television sets, as one wants to see the TV from the entire width of the couch, not just one narrow section.
  3. The iMac LCD has an unusual native resolution, 2560x1440. It is exactly double, in each dimension, the resolution of the AppleTV, making it straightforward to optimize content for both devices.
  4. There is no TV tuner in either AppleTV or the iMac, though of course external USB tuners for broadcast HDTV are available. Nonetheless, HDTV broadcast is yesterday's technology.
  5. Apple is spending $1 billion to build a massive datacenter in Maiden, North Carolina. Certainly Apple's current MobileMe and iTunes services demand significant capacity to host them, but with such a large investment it seems likely that Apple is looking to expand into new areas and not just continue its existing services.
  6. Altogether Apple is budgeting $1.9 billion for capital expenditures in fiscal 2010, an increase from the $1.1 billion in 2009.

AppleTV

So, Apple is preparing to do to video distribution what it did in music, providing a complete solution including content, delivery, and customer device, right? Apple already provides selected video content on iTunes, and can expand from there.

Though there are a lot of clues, there is a big piece missing. If one is looking to spend billions to develop the infrastructure to become a big player in video distribution, a single massive datacenter on the US east coast is not the way to do it. It leaves you beholden to others to carry the bits around the planet, which becomes the dominant cost of the business.

Delivering Apple content

The modern Internet consists of a series of interconnected backbone networks, generally referred to as "Autonomous Systems" following the BGP terminology. Packets leave the datacenter and traverse the backbones on their way to their destination. The final telecom facility before the packet reaches its destination is called the Edge Point of Presence. Delivering media streams all the way across the internet to thousands of individual subscribers incurs significant bandwidth costs. Content Delivery Networks reduce the bandwidth costs by scattering small caching servers across thousands of POPs, serving each subscriber from the closest cache.

Apple currently relies on both Akamai and Limelight to deliver its iTunes content. Apple could certainly expand its current relationship with those vendors to carry substantially more video traffic... but for the kind of investment they are making, it seems like the money would be better spent scattering a number of smaller facilities across the planet. Peer to peer between end users seems like an ideal way to distribute video by offloading the bandwidth costs, but in practice it has not worked very well. Content distribution at scale is something which requires significant investment.

Apple has $34 billion in liquid assets, and an economic recession is a great time to buy. I'll be watching for indications that Apple is bringing this solution in house either by building lots of smaller facilities, perhaps with their own fiber network between them, or by building or acquiring a CDN of their own. Massive investment at the endpoints while remaining completely dependent on others for the connection between them doesn't make sense.

Monday, November 2, 2009

Lotus 123 for Macintosh

User Manual for Lotus 123 for Macintosh

Which is more pathetic?

  • That I actually purchased Lotus 1-2-3 for Macintosh in 1991?
  • That 18 years later, I still have the manual?

Friday, October 30, 2009


This is cute, pointed out in a tweet by Matt Cutts. There is a Halloween easter egg in Google's robots.txt file.

Disallow: /errors/
Disallow: /voice/fm/
Disallow: /voice/media/
Disallow: /voice/shared/
Disallow: /app/updates

User-agent: Kids
Disallow: /tricks
Allow: /treats


Sadly neither /tricks nor /treats actually exists.

Monday, October 26, 2009

Current GMail Ads

Here are the ads currently showing in my GMail:

Insomnia Ads in GMail

I wonder what they are trying to tell me...

Friday, October 23, 2009

ARM Cortex A5

On October 21 ARM announced a new CPU design at the low end of their range, the Cortex A5. It is intended to replace the earlier ARM7, ARM9, and ARM11 parts. The A5 can have up to 4 cores, but given its positioning as the smallest ARM, the single-core variant will likely dominate.

To me the most interesting aspect of this announcement is that when the Cortex A5 ships (in 2011!) all ARM processors, regardless of price range, will have an MMU. The older ARM7TDMI relied on purely physical addressing, necessitating the use of uClinux, VxWorks, or a similar RTOS. With the ARM Cortex processor family, any application requiring a 32 bit CPU will have the standard Linux kernel as an option.

Story first noted at Ars Technica

Cortex A5


Update: In the comments Dave Cason and Brooks Moses point out that I missed the ARM Cortex-M range of processors, which are considerably smaller than the Cortex-A. The Cortex-Ms do not have a full MMU, though some parts in the range have a memory protection unit. So it is not the case that all ARM processors will now have MMUs. Mea culpa.

Monday, October 19, 2009

Tweetravenous Solutions

Recently Rob Diana wrote a guest post at WorkAwesome, Balancing Work And Social Media Addiction. In it, he said:

"If your day job is sitting in a cube or corporate office somewhere, then you will need to limit your activity in some way. If you want to be like Robert Scoble or Louis Gray, you will have to give up some sleep to stay active on several sites."

Twitter Skin Patch

We at Tweetravenous Solutions have another solution to offer, which allows our customers to handle a full day's work, a full day of social media, and still get a full night's sleep! We call it the Social Media Patch. Our patented algorithms will monitor your social media feeds, collating and condensing updates from those you follow and encoding them in a complex matrix of proteins, amino acids, and caffeine. It's all infused into an easy-to-apply skin patch which can be worn inconspicuously beneath clothing. Even better, it's in Real Time! (note1)

Social Media Skin patches for twitter, friendfeed, and RSS.  

Never go without your twitter fix again!

Feed your friendfeed need!

Works with any RSS feed, too! (note2)

Wait, there's more! For the true Social Media Expert wanting to get an edge on the competition, we offer additional treatment options administered in our state-of-the-art facility!
Twitter IV drip

Coming soon: two way communication! That's right, you'll be able to retweet or reply using only the power of your mind! (and bodily secretions) (note3)


Don't wait, call now!


note1: Real Time is defined as 48 hours, to allow for processing and express overnight delivery.

note2: pubsubhubbub and rssCloud support coming soon.

note3: Two way communication will incur additional cost, and add processing time. US Postal Service will not accept bodily secretions for delivery.


Author's note: I enjoyed Rob Diana's article, and recommend that people read it. When I saw "Social Media Addiction" in its title, the thought of a nicotine patch came spontaneously to mind. It became a moral imperative to write this up.

Also: sadly, modern browsers no longer render the <blink> tag. Too bad, it would have been sweet.


Wednesday, October 14, 2009

AMD IOMMU: Missed Opportunity?

In 2007 AMD implemented an I/O MMU in their system architecture, which translates DMA addresses from peripheral devices to a different address on the system bus. There were several motivations for doing this:

    Direct Virtual Memory Access
  1. Virtualization: DMA can be restricted to memory belonging to a single VM and to use the addresses from that VM, making it safe for a driver in that VM to take direct control of the device. This appears to be the largest motivation for adding the IOMMU.
  2. High Memory support: For I/O buses using 32 bit addressing, system memory above the 4GB mark is inaccessible. This has typically been handled using bounce buffers, where the hardware DMAs into low memory which the software will then copy to its destination. An IOMMU allows devices to directly access any memory in the system, avoiding copies. There are a large number of PCI and PCI-X devices limited to 32 bit DMA addresses. Amazingly, a fair number of PCI Express devices are also limited to 32 bit addressing, probably because they repackage an older PCI design with a new interface.
  3. Enable user space drivers: A user space application has no knowledge of physical addresses, making it impossible to program a DMA device directly. The I/O MMU can remap the DMA addresses to be the same as the user process, allowing direct control of the device. Only interrupts would still require kernel involvement.

I/O Device Latency

Multiple levels of bus bridging

PCIe has a very high link bandwidth, making it easy to forget that its position in the system imposes several levels of bridging, with correspondingly long latency to get to memory. The PCIe transaction first traverses the Northbridge and any internal switching or bus bridging it contains, on its way to the processor interconnect. The interconnect is HyperTransport for AMD CPUs, and QuickPath for Intel. Depending on the platform, the transaction might have to travel through multiple CPUs before it reaches its destination memory controller, where it can finally access its data. A PCIe Read transaction must then wend its way back through the same path to return the requested data.

Graphics device on CPU bus

Much lower latency comes from sitting directly on the processor bus, and there have been systems where I/O devices sit directly beside CPUs. However CPU architectures rev that bus more often than it is practical to redesign a horde of peripherals. Attempts to place I/O devices on the CPU bus generally result in a requirement to maintain the "old" CPU bus as an I/O interface on the side of the next system chipset, to retain the expensive peripherals of the previous generation.

The Missed Opportunity: DMA Read pipelining

An IOMMU is not a new concept. Sun SPARC, some SGI MIPS systems, and Intel's Itanium all employ them. Once you have taken the plunge to impose an address lookup between a DMA device and the rest of the system, there are other interesting things you can do in addition to remapping. For example, you can allow the mapping to specify additional attributes for the memory region. Knowing whether it is likely to do long, contiguous bursts or short concise updates allows optimizations to reduce latency by reading ahead, to transfer data faster by pipelining.

DMA Read without prefetch (CONSISTENT) vs. DMA Read with prefetch (STREAMING)

AMD's IOMMU includes nothing like this. Presumably they wanted to confine the software changes to the hypervisor alone, since choosing STREAMING versus CONSISTENT requires support in the driver of the device initiating DMA. But they could have preserved software compatibility by making CONSISTENT the default, with STREAMING used only by drivers which choose to implement it.

What About Writes?

The IOMMU in SPARC systems implemented additional support for DMA write operations. Writing less than a cache line is inefficient, as the I/O controller has to fetch the entire line from memory and merge the changes before writing it back. This was a problem for Sun, which had a largish number of existing SBus devices issuing 16 or 32 byte writes while the SPARC cache line had grown to 64 bytes. A STREAMING mapping relaxed the requirement for instantaneous consistency: if a burst wrote the first part of a cache line, the I/O controller was allowed to buffer it in hopes that subsequent DMA operations would fill in the rest of the line. This is an idea whose time has come... and gone. The PCI spec takes great care to emphasize cache line sized writes using MWL or MWM, an emphasis which carries over to PCIe as well. There is little reason now to design coalescing hardware to optimize sub-cacheline writes.

DMA Write without buffering (CONSISTENT) vs. DMA Write with buffering (STREAMING)

Closing Disclaimer

Maybe I'm way off base in lamenting the lack of DMA read pipelining. Maybe all relevant PCIe devices always issue Memory Read Multiple requests for huge chunks of data, and the chipset already pipelines data fetch during such large transactions. Maybe. I doubt it, but maybe...

Tuesday, October 13, 2009

WinXP Network Diagnostic

Please contact your network administrator or helpdesk

Thanks, but I could probably have figured that out without help from the diagnostic utility.

Sunday, October 11, 2009

A Cloud Over the Industry


Regrettably, based on Microsoft/Danger's latest recovery assessment of their systems, we must now inform you that personal information stored on your device - such as contacts, calendar entries, to-do lists or photos - that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger. That said, our teams continue to work around-the-clock in hopes of discovering some way to recover this information. However, the likelihood of a successful outcome is extremely low.

This kind of monumental failure casts a cloud over the entire industry (pun intended). How did it happen? I do not believe Danger could have operated without sufficient redundancy and without backups for so many years, it simply does not pass the smell test. There must be more to the story.

One possible theory for the unrecoverable loss of all customer data is sabotage by a disgruntled insider. This is, itself, pretty unbelievable. Nonetheless, according to GigaOM, when Microsoft acquired Danger "...while some of the early investors got modest returns, I am told that the later-stage investors made out like bandits." I wonder if the early employees also made only a modest payout in return for their years of effort, and had to watch the late stage investors take everything. Danger filed for IPO in late 2007, but was suddenly acquired by Microsoft in early 2008. What if the investors determined they would not make as much money in a large IPO as they would in a carefully arranged, smaller acquisition by Microsoft? For example, the term sheet might have included a substantial liquidation preference but specified that all shares revert to common if the exit is above N dollars. For the investors, the best exit is (N-1) dollars; they maximize their return by keeping a larger fraction of a smaller pie. If they had drag-along rights, they could force the acquisition over the objections of employees and management. If the long-time employees watched their payday be snatched away... a workforce of disgruntled employees seems quite plausible.

This is all, of course, complete speculation on my part. I simply cannot believe that the company could accidentally lose all data, for all customers, irrevocably. It doesn't make sense.

Danger Sidekick

Other links relating to this story:

Thursday, October 8, 2009


In a system with multiple network interfaces, can you constrain a packet to go out one specific interface? If you answered "bind() the socket to an address," you should read on.

Why might one need to strictly control where packets can be routed? The best use case I know is when ethernet is used as a control plane inside a product. Packets intended to go to another card within the chassis must not, under any circumstances, leave the chassis. You don't want bugs or misconfiguration to result in leaking control traffic.

The bind() system call is frequently misunderstood. It binds the socket to a particular IP address: only packets destined to that IP address will be received, and transmitted packets will carry that IP address as their source. bind() does not control anything about the routing of transmitted packets. For example, if you bind to the IP address of eth0 but send a packet to a destination where the kernel's best route goes out eth1, the kernel will happily send the packet out eth1 with the source IP address of eth0. This is perfectly valid for TCP/IP, where packets can traverse unrelated networks on their way to the destination.

In Linux, to control the physical topology of communication you use the SO_BINDTODEVICE socket option. Note that setting it requires the CAP_NET_RAW capability, so this program must run as root.

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <net/if.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    int s;
    struct ifreq ifr;

    if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket");
        return 1;
    }

    memset(&ifr, 0, sizeof(ifr));
    snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), "eth0");
    if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE,
                   (void *)&ifr, sizeof(ifr)) < 0) {
        perror("setsockopt(SO_BINDTODEVICE)");
        return 1;
    }

    /* ... use the socket; all traffic is confined to eth0 ... */
    return 0;
}
SO_BINDTODEVICE forces packets on the socket to egress only via the bound interface, regardless of what the IP routing table would otherwise choose. Similarly, only packets which ingress the bound interface will be received on the socket; packets arriving on other interfaces will not be delivered to it.

There is no particular interaction between bind() and SO_BINDTODEVICE. It is certainly possible to bind to the IP address of the interface to which one will also SO_BINDTODEVICE, as this will ensure that the packets carry the desired source IP address. It is also permissible, albeit weird, to bind to the IP address of one interface but SO_BINDTODEVICE a different interface. It is unlikely that any ingress packets will carry the proper combination of destination IP address and ingress interface, but for very special use cases it could be done.

Monday, October 5, 2009

Thursday, October 1, 2009

Yield to Your Multicore Overlords

ASIC design is all about juggling multiple competing requirements. You want to make the chip competitive by increasing its capabilities, by reducing its price, or both. Today we'll focus on the second half of that tradeoff, reducing the price.

Chip fabrication is a statistical game: of the parts coming off the fab, some percentage simply do not work. The vendor runs test vectors against the chips and throws away the ones which fail. The fraction of working parts is called the yield, and it is the primary factor determining the cost of the chip. If the yield is bad, meaning you have to fab a whole bunch of chips to get one that actually works, you have to charge more for that one working chip.

To illustrate why most chips coming out of the fab do not work, I'd like to walk through part of manufacturing a chip. This information is from 1995 or so, when I was last seriously involved in a chip design, and describes a 0.8 micron process. So it is completely old and busted, but is sufficient for our purposes here.

Begin by placing the silicon wafer in a nitrogen atmosphere. You deposit a photo-resist on the wafer, basically a goo which hardens when exposed to ultraviolet light. You place a shadow mask in front of a light source; the regions exposed to light will harden while those under the shadow mask remain soft. You then chemically etch off the soft regions of the photo-resist, leaving exposed silicon where they were. The hardened regions of photo-resist stay put.

Next you heat the wafer to 400 degrees and pipe phosphorus into the nitrogen atmosphere. As the phosphorus atoms heat they begin moving faster and bouncing off the walls of the chamber. Some of them move fast enough that when they strike the surface of the wafer they break the Si crystal lattice and embed themselves in the silicon. If they strike the hardened photo-resist, they embed themselves in the resist; very, very few are moving fast enough to crash all the way through the photoresist into the silicon underneath.

Electron Microscopy image of a silicon die

Next you use a different chemical process to strip off the hardened photoresist. You are left with a wafer which has phosphorus embedded in the places you wanted. Now you heat the wafer even higher, hot enough that the silicon atoms can move around more freely; they move back into position and reform the crystal lattice, burying the phosphorus atoms embedded within. This is called annealing. Now you have the n+ regions of the transistors.

You repeat this process with aluminum ions to get the p regions. Now you have transistors. Next you connect the transistors together with traces of aluminum (which I won't go into here). You cut the wafer to separate the die, and place each die in a package. You connect bonding wires from the edge of the die to the pins of the chip. And voila, you're done.

It should be apparent that this is a probabilistic process. Sometimes, based purely on random chance, not enough phosphorus atoms embed themselves into the silicon and your transistors don't turn on. Sometimes too much phosphorus embeds and your transistors won't turn off. Sometimes the Si lattice is too badly damaged and the annealing is ineffective. Sometimes the metal doesn't line up with the vias. Sometimes a dust particle lands on the chip and you deposit metal on top of the dust mote. Sometimes the bonding doesn't line up with the pads. Etc etc.

This is why the larger a chip grows, the more expensive it becomes. It's not because raw silicon wafers are particularly costly, it's that the probability of there being a defect somewhere grows ever greater as the die becomes larger. The bigger the chip, the lower the yield of functional parts.

Intel Nehalem chip with SRAM highlighted

For at least 15 years that I know of, chip designs have improved their yield using redundancy. The earliest such efforts were in on-chip memory: if your chip is supposed to include N banks of SRAM, put N+1 banks on the die, connected with wires in the topmost layer which can be cut using a laser. The SRAM occupies a large percentage of the total chip area, so statistically defects are likely to land within the SRAM. You can then cut out the defective bank, turning a defective chip into one that can be used. More recent silicon processes use fuses blown by the test fixture instead of lasers.

Massively Multicore Processors

Yesterday NVidia announced Fermi, a beast of a chip with 512 CUDA cores. They are arranged in 16 blocks of 32 cores each. At this kind of scale, I suspect it makes sense to include extra cores in the design to improve the yield. For example, perhaps each block actually has 33 cores in the silicon so that a defective core can be tolerated.

In order to avoid having weird performance variations in the product, the extra resources are generally inaccessible if not used for yield improvement. That is, even though the chip might have 528 cores physically present, no more than 512 could ever be used.

NVidia Fermi GPU

Monday, September 28, 2009

49.710269618056 Days

Western Digital recently corrected a firmware issue in certain models of VelociRaptor where the drive would erroneously report an error to the host after 49 days of operation. Somewhat inconveniently for RAID arrays, if all drives powered on at the same time they would all report an error at the same time.

Informed speculation: the drive reports an error after exactly 49 days, 17 hours, 2 minutes, 47 seconds, and 294.999 milliseconds of operation. That is the moment where a millisecond timer overflows an unsigned 32 bit integer.

WD VelociRaptor

Tuesday, September 22, 2009

A Pudgier Tux

Tux the Penguin

At LinuxCon 2009 a discussion arose about the Linux kernel becoming gradually slower with each new release. "Yes, it's a problem," said Linus Torvalds. "The kernel is huge and bloated, and our icache footprint is scary. I mean, there is no question about that. And whenever we add a new feature, it only gets worse."

I think the addition of new features is a red herring, and the real problem is in letting Tux eat the herring. Just hide the jars, maybe get a treadmill, and everything will go back to the way it was.

Pickled Herring

Story originally noted in The Register.

Thursday, September 17, 2009

Jasper Forest x86

Intel has a long but uneven history in the embedded market. In the early days of the personal computer Intel released the 80286 as a follow-on to the original 8086. There actually was an 80186: it was a more integrated version of the 8086 aimed at embedded applications. Intel's interest in embedded markets has waxed and waned over the years, but it is an area where Intel still has room for significant growth.

I wrote about x86 for embedded use about a year and a half ago, with four main points:

  • Volume Discounts
    PC pricing thresholds at 50,000 units have to be rethought for a less homogeneous market
  • System on Chip (SoC)
    Board space is at a premium; we need fewer components in the system
  • Production lifetime
    These systems are not redesigned every few months; chips have to remain in production longer
  • Power and heat
    Airflow is more constrained, and the system has other heat-generating components besides the CPU complex
Nehalem vs Jasper Forest

At the Intel Developer Forum next week Intel is expected to focus on embedded applications for its products. In advance of IDF Intel announced the Jasper Forest CPU, a System on Chip version of Nehalem. It is based on a 1, 2, or 4 core CPU plus an integrated PCI-e controller, so it does not need a separate northbridge chip. Intel also committed to a 7 year production lifetime, allowing the part to be designed into products which will remain on the market for a while. I'd speculate that Intel will offer industrial temperature grade parts as well, perhaps at lower frequencies.

Jasper Forest is particularly suited for and aimed at storage applications. It has additional hardware for RAID support (presumably XOR & ECC generation), and a feature to use main memory as a nonvolatile buffer cache. When loss of power is detected the chip will flush any pending writes out to RAM and then set the DRAM to self-refresh before shutting down. By including a battery sufficient to power the DRAM, the system can avoid the need for a separate nonvolatile data buffer like SRAM.

This is a good approach for Intel: target silicon at specific high margin, growing application areas. Go for markets with moderate power consumption requirements, as x86 is clearly not ready for small battery powered applications like phones. Ars Technica discusses Intel's upcoming weapon for getting into mobile and other battery powered markets, a version of their 32nm process which reduces leakage current to almost nothing. An idle x86 would consume essentially no power, which would be huge.