Showing posts with label bestof. Show all posts

Thursday, December 2, 2010

Engineering in a Small World

I currently work in a relatively large development team. As is the case with every team of that size, we are organized as one enormous group where everybody works with everybody else, every day. I've graphed out our team interactions. I'm sure it looks a lot like your team, right?

fully connected graph of 20 people

loosely connected group of 20 people

Wait: does that sound weird, based on your experience? You're right, I made it all up. We're not organized as one enormous group, we're grouped into smaller teams like everybody else. Yet to a degree, the larger group has to be able to coordinate between every single person, every day. How is this accomplished?

Even in a relatively small group of people, a certain pattern emerges. Most individuals in the group interact with a small number of others, but a few are far more highly connected and routinely interact with dramatically more. These connectors result in enormous groups, loosely coupled. This is the phenomenon which leads to the six degrees of separation theory: that on average any two people on the planet can be connected by six friends of friends. This pattern is also the basis of the six degrees of Kevin Bacon, who is one of those "highly connected" nodes in the graph of film actors.
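To make the effect of connectors concrete, here is a minimal sketch in C. It assumes a made-up group of 20 people in four teams of five, with neighboring teams joined by a single link, and measures the average number of hops between any two people before and after one person (person 0, the hypothetical connector) gets to know someone on every other team. The layout and all function names are illustrative assumptions, not data from any real team.

```c
#include <string.h>

#define PEOPLE 20
#define TEAM_SIZE 5

/* adj[i][j] is 1 when person i and person j routinely interact. */
typedef struct {
    unsigned char adj[PEOPLE][PEOPLE];
} group_t;

void connect_people(group_t *g, int a, int b) {
    g->adj[a][b] = 1;
    g->adj[b][a] = 1;
}

/* Four teams of five: everyone knows their teammates, and neighboring
 * teams are joined by exactly one link (hypothetical layout). */
void build_teams(group_t *g) {
    memset(g, 0, sizeof(*g));
    for (int i = 0; i < PEOPLE; i++) {
        for (int j = i + 1; j < PEOPLE; j++) {
            if (i / TEAM_SIZE == j / TEAM_SIZE) connect_people(g, i, j);
        }
    }
    for (int t = 0; t + 1 < PEOPLE / TEAM_SIZE; t++) {
        connect_people(g, t * TEAM_SIZE + TEAM_SIZE - 1, (t + 1) * TEAM_SIZE);
    }
}

/* Person 0 becomes a connector by getting to know one person on
 * each of the other teams. */
void add_connector(group_t *g) {
    for (int t = 1; t < PEOPLE / TEAM_SIZE; t++) {
        connect_people(g, 0, t * TEAM_SIZE + 2);
    }
}

/* Average number of hops between all pairs of people, using a
 * breadth-first search from each person. Returns -1.0 if some pair
 * cannot reach each other at all. */
double average_separation(const group_t *g) {
    long total = 0;
    for (int start = 0; start < PEOPLE; start++) {
        int dist[PEOPLE], queue[PEOPLE], head = 0, tail = 0;
        for (int i = 0; i < PEOPLE; i++) dist[i] = -1;
        dist[start] = 0;
        queue[tail++] = start;
        while (head < tail) {
            int cur = queue[head++];
            for (int next = 0; next < PEOPLE; next++) {
                if (g->adj[cur][next] && dist[next] < 0) {
                    dist[next] = dist[cur] + 1;
                    queue[tail++] = next;
                }
            }
        }
        for (int i = 0; i < PEOPLE; i++) {
            if (dist[i] < 0) return -1.0;
            total += dist[i];
        }
    }
    return (double)total / (PEOPLE * (PEOPLE - 1));
}
```

Calling average_separation() before and after add_connector() should show the average dropping, even though the connector adds only three links to the whole group.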


 
The Small World Pattern

This phenomenon is called the Small World pattern. I first read about it in Here Comes Everybody.

Cover of Here Comes Everybody

Here Comes Everybody, chapter 9.
... the chance that you know [a highly connected person] is high. And the "knowing someone in common" link - the thing that makes you exclaim "Small World!" with your seat mate - is specifically about that kind of connection.


The Small World Pattern seems obvious, in hindsight. Of course some people are simply more social and outgoing than others. They make an effort to meet people. They form connections. They are far more connected to other people than most.

The rest of this musing will concern the Small World Pattern in engineering organizations.


 
The Small World Scoffs at Your Orgchart

Connections can be forced, organizationally: a regular meeting between tech leads from related projects, for example. Connections can also happen by happenstance, as when members of different teams work at adjacent desks. However, the strongest connections happen because some percentage of the engineering population wants to be connected. They are outgoing, and enjoy talking to people outside their immediate coworkers. These connections are far more persistent, and likely to survive past the end of any particular project or recurring meeting.


 
No Group is an Island, but Some are Peninsulas

Something which can happen in a large company: you work on an infrastructure project which should be applicable in a number of different areas, yet never seems to get the attention you think it deserves. Other groups which could leverage your work instead do their own thing, and later only grudgingly evaluate your system before pronouncing it unfit. Is it because you've misunderstood their requirements? Is it because they think your implementation is poor?

More likely, it's because you lack connections from your group to others. It takes just one person in the right place at the right time to say "we should go talk to John on Project Foo." When these suggestions are made organically and at the right time, they are far more likely to be acted upon. When such a suggestion comes as an edict long after the decision point, such as via some recurring meeting, it is far less likely to be received favorably.


 
To the Connector Go the Spoils

Being highly connected within an engineering organization reaps many rewards. People associate connectors with the good outcomes of the serendipitous introductions they make.

Being highly connected within an engineering organization also has some downsides. I wish I understood the psychological reason why, but nonetheless it happens: your technical competence as an individual contributor will be questioned more often if you spend significant time interacting with other groups. It's weird.


 
Closing Thoughts

Engineers are human, though in your daily work it might not always seem so. Understanding human behavior is as important in our field as in any other. I highly recommend Shirky's Here Comes Everybody, and his subsequent Cognitive Surplus. Both are excellent.

Thursday, October 28, 2010

Toward A Faster Web: Increase the Speed of Light

fiber optic cross section
Speed Limit 202,700 km/sec

Fiber optic strands have a central core of material with a high refractive index surrounded by a jacket of material with a slightly lower index. The ratio of the two is set to cause total internal reflection, where the light is confined to the central region and won't diffuse out into the cladding.

The refractive index is a measure of the speed of light in a medium. The speed of light in vacuum is 300,000 kilometers per second, which is defined as an index of 1. The core of a typical fiber optic cable has an index of 1.48, so the speed of light there is (300,000/1.48) = 202,700 kilometers per second.


 

Impact

It is roughly 8,200 kilometers from Tokyo to San Francisco.

transpacific fiber map

The round trip time through transpacific fibers due solely to speed of light is roughly (2 * 8,200 km / 202,700 km/sec) = 81 milliseconds. Fibers do not run directly from the San Francisco Bay to the Tokyo harbor, so the actual distance is somewhat longer. Traceroute across the NTT network shows the round trip across the ocean is about 100 msec. A small portion of this is FIFO delay in regenerators along the ocean floor and queueing delay in switches at either end. Another portion is software overhead, as traceroute is handled in the slowpath of typical routers. The rest is the time it takes for light to propagate across the span.

7  ae-7.r20.snjsca04.us.bb.gin.ntt.net (129.250.5.52)  50.115 ms
   ae-8.r21.snjsca04.us.bb.gin.ntt.net (129.250.5.56)  51.020 ms
   ae-7.r20.snjsca04.us.bb.gin.ntt.net (129.250.5.52)  50.165 ms
8  as-0.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.5.82)  154.821 ms
   as-2.r20.tokyjp01.jp.bb.gin.ntt.net (129.250.2.35)  147.516 ms  153.187 ms
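The propagation arithmetic above can be sketched in a few lines of C. The function names are made up for illustration, and the vacuum speed is rounded to 300,000 km/sec as in the text.

```c
#define C_VACUUM_KM_PER_SEC 300000.0   /* rounded, as in the text */

/* Speed of light in a medium with the given refractive index, in km/sec. */
double speed_in_medium(double refractive_index) {
    return C_VACUUM_KM_PER_SEC / refractive_index;
}

/* Round trip propagation delay in milliseconds for a given one-way
 * distance. Only the time of flight is modeled: no regenerators,
 * queueing, or software overhead. */
double round_trip_msec(double one_way_km, double refractive_index) {
    return 2.0 * one_way_km / speed_in_medium(refractive_index) * 1000.0;
}
```

round_trip_msec(8200.0, 1.48) comes out at roughly 81 msec, the figure used above. The gap between that and the ~100 msec seen in the traceroute is the longer real-world cable path plus the other delays mentioned.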

 

Suggestion

Speed Limit 222,970 km/sec

100 Gigabit Ethernet is nearly done, with products already available on the market. Research into technologies for Terabit links is ramping up now, including one at UCSB which triggered this musing. Dan Blumenthal, a UCSB professor involved in the effort, said that new materials for the fiber optics might be considered: "We won't start out with that, but it'll move in that direction," (quoting from Light Reading).

Fiber with a 10% lower refractive index would increase the speed of light in the medium by roughly 10% (closer to 11%, since speed is inversely proportional to the index). It would decrease the round trip time across the Pacific from ~100 msec to ~90 msec. One of my favorite Star Trek lines is from Déjà Q, a casual suggestion to "Change the gravitational constant of the universe." This is a case where we can make the web faster by changing the speed of light, though we need only do so within fiber optic cables and not the entire universe.
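A quick check of that what-if, as a sketch in C. It assumes the whole round trip is propagation delay, which slightly overstates the gain since a small part of the measured ~100 msec is queueing and software overhead. The function names are hypothetical.

```c
/* Factor by which the speed of light in the fiber changes when the
 * refractive index goes from old_index to new_index. Speed is
 * proportional to 1 / index. */
double speed_ratio(double old_index, double new_index) {
    return old_index / new_index;
}

/* New round trip time, assuming all of old_msec is propagation delay
 * (an assumption; in practice a small portion is not). Time of flight
 * is proportional to the index. */
double new_round_trip_msec(double old_msec, double old_index, double new_index) {
    return old_msec * new_index / old_index;
}
```

With old_index = 1.48 and new_index = 1.48 * 0.9, the speed goes up by a factor of about 1.11 and a 100 msec round trip drops to 90 msec.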


 

Practicalities

I admit that I have absolutely no understanding of the chemistry involved in fiber optics. Silica is doped with compounds to get the desired properties, including some which raise or lower the refractive index. There are tradeoffs between clarity/lossiness, dispersion, and refractive index which I don't understand. However, I think it's important to properly weigh the value of lowering the refractive index: it makes the web faster. We can do a lot with caching content locally and distributing datacenters around the planet, but in the end sometimes bits need to go off to find the original source no matter where it might be.

Also, to state it clearly: this consideration is only applicable to long range lasers, with a reach in tens of kilometers. The initial Terabit Ethernet work will almost certainly be on short range optics for use within facilities, where the propagation delay is insignificant compared to other delays in the system. It's more important to optimize the power consumption and cost of short range lasers than to worry about microseconds of delay. Long reach optics have different constraints, and there we have a once-in-a-generation opportunity to make wide area networks faster.

Wednesday, August 18, 2010

x86 vs ARM Mobile CPUs

The ARM architecture dominates mobile computing. It is used in all popular mobile phones and in a huge percentage of battery powered devices generally. This is due partly to its good overall performance, but especially due to its performance per watt expended. ARM chips consume very little power when compared to x86, and ARM's power consumption still excels even when compared to other RISC chips. At one time even Intel manufactured ARM chips, the result of its purchase of the DEC semiconductor business and its excellent StrongARM design. In 2006 Intel sold its ARM products to Marvell Semiconductor, committing to x86 for every segment of the computing market.

It's easy to assume that this state of affairs will continue, and that Intel will never successfully compete in the mobile market. I suspect that is too simplistic an assumption. There are two main sources of power dissipation in modern microprocessors: the power consumed by transistors actively switching, and the power lost to leakage current.

active current, leakage current into substrate
x86 vs ARM: Active Power

It requires power to switch a CMOS transistor 0->1 or 1->0, so one way to reduce power consumption is to have fewer transistors and to switch them at a lower frequency. x86 is at a disadvantage here compared to ARM, which Intel and AMD's design teams have to cover with extra work and cleverness. The vagaries of the x86 instruction set burden it with hardware logic which ARM does not require.

  • Since the Pentium Pro, Intel has decoded complex x86 instructions down to simpler micro-ops for execution. AMD uses a similar technique. This instruction decode logic is active whenever new opcodes are fetched from RAM. ARM has no need for this logic, as even its alternate Thumb encoding is a relatively straightforward mapping to regular ARM instructions.
  • x86_32 exposes only a few registers to the compiler. To achieve good performance, x86 CPUs implement a much larger number of hardware registers which are dynamically renamed as needed. ARM does not require such extensive register renaming logic.
  • Nearly every ARM instruction can be executed conditionally, so simple if-then-else constructs can be handled without branches. x86 relies much more heavily on branches, but frequent branches can stall the pipeline on a processor. Good performance in x86 requires extensive branch prediction hardware, where ARM is served with a far simpler implementation.
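As a rough illustration of what "if-then-else without branches" means, here is a sketch in C. On ARM a compiler might turn the first function into a compare followed by a conditionally executed move; the second function spells out the same idea in portable C using a mask. The function names are made up, and what any particular compiler actually emits is an assumption.

```c
#include <stdint.h>

/* Branching version: the CPU has to predict which way the "if" goes. */
int32_t clamp_to_zero_branch(int32_t x) {
    if (x < 0) {
        return 0;
    }
    return x;
}

/* Branch-free version: turn the comparison into an all-ones or
 * all-zeros mask, then use the mask to select the result. */
int32_t clamp_to_zero_branchless(int32_t x) {
    int32_t keep = -(int32_t)(x >= 0);   /* all ones if x >= 0, else 0 */
    return x & keep;
}
```

Both functions return the same value for every input. The second has no data-dependent jump for a branch predictor to get wrong, which is the property conditional execution provides in hardware.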

x86 vs ARM: Leakage Current

Intel Nehalem processor die

Leakage current became a significant contributor to power consumption in 2003 with the move from 0.18 to 0.13 micron feature sizes, and has become more significant in each subsequent generation. The industry is now moving into 0.032 micron technologies.

A capacitor is formed when two conductive materials are separated by an insulator, called the dielectric. The capacitance is determined by the quality of the insulating material, quantified by the dielectric constant k. Higher k means more capacitance. "Leakage" is current which is able to flow out of the ASIC transistors and into the silicon substrate. To reduce the current leaking out, one needs to make a better dielectric between the transistor and the bulk of the silicon. This is generically referred to as high-k silicon technology.

As we're now talking about silicon fabrication techniques, we have to start talking about Intel specifically rather than the x86 architecture in general. Intel began using a high-k dielectric in production in 2007, during the 45 nm generation of parts. The rest of the industry has been experimenting with such materials, but is only now rolling them into the 32 nm generation. Intel hasn't stopped working on the technique; its 32 nm process benefits from the last several years of experience.


x86 vs ARM: Predicting The Future

Leakage current becomes more significant with each generation of process technology. The power consumed by actively switching transistors has been radically reduced over the last few years, leaving leakage as the more significant source of current consumption. It is difficult to estimate how serious the effect is, but this article from March 2008 shows leakage current starting out relatively insignificant in 180 nm silicon but growing to nearly 40% of total power consumption in a 50 nm process.
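The split between the two sources of power can be sketched with the usual first-order CMOS model: switching power is proportional to activity, capacitance, the square of the voltage, and frequency, while leakage power is simply voltage times leakage current. The function names and any numbers plugged in are hypothetical, chosen only to show the shape of the calculation.

```c
/* Dynamic (switching) power in watts: activity * C * V^2 * f. */
double active_power_watts(double activity, double capacitance_farads,
                          double volts, double hertz) {
    return activity * capacitance_farads * volts * volts * hertz;
}

/* Static power in watts lost to leakage current. */
double leakage_power_watts(double volts, double leakage_amps) {
    return volts * leakage_amps;
}

/* Fraction of total power that is lost to leakage. */
double leakage_fraction(double active_watts, double leakage_watts) {
    return leakage_watts / (active_watts + leakage_watts);
}
```

For example, a made-up chip with 1.5 W of switching power (activity 0.15, 10 nF switched, 1.0 V, 1 GHz) and 1 A of leakage at 1.0 V loses 40% of its total power to leakage. Nothing about the instruction set appears in the leakage term, which is why the advantage shifts to the fab.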

So far as I can see, this trend will continue. Leakage current will soon become the dominant factor in CPU power consumption. In fact, in 32 nm processes it might already be the primary factor. This is where the game changes: the advantage for total power consumption shifts away from the efficiency of the CPU architecture and design, and to the process technology of the fab. Presumably, this trend informed Intel's decision to sell their ARM assets to Marvell: there is little reason to enrich a competitor if the advantages of doing so will diminish over time.


There is still room for clever design, of course. To reduce active power consumption, processor designs have long stopped the clock to unused portions of the CPU. To reduce leakage current, AMD is taking the next step to actually remove the power supply to those portions of the CPU. For ARM, that design choice makes even more sense. ARM has no control over the fab, so its designs have to minimize assumptions about the underlying silicon technology.

Right now ARM reigns supreme in the mobile space, but the strengths which gave it an advantage over x86 are rapidly becoming less compelling. Having to compete directly on silicon process sophistication moves the game onto Intel's turf, which Intel is happy to capitalize on with its Medfield platform. It's a great time to be in the mobile space.

Wednesday, July 28, 2010

Exploring libjit

Continuing investigation into JIT compilation for virtual machines, today we will delve into some of the plumbing. libjit is a library for implementing JIT compilers in a platform independent way. A virtual machine calls functions within libjit to emit abstract, low level operations such as arithmetic computation or branches. libjit will handle translation to the native opcodes of the platform on which it is running. x86 and ARM are currently supported, with other CPU architectures falling back to a slower interpreted path. libjit was created for DotGNU portable.NET by Rhys Weatherley, then the lead developer of the project. It is now used in the Mono project, an open source effort to implement the .NET CLR.

To investigate libjit we'll turn to the Official Subroutine for Programming Language Evaluation on the Internet, Fibonacci numbers. We'll start with an iterative C implementation:

unsigned long long fib (unsigned int n) {
  if (n <= 2) return 1;
  unsigned long long a = 1, b = 1, c;
  do {
    c = a + b;
    b = a;
    a = c;
    n--;
  } while (n > 2);
  return c;
}

libjit is designed to be driven from an Abstract Syntax Tree description of an algorithm. By descending through the AST outputting JIT operations at each node, you can construct native instructions for that algorithm. Had we written fib() in an interpreted language, we'd start with the AST within the interpreter. For the purposes of this example I've skipped the AST, manually constructing a fibonacci number routine by translating each line of C source code into JIT calls. The code is below, for your edification and bemusement.

#include <stdio.h>
#include <jit/jit.h>

jit_function_t create_fib_jit (jit_context_t *context) {
  jit_type_t param;
  jit_type_t signature;
  jit_function_t function;
  jit_value_t n, a, b, c;
  jit_value_t constant, compare, n_minus_1, a_plus_b;
  jit_label_t label1 = jit_label_undefined;
  jit_label_t label2 = jit_label_undefined;

  jit_init();
  *context = jit_context_create();
  jit_context_build_start(*context);

  /* Build the function signature, fib(unsigned int) */
  param = jit_type_uint;
  signature = jit_type_create_signature(jit_abi_cdecl, jit_type_ulong,
                                        &param, 1, 1);
  function = jit_function_create(*context, signature);

  /* Begin emitting instructions */

  /* "if (n <= 2)" */
  n = jit_value_get_param(function, 0);
  constant = jit_value_create_nint_constant(function, jit_type_uint, 2);
  compare = jit_insn_le(function, n, constant);
  jit_insn_branch_if_not(function, compare, &label1);

  /* "return 1;" */
  constant = jit_value_create_nint_constant(function, jit_type_ulong, 1);
  jit_insn_return(function, constant);

  /* "if (n <= 2)" else branches here */
  jit_insn_label(function, &label1);

  /* "unsigned long long a = 1, b = 1, c;" */
  constant = jit_value_create_nint_constant(function, jit_type_ulong, 1);
  a = jit_value_create(function, jit_type_ulong);
  jit_insn_store(function, a, constant);
  b = jit_value_create(function, jit_type_ulong);
  jit_insn_store(function, b, constant);
  c = jit_value_create(function, jit_type_ulong);

  /* "do {" */
  jit_insn_label(function, &label2);

  /* "c = a + b;
   *  b = a;
   *  a = c;" */
  a_plus_b = jit_insn_add(function, a, b);
  jit_insn_store(function, c, a_plus_b);
  jit_insn_store(function, b, a);
  jit_insn_store(function, a, c);

  /* "n--;" */
  constant = jit_value_create_nint_constant(function, jit_type_uint, 1);
  n_minus_1 = jit_insn_sub(function, n, constant);
  jit_insn_store(function, n, n_minus_1);

  /* "} while (n > 2);" */
  constant = jit_value_create_nint_constant(function, jit_type_uint, 2);
  compare = jit_insn_gt(function, n, constant);
  jit_insn_branch_if(function, compare, &label2);

  /* "return c;" */
  jit_insn_return(function, c);

  if (jit_function_compile(function) == 0) {
    printf("compilation error occurred\n");
  }
  jit_context_build_end(*context);
  return function;
}

libjit compiles this function to native opcodes. Running on x86, the resulting assembly looks a great deal like the series of jit_insn_* calls which generated it.

   push   %rbp
   mov    %rsp,%rbp
   sub    $0x20,%rsp
   mov    %r12,(%rsp)
   mov    %r13,0x8(%rsp)
   mov    %r14,0x10(%rsp)
   mov    %r15,0x18(%rsp)
   mov    %rdi,%r15
   cmp    $0x2,%edi
   ja     A
   mov    $0x1,%eax
   jmpq   C
A: mov    $0x1,%r14d
   mov    $0x1,%r13d
B: mov    %r14,%r12
   add    %r13,%r12
   mov    %r14,%r13
   mov    %r12,%r14
   dec    %r15d
   cmp    $0x2,%r15d
   ja     B
   mov    %r12,%rax
C: mov    (%rsp),%r12
   mov    0x8(%rsp),%r13
   mov    0x10(%rsp),%r14
   mov    0x18(%rsp),%r15
   mov    %rbp,%rsp
   pop    %rbp
   retq   

How does it perform compared to the C version? I ran each in a loop of 1,000,000 iterations computing fib(75). The function was run once outside the timing loop, to avoid cache miss effects. The C code was compiled with -O0 and -O2, which made a huge difference in the results. Invoking gcc -O3 resulted in slower code due to overly aggressive loop unrolling. Similarly, libjit can set an optimization level via jit_function_set_optimization_level(), though it made no measurable difference in the results.

C -O0: 367.2 nsecs/iteration
libjit: 344.8 nsecs/iteration
C -O2: 110.1 nsecs/iteration
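For reference, a timing loop of the kind described above could look like the following sketch. It is a reconstruction, not the exact harness used for these numbers: the use of the standard C clock() timer, the volatile variables, and the name nsecs_per_call() are all assumptions.

```c
#include <time.h>

unsigned long long fib(unsigned int n) {
    if (n <= 2) return 1;
    unsigned long long a = 1, b = 1, c;
    do {
        c = a + b;
        b = a;
        a = c;
        n--;
    } while (n > 2);
    return c;
}

/* Average CPU time per call to fib(n) in nanoseconds, as measured
 * with clock(). */
double nsecs_per_call(unsigned int n, long iterations) {
    volatile unsigned int input = n;    /* reread on every call, so the
                                           compiler cannot hoist fib() */
    volatile unsigned long long sink;   /* keeps the result from being
                                           discarded as unused */

    sink = fib(input);                  /* warm-up call outside the timed loop */
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        sink = fib(input);
    }
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iterations;
}
```

Calling nsecs_per_call(75, 1000000) matches the setup described above. The resolution of clock() is coarse, so the loop needs enough iterations for the total to be well above one clock tick.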

In this example libjit achieved the speed of a naive C compilation. This makes sense: it is the result of a naive programmer translating C code.

The point of this series of articles is the promise, not necessarily the current reality. The great promise of JIT compilation as compared to static is the ability to make optimizations based on profiling, even specializing routines for specific input values. The routine can sanity check its inputs, and escape back into the interpreter if the input does not match expectations. This would be of great benefit in large code bases, where we frequently have variables which are essentially constant. For example in a web application the language setting is a variable, though any individual user essentially never changes their language setting in the middle of a session. We'll explore this idea more in the future.
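To give a flavor of what such a specialized routine looks like, here is a hand-written sketch in C using the language setting example. A real JIT would generate the specialized version at runtime from profiling data; here both versions are simply written out, and every name is hypothetical.

```c
#include <string.h>

/* Generic path: handles any language code, at the cost of a lookup
 * on every call. */
const char *greeting_generic(const char *lang) {
    if (strcmp(lang, "en") == 0) return "Hello";
    if (strcmp(lang, "fr") == 0) return "Bonjour";
    if (strcmp(lang, "ja") == 0) return "Konnichiwa";
    return "Hello";
}

/* Specialized path, of the kind a JIT might produce after observing
 * that lang has been "en" on every call so far. */
const char *greeting_specialized_en(const char *lang) {
    /* Guard: check the assumption the specialization was built on. */
    if (lang[0] != 'e' || lang[1] != 'n' || lang[2] != '\0') {
        return greeting_generic(lang);   /* escape back to the slow path */
    }
    return "Hello";                      /* the lookup has become a constant */
}
```

The guard is cheap compared to the work it replaces, and as long as the language really is essentially constant it almost never fails. When it does fail, the result is still correct, just slower.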

Thursday, July 8, 2010

Virtual Instruction Sets: Opcode Arguments

It's a virtual CPU fan, get it?

Virtual Machine architectures are a fascinating topic, and one that I plan to occasionally explore in this space. Not virtual machines in the sense of VMware or Xen, rather the runtime environment for a programming language like Java or Python. This time we'll focus on the structure of the instruction set, in particular on how operands are passed and stored. Why are these low level details important?

  • Traditional compilers emit instruction sequences without knowing anything about the specific CPU model, system configuration, or input data to be processed.
  • The compiler can optimize for a specific CPU pipeline, and maybe even produce multiple binaries for different CPUs. As a practical matter you cannot produce a large number of variations due to the sheer size of the final binary image.
  • Profile-driven compilation can optimize for representative data you supply during the build phase, but representative data is always a guess and a compromise. Also as a practical matter, it's difficult to use profile-driven optimization for many applications, such as GUIs.
  • Only a JIT for a virtual machine has the luxury of knowing the specific CPU and system configuration, and of having profiling information from the current input data.

The hardware CPU architectures we use now have evolved in lockstep with compiler technology, and mostly C/C++ compilers at that. They have enormous I$ and D$ because the compiler cannot predict very much about what it will execute next. The hardware has extensive branch prediction logic and history tracking because compilers emit an average of one branch every 7 instructions.

Virtual Machines change everything: by profiling the running code they can produce instructions for this specific workload, resulting in long sequences of very predictable opcodes without branches or conditionals. It has the potential to change hardware architectures, once we pass a tipping point where most of the workload runs within a VM. I suspect this tipping point will be reached in mobile devices well before it impacts workstations, laptops, or servers.

This rosy prediction is by no means certain. The JIT for most current VMs will compile a function the first time it is used. They can optimize for the CPU and possibly even take memory size into account, but they don't use any profiling information. Thus the JIT can potentially get the benefit of compiling for the specific CPU pipeline on which it runs, though in practice even this isn't typically done. So far as I know, of the VMs discussed here only Mozilla's SpiderMonkey makes use of tracing to produce specifically optimized routines according to the input data being processed.

We're going to examine seven virtual machines, focussing on how operands are passed: the JVM, CLR, SpiderMonkey, LLVM, Parrot, V8, and Dalvik.


 
JVM & CLR

JVM argument stack

The Java Virtual Machine and the Common Language Runtime used by .NET are certainly very different, but as virtual machines go they have a lot in common. Both are stack based: operands to an instruction are popped from the stack, and the result is pushed.

Stack based virtual machines are relatively common, because they are conceptually very simple. Indeed many early microprocessors and microcontrollers were stack based, because the silicon technology of the day wouldn't allow a CPU with a generous number of registers on the die. In that sense virtual CPUs are following the same evolutionary path as hardware CPUs did several decades ago, starting with stack based machines and adding registers later.

Stack-based instruction sets tend to have a very high code density, because their opcodes don't need to encode source and destination register numbers. When the JVM was developed in the early 1990s, processor caches were measured in the tens of kilobytes. A densely packed bytecode was a big advantage: far more bytecode could be stored in the hardware CPU's data cache.
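A toy stack machine shows both properties: how operands move through the stack, and why the encoding is so compact. This is a sketch only. The opcode names and encoding are invented for illustration and are not JVM or CLR bytecode.

```c
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };   /* hypothetical opcodes */

/* Run a bytecode program and return the value left on top of the
 * stack. OP_PUSH is followed by a one-byte immediate. The arithmetic
 * opcodes find their operands on the stack, so they need no operand
 * bytes at all. There is no bounds checking, for brevity. */
int32_t run_stack_vm(const uint8_t *code) {
    int32_t stack[64];
    int sp = 0;                       /* number of values on the stack */

    for (;;) {
        switch (*code++) {
        case OP_PUSH:
            stack[sp++] = *code++;
            break;
        case OP_ADD: {
            int32_t rhs = stack[--sp];
            int32_t lhs = stack[--sp];
            stack[sp++] = lhs + rhs;
            break;
        }
        case OP_MUL: {
            int32_t rhs = stack[--sp];
            int32_t lhs = stack[--sp];
            stack[sp++] = lhs * rhs;
            break;
        }
        case OP_HALT:
        default:
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```

The expression (2 + 3) * 4 encodes as the nine bytes { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT }. The add and multiply take a single byte each, where a register machine would also have to name two sources and a destination.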


 
SpiderMonkey (Firefox)

SpiderMonkey is the JavaScript engine in Firefox, and is a stack-based machine like the JVM and CLR. What I find most interesting about SpiderMonkey is that it tackled profile-driven JIT optimization first, via TraceMonkey in the latter part of 2008. A more conventional method-compiling JIT came later, via JaegerMonkey in early 2010. The virtue of doing things in this order is pretty compelling: tracing, when it works, can deliver spectacular gains. However, tracing really only helps with loops, leaving lots of low hanging fruit for a method-based JIT. Doing the method-based JIT first makes it more difficult to get the profiling information which tracing needs. By doing TraceMonkey first, its instrumentation needs became part of the requirements for JaegerMonkey.


 
LLVM

The primary design point of the LLVM project is a compiler toolchain, and the LLVM instruction set was designed to be the intermediate representation between the language-specific frontend and more generic backend. The LLVM instruction set defines a register based virtual machine with an interesting twist: it has an infinite number of registers. In keeping with its design point as a compiler intermediate representation, LLVM registers enable static single assignment form. A register is used for exactly one value and never reassigned, making it easy for subsequent processing to determine whether values are live or can be eliminated.


 
Parrot

Parrot is also a register based virtual machine. It defines four types of registers:

  1. Integers
  2. Numbers (i.e. floating point)
  3. Strings
  4. Polymorphic Containers (PMCs), which reference complex types and structures

Like LLVM, Parrot does not define a maximum number of registers: each function uses as many registers as it needs. Functions do not re-use registers for different purposes by storing their values to memory; they specify a new register number instead. The Parrot runtime will handle assignment of virtual machine registers to CPU registers.

So far as I can tell, integer registers are the width of the host CPU on which the VM is running. A Parrot bytecode might find itself using either 32 or 64 bit integer registers, determined at runtime and not compile time. This is fascinating if correct, though it seems like BigNum handling would be somewhat complicated by this.


 
V8 (Chrome)

V8 is the JavaScript engine in the Chrome browser from Google. It's a bit of a misnomer to call V8 a virtual machine: it compiles the JavaScript source for a method directly to machine code the first time it is executed. There is no intermediate bytecode, and no interpreter. This is an interesting design choice, but for the purposes of this article there isn't much to say about V8.


 
Dalvik (Android)

Dalvik virtual machine registers

Dalvik is the virtual machine for Android application code. The Dalvik instruction set implements an interesting compromise: it is register based, but there are a finite number of them as opposed to the theoretically infinite registers of LLVM or Parrot. Dalvik supports 65,536 registers, a vast number compared to hardware CPUs and presumably sufficient to implement SSA (if desired) in reasonably large functions.

Even more interestingly, not all Dalvik instructions can access all registers. Many Dalvik instructions dedicate 4 bits to the register number, requiring their operands to be stored in the first 16 registers. A few more instructions have an 8 bit register number, to access the first 256. There are also instructions to copy the value to or from any of the 65,536 registers to a low register, for a subsequent instruction to access.
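The consequence of those field widths can be sketched in C. The helper names are hypothetical, and the packing layout (opcode in the low byte, two 4 bit register fields above it) is meant as an illustration of the idea rather than a reference for the real instruction formats.

```c
#include <stdint.h>

/* Smallest register field, in bits, able to name register number reg. */
int register_field_bits(uint32_t reg) {
    if (reg < 16) return 4;
    if (reg < 256) return 8;
    if (reg < 65536) return 16;
    return -1;                        /* not addressable at all */
}

/* Pack an 8 bit opcode and two 4 bit register numbers into a single
 * 16 bit code unit (illustrative layout). Returns 0 and leaves *unit
 * untouched if either register does not fit in 4 bits, in which case
 * the value would first have to be copied into a low register. */
int pack_two_small_regs(uint8_t opcode, uint32_t reg_a, uint32_t reg_b,
                        uint16_t *unit) {
    if (register_field_bits(reg_a) != 4 || register_field_bits(reg_b) != 4) {
        return 0;
    }
    *unit = (uint16_t)(opcode | (reg_a << 8) | (reg_b << 12));
    return 1;
}
```

An instruction touching registers 3 and 5 fits in one 16 bit unit. The same instruction applied to register 257 cannot be encoded this way, so the compiler has to emit an extra move to bring the value down into one of the first 16 registers.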

It took a while to understand the rationale for this choice, and I'm still not confident I fully get it. Clearly the Dalvik designers believe that keeping data in one of the high registers will be faster than explicitly storing it to memory, even if the vast number of registers end up mostly residing in RAM. Addressing data as register numbers instead of memory addresses should make it easier for the VM to dynamically remap Dalvik registers to the real hardware registers. For example, if it can predict that virtual register 257 will likely be used in the near future it can be kept in a CPU register instead of being immediately stored to memory.


 
Other VMs

There are many, many more virtual machine implementations beyond the ones discussed here. The Python, Smalltalk, and Lua programming languages each have their own VM instruction set and implementation. Erlang started with a VM called JAM, and later reimplemented the underpinnings in a new virtual machine called BEAM. Adobe Flash has a VM which has been open sourced and donated to the Mozilla project as Tamarin. Wikipedia lists brief descriptions of a number of current VMs.

Tuesday, July 6, 2010

The New Intelligence Agency: All of Us

Yesterday Louis Gray pieced together vague snippets of information from tweets made by the founders and investors of Foursquare and Brizzly to speculate that Foursquare was negotiating to buy Brizzly. Later that day the speculation was denied by all parties involved. To (loosely) quote Mandy Rice-Davies: "They would deny it, wouldn't they?"   ... I don't think this story is over.

Nonetheless the details of the speculated transaction are not our topic today. Instead, I'd like to consider the process which led to it. For decades government (and sometimes corporate) intelligence operations have had access to reams of communications data from which to make inferences. They could see who was calling whom, where letters and packages were being delivered, and know people's movements to some extent via airline manifests. Intelligence agencies are famous for collecting massive amounts of information and using algorithms to look for patterns, to be followed up by a human analyst.

We're rapidly moving into a world where a significant amount of that information is available to anyone who cares to look for it. We're using social networks which broadcast our updates publicly, either deliberately or because we don't understand the privacy settings. We're rapidly integrating location data into online applications, which people willingly share if they see a benefit from it. As the tweets Louis quoted show, people also love making coy hints about their dealings, secure in the knowledge that nobody will figure out such a vague hint. Yet given enough vague data, particularly if one is aware of existing connections between the participants, correlations can be found. Certainly there will be false positives, but there will also be some real gems.

Systematic data mining of social networks, both their contents and the metadata they contain, in order to gain competitive advantage has enormous implications. It apparently is already happening: military raids have been cancelled due to leaks on social networks, showing that government agencies are concerned about the possibility. For the most part it won't be reported on, and will become just another part of the Internet underpinnings.

Friday, May 14, 2010

Uncanny Friending

There is an urban legend that Eskimos have many different words for snow. The truth is the Aleut languages have about as many words for snow as does English, but allow descriptive suffixes to be attached to any word to form countless variations.

Consider the English words we use to describe human relationships, and the distinctions they convey in meaning:

sister               stepsister      half sister
significant other    fiancée         spouse
friend               just friends    friend with benefits
peer                 coworker        colleague
mother               stepmother      godmother

We use adjectives to add huge amounts of information in a single word. "fiancée" conveys one meaning, that of a beloved person. "current fiancée" conveys an entirely different meaning, a disposable relationship given a label for convenience.

Now consider the words we use to describe relationships in social networks:

friend               friend          friend
friend               friend          friend
friend               friend          friend
friend               friend          friend
friend               friend          friend

Why do we find this unsatisfying? I believe it is a corollary to the Uncanny Valley effect in robotics and computer games: "friend" is close enough to the real description of the human relationship that we find it unsettling. If the term were more inhuman, less shaded with meaning, it would not be so maddening.

The term "like" has a similar problem: who wants to like something unpleasant or unsavory? Clicking "like" is meant to express interest, but the terminology is close enough to the real intention to be maddeningly imprecise.

I also suspect this vaguely unsettling feeling will resolve itself in a few more years online: the words friend and like will simply lose all meaning. We'll know this has been achieved when people stop using air quotes to distinguish online friending versus real life friends.


The genesis of this musing came via an insightful tweet by Marshall Kirkpatrick:


told my wife that google "results from your social circle" showed me because we are friends. she insists we are more than that. true :)
Marshall Kirkpatrick (marshallk), via TweetDeck

Wednesday, May 12, 2010

Death of Copper Predicted. Film at 11.

copper RJ45 and fibers held in a hand

Every handful of years we ratchet up the Ethernet link speed: from 10 Mbps to 100 Mbps in the mid 1990s, to 1 Gbps in the late 1990s, to 10 Gbps in the early part of this century. 40 Gbps is the next target. At the 1 Gbps and 10 Gbps transitions naysayers maintained that copper cables would never be able to meet the required signaling rates and that optical would prevail. The same doubt is now being voiced about 40 Gbps.

During the 1 Gbps and 10 Gig transitions, optical media became available several years before copper, and then the initial 10 Gig copper specs were limited to patch cable distances of 10-15 meters. 40G will repeat the story with optical products already available, substantially before copper. Nonetheless I'd wager 40G copper transceivers will eventually appear in some form.

Yet this time, optical will win. Not because of the technology or limitations of copper wire, but because of economics. Economics used to be in copper's favor: simple install and no expensive lasers. Copper could ride the silicon technology curve, throwing ever more DSP power at the problem. Times have changed: cat6a and cat7 cabling is as difficult and expensive to install as fiber, and solid state laser components allow optical transports to ride the silicon technology curve.

  • Like fiber, cat6/7 cables have a minimum bending radius. Pull too tight and the cable can no longer handle long distances.
  • Like fiber, cat7 does not tolerate being stretched. Stretch a 100m cable by a centimeter and its performance suffers.
  • Even padded cable staples put too much pressure on the cable. cat7 must run in a tray or conduit, and the bulky shielding means fewer of them will fit.
  • cat7 cables are very sensitive to connectorization. The crimp tool you used for cat5e won't do.

The other problem with copper cables is that they are made of copper, an actively traded commodity. The chart below shows the raw material cost of copper over the last century, normalized to the US Dollar in 1998. During much of the late 1990s and early 2000s copper was cheap by historic standards. In the last few years the commodity price has trended back up due to demand, without a matching increase in new supply. If there is a natural ceiling for copper pricing where the market will seek alternatives, we do not appear to have hit it yet.

Price of copper since 1900 in 1998 dollars

(data source: US Geological Survey)

I'm not predicting that 40 Gig copper transceivers will be impossible. On the contrary, I suspect there will be two solutions brought to market: a very short reach spec using RJ45 patch cables, and a 100m spec which imposes more painful requirements like cat7a/cat8, use of multiple cables, and electrically better connectors (presumably also manufactured, not connectorized on site). These products will eventually appear, substantially lagging optical product availability.

I simply suspect that the economics no longer work in copper's favor: patch cables from one side of the rack to the far corner will be long enough to have to worry about install quality. If the pressure from zip-ties fastening the cable to the rack threatens the operation of your network, you're better off using fiber.

The genesis of this post came as a comment on Stephen Foskett's Pack Rat blog. It is an excellent resource, highly recommended.

Wednesday, March 24, 2010

Player Piano Torpedoes

March 24, 2010 is Ada Lovelace day, an informal holiday to celebrate the achievements of women in technology and science. I'd like to share a fascinating technology story about Hedy Lamarr. Ms Lamarr was a contract star at MGM during the Golden Age of Hollywood, in the 1930s and 40s. She was also a creative and mathematically talented inventor. Today, we would proudly call her a geek.

From US Patent 2,292,387, by Hedy Kiesler Markey and George Antheil:

"This invention relates broadly to secret communication systems involving the use of carrier waves of different frequencies, and is especially useful in the remote control of dirigible craft, such as torpedoes.

Our system... employs a pair of synchronous records, one at the transmitting station and one at the receiving station, which change the tuning of the transmitting and receiving apparatus from time to time..."

Two signals are sent, labelled L and R and controlling the left and right rudders of the torpedo. L is indicated by sending a 100 Hz signal over a carrier, R by 500 Hz. Remotely controlled torpedoes had been used before the 1940s, but were often jammed by the target because the control frequency was relatively easy to detect. The innovation in this patent is the use of perforated rolls of paper to modulate the frequency rapidly enough that the enemy would not be able to predict it, making jamming difficult. The perforated rolls of paper were commonly used in player pianos of the time, requiring no special development.

In the patent application seven rows of perforations were used to control the frequency of the carrier. An eighth row of perforations lights a small lamp at the transmitting station. Three of the seven transmission frequencies were dummies which would not actually be received by the torpedo, while the lamp informed the torpedo operator when the weapon was out of contact. The intent of the dummy frequencies appears to be to mislead the enemy and make it more difficult to determine how the control system worked. Some seemingly valid transmissions would not be acted upon by the torpedo, while others would.

Player piano tape
Rows A-G tune the radio to one of 7 frequencies.
Row H controls a lamp for the operator when the dummy frequencies A-C are in use.

For the transmitter and receiver to frequency hop in sync, the tape reels must begin rolling at very close to the same time and the speed of the winding must have a reasonably tight tolerance. Machined springs available in the 1930s were sufficiently precise to maintain this for several minutes, long enough to guide a torpedo to its target.
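The hopping scheme can be sketched in a few lines of Python. This is only a toy model, not the mechanism in the patent: the "roll" is a pseudo-random list of channel letters, the receiver simply ignores the dummy rows, and the names `punch_roll`, `lamp` and `receive` are made up for illustration.

```python
import random

CHANNELS = "ABCDEFG"      # seven carrier frequencies, one per row of perforations
DUMMIES = set("ABC")      # rows A-C: dummy channels the torpedo never acts on

def punch_roll(length, seed):
    """Stand-in for a perforated roll: one channel letter per time step."""
    rng = random.Random(seed)
    return [rng.choice(CHANNELS) for _ in range(length)]

def lamp(roll):
    """Row H: lit whenever a dummy channel is in use, so the operator knows."""
    return [ch in DUMMIES for ch in roll]

def receive(tx_roll, rx_roll, commands, jammed=frozenset()):
    """A command gets through only when both rolls select the same real,
    unjammed channel at that step; otherwise the receiver hears nothing."""
    heard = []
    for tx_ch, rx_ch, cmd in zip(tx_roll, rx_roll, commands):
        ok = tx_ch == rx_ch and tx_ch not in DUMMIES and tx_ch not in jammed
        heard.append(cmd if ok else None)
    return heard

roll = punch_roll(20, seed=1942)
commands = list("LRLLRRLRLLRLRRLLRLRL")       # rudder commands, one per step

in_sync = receive(roll, roll, commands)
out_of_sync = receive(roll, roll[3:] + roll[:3], commands)
jammed_d = receive(roll, roll, commands, jammed={"D"})

print("in sync:    ", sum(c is not None for c in in_sync), "commands heard")
print("out of sync:", sum(c is not None for c in out_of_sync), "commands heard")
print("jammer on D:", sum(c is not None for c in jammed_d), "commands heard")
```

A jammer parked on a single frequency only blocks the steps which happen to use that frequency, while a receiver whose roll has drifted out of step loses most of the commands. That is why the start time and winding speed matter so much.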

All in all it's a fascinating invention which repurposed existing technology for a new use, in fighting the Pacific War. Unfortunately the rest of the story is not a happy one, as the invention was not taken seriously by the War Department. By the time the communication industry reinvented spread spectrum communications in the 1950s, this patent had expired.

In 1997 the EFF recognized Ms Lamarr and Mr Antheil's achievement with a Pioneer award.

Tuesday, November 10, 2009

Ethernet Integrity, or the Lack Thereof

Have you heard any variation of this claim?

We don't need our own integrity check. The TCP checksum is pretty weak, but Ethernet uses a ludicrously strong CRC. Even if you don't trust the TCP checksum, Ethernet will detect any errors.

Let's dig into this a bit, shall we?

Ethernet switch diagram

A modern switch fabric chip is designed for both L2 ethernet switching and L3 IP routing. The additional logic for IP routing adds relatively little area in modern silicon technologies, while not having a routing capability would put them at a competitive disadvantage. Essentially all ethernet fabric chips, even those inside relatively cheap L2 switches, have the design features to route IPv4 traffic to at least a basic degree.

When a packet arrives at the input port (A) its CRC will be checked and the packet discarded if corrupt. If the packet is destined to the router's MAC address, its destination IP address will be looked up for L3 routing (C). An L3 router modifies the packet as part of its function, by decrementing the IP TTL and replacing the L2 destination with that of the next hop. Therefore a fresh CRC has to be regenerated at egress.

Even if the packet is to be switched at L2 (B), there are cases where the packet is modified. For example server machines and switch uplinks often handle multiple vlans, so their ports will be configured for tagging (D). Addition of the vlan tag requires the packet CRC to be recalculated on egress (E).
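To make the tagging case concrete, here is a minimal Python sketch which inserts an 802.1Q tag into a frame and recomputes the CRC. It assumes `zlib.crc32` as a stand-in for the Ethernet FCS (it uses the same CRC-32 polynomial, though bit ordering on the wire is ignored here), and the MAC addresses, payload and helper names are invented for the example.

```python
import struct
import zlib

def fcs(frame_bytes):
    """CRC-32 over the frame contents, standing in for the Ethernet FCS."""
    return zlib.crc32(frame_bytes) & 0xFFFFFFFF

def add_vlan_tag(frame, vlan_id, priority=0):
    """Insert an 802.1Q tag (TPID 0x8100 plus TCI) after the two 6-byte MACs."""
    tci = (priority << 13) | (vlan_id & 0x0FFF)
    return frame[:12] + struct.pack("!HH", 0x8100, tci) + frame[12:]

dst = bytes.fromhex("ffffffffffff")            # made-up addresses
src = bytes.fromhex("020000000001")
ethertype = struct.pack("!H", 0x0800)          # IPv4
payload = b"hello, world".ljust(46, b"\x00")   # pad to the minimum payload size

untagged = dst + src + ethertype + payload
tagged = add_vlan_tag(untagged, vlan_id=100)

# Four bytes were inserted in the middle of the frame, so the CRC computed
# at ingress no longer describes what leaves the egress port.
print(f"ingress CRC {fcs(untagged):08x}, egress CRC {fcs(tagged):08x}")
```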

Vlan tagging

The point of this description? There are numerous cases at both L2 and L3 where a packet CRC cannot be preserved through the switch and will need to be regenerated at egress. ASIC designers hate special cases, as they add logic and test cases to the design. Because there are cases where the CRC must be regenerated, modern switch fabrics always regenerate the CRC at egress. Even if the packet has not been modified, even if the ingress CRC could have been preserved, it is discarded at ingress and regenerated at egress.

5 port ethernet switch

It bears repeating that this is a function of the chip, not the specific product. Even the tiny ethernet switches sold for practically nothing at retail use chips which contain basic vlan tagging and IP routing features (even if that product doesn't use them), and regenerate the CRC on every packet. The fabric chip they use wasn't specifically designed for such low cost switches, as there is not enough profit to justify the effort. In addition to simple L2 switches those chips can be used to build NAT appliances, as the ethernet fan-out for small wireless access points, in DSL and cable routers, for low end WAN routers, etc. When only basic L2 switching is desired these fabrics can function completely standalone without a management CPU, reducing BOM cost to the bare minimum. Addition of a CPU allows the basic L3 functions to be used in the more featureful (but still low end) products.


 
Impact

What does this mean? The internal memories and logic paths within the switch are not covered by the ethernet CRC, so it does not provide end-to-end protection. The switch might implement ECC over the whole path, but this is not common. The packet buffers are generally large enough to justify ECC, but miscellaneous FIFOs are more likely to have simple parity and logic elements often have no protection at all. It only takes one soft error to corrupt the packet contents, and then a fresh CRC will be calculated over the corrupted data.

CRC protects the wire

If you care about the data you send over the network, you should include your own integrity check at the application level. This is another good argument for using SSL: not only do you protect privacy by encrypting the data, you also get a strong end-to-end integrity check.
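The failure mode, and what an application level check buys you, can be simulated in Python. This is a sketch under stated assumptions: `zlib.crc32` stands in for the per-link Ethernet CRC, the "switch" is a function which checks the CRC at ingress and regenerates it at egress, and a bare SHA-256 digest plays the role of the end-to-end check. All function names are hypothetical.

```python
import hashlib
import zlib

def wire_frame(data):
    """Append a CRC-32, standing in for the Ethernet FCS on one link."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def link_check(frame):
    """Return the data if the per-link CRC matches, else None (frame dropped)."""
    data, crc = frame[:-4], frame[-4:]
    return data if zlib.crc32(data).to_bytes(4, "big") == crc else None

def switch_forward(frame, flip_bit=None):
    """Check the CRC at ingress, optionally suffer a soft error while the
    packet sits in a buffer, then regenerate the CRC at egress."""
    data = link_check(frame)
    if data is None:
        return None
    buf = bytearray(data)
    if flip_bit is not None:
        buf[flip_bit // 8] ^= 1 << (flip_bit % 8)
    return wire_frame(bytes(buf))

def seal(message):
    """Application level check: prefix the message with its SHA-256 digest."""
    return hashlib.sha256(message).digest() + message

def unseal(blob):
    digest, message = blob[:32], blob[32:]
    if hashlib.sha256(message).digest() != digest:
        raise ValueError("end-to-end integrity check failed")
    return message

message = b"transfer $100 to account 42"
blob = seal(message)

# A soft error inside the switch: the egress CRC is computed over bad data.
arrived = link_check(switch_forward(wire_frame(blob), flip_bit=300))
print("link CRC accepted the frame:", arrived is not None)
try:
    unseal(arrived)
except ValueError as err:
    print("application caught it:", err)
```

A plain digest like this only guards against accidental corruption. SSL goes further by using a keyed check, which also resists deliberate tampering.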

Wednesday, October 14, 2009

AMD IOMMU: Missed Opportunity?

In 2007 AMD implemented an I/O MMU in their system architecture, which translates DMA addresses from peripheral devices to a different address on the system bus. There were several motivations for doing this:

    Direct Virtual Memory Access
  1. Virtualization: DMA can be restricted to memory belonging to a single VM and to use the addresses from that VM, making it safe for a driver in that VM to take direct control of the device. This appears to be the largest motivation for adding the IOMMU.
  2. High Memory support: For I/O buses using 32 bit addressing, system memory above the 4GB mark is inaccessible. This has typically been handled using bounce buffers, where the hardware DMAs into low memory which the software will then copy to its destination. An IOMMU allows devices to directly access any memory in the system, avoiding copies. There are a large number of PCI and PCI-X devices limited to 32 bit DMA addresses. Amazingly, a fair number of PCI Express devices are also limited to 32 bit addressing, probably because they repackage an older PCI design with a new interface.
  3. Enable user space drivers: A user space application has no knowledge of physical addresses, making it impossible to program a DMA device directly. The I/O MMU can remap the DMA addresses to match the virtual addresses of the user process, allowing direct control of the device. Only interrupts would still require kernel involvement.

 
I/O Device Latency

Multiple levels of bus bridging

PCIe has a very high link bandwidth, making it easy to forget that its position in the system imposes several levels of bridging with correspondingly long latency to get to memory. The PCIe transaction first traverses the Northbridge and any internal switching or bus bridging it contains, on its way to the processor interconnect. The interconnect is HyperTransport for AMD CPUs, and QuickPath for Intel. Depending on the platform, the transaction might have to travel through multiple CPUs before it reaches its destination memory controller, where it can finally access its data. A PCIe Read transaction must then wend its way back through the same path to return the requested data.

Graphics device on CPU bus

Much lower latency comes from sitting directly on the processor bus, and there have been systems where I/O devices sit directly beside CPUs. However CPU architectures rev that bus more often than it is practical to redesign a horde of peripherals. Attempts to place I/O devices on the CPU bus generally result in a requirement to maintain the "old" CPU bus as an I/O interface on the side of the next system chipset, to retain the expensive peripherals of the previous generation.


 
The Missed Opportunity: DMA Read pipelining

An IOMMU is not a new concept. Sun SPARC, some SGI MIPS systems, and Intel's Itanium all employ them. Once you have taken the plunge to impose an address lookup between a DMA device and the rest of the system, there are other interesting things you can do in addition to remapping. For example, you can allow the mapping to specify additional attributes for the memory region. Knowing whether the device is likely to do long, contiguous bursts or short, concise updates allows the chipset to reduce latency by reading ahead, and to transfer data faster by pipelining.

Without prefetch (CONSISTENT) With prefetch (STREAMING)
DMA Read with no prefetch DMA Read with prefetch

AMD's IOMMU includes nothing like this. Presumably they wanted to confine the software changes to the Hypervisor alone, whilst choosing STREAMING versus CONSISTENT requires support in the driver of the device initiating DMA, but they could have ensured software compatibility by making CONSISTENT be the default with STREAMING only used by drivers which choose to implement it.
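As a rough illustration of the idea, here is a toy IOMMU in Python whose mappings carry a CONSISTENT or STREAMING attribute, plus a deliberately crude timing model. None of this reflects AMD's hardware or any real driver API: the class and function names are invented, and the model assumes prefetch can hide every memory round trip after the first.

```python
from dataclasses import dataclass
from enum import Enum

PAGE_SIZE = 4096

class Attr(Enum):
    CONSISTENT = "consistent"   # fetch only what was asked for, one request at a time
    STREAMING = "streaming"     # device is expected to read long contiguous bursts

@dataclass
class Mapping:
    phys_page: int
    attr: Attr = Attr.CONSISTENT    # the default keeps unmodified drivers working

class ToyIommu:
    def __init__(self):
        self.table = {}             # I/O virtual page number -> Mapping

    def map(self, io_page, phys_page, attr=Attr.CONSISTENT):
        self.table[io_page] = Mapping(phys_page, attr)

    def translate(self, io_addr):
        """Return (physical address, attribute). An unmapped page raises
        KeyError here; real hardware would report a fault instead."""
        m = self.table[io_addr // PAGE_SIZE]
        return m.phys_page * PAGE_SIZE + io_addr % PAGE_SIZE, m.attr

def read_time(n_lines, latency, transfer, attr):
    """Time to read n cache lines, in arbitrary units."""
    if attr is Attr.STREAMING:
        # Read-ahead overlaps later fetches with earlier transfers,
        # so the round-trip latency is paid only once.
        return latency + n_lines * transfer
    return n_lines * (latency + transfer)

iommu = ToyIommu()
iommu.map(io_page=2, phys_page=7, attr=Attr.STREAMING)
phys, attr = iommu.translate(2 * PAGE_SIZE + 16)

# Made-up numbers: 100 units of round-trip latency, 10 units per line.
print("no prefetch:", read_time(8, 100, 10, Attr.CONSISTENT))
print("prefetch:   ", read_time(8, 100, 10, attr))
```

With these invented numbers the eight-line read takes 880 units without prefetch and 180 with it. The longer the path to memory, the more the attribute is worth.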


 
What About Writes?

The IOMMU in SPARC systems implemented additional support for DMA write operations. Writing less than a cache line is inefficient, as the I/O controller has to fetch the entire line from memory and merge the changes before writing it back. This was a problem for Sun, which had a largish number of existing SBus devices issuing 16 or 32 byte writes while the SPARC cache line had grown to 64 bytes. A STREAMING mapping relaxed the requirement for instantaneous consistency: if a burst wrote the first part of a cache line, the I/O controller was allowed to buffer it in hopes that subsequent DMA operations would fill in the rest of the line. This is an idea whose time has come... and gone. The PCI spec takes great care to emphasize cache line sized writes using Memory Write and Invalidate (MWI), an emphasis which carries over to PCIe as well. There is little reason now to design coalescing hardware to optimize sub-cacheline writes.

Without buffering (CONSISTENT) With buffering (STREAMING)
DMA Write with no prefetch DMA Write with prefetch

 
Closing Disclaimer

Maybe I'm way off base in lamenting the lack of DMA read pipelining. Maybe all relevant PCIe devices always issue Memory Read Multiple requests for huge chunks of data, and the chipset already pipelines data fetch during such large transactions. Maybe. I doubt it, but maybe...

Thursday, October 1, 2009

Yield to Your Multicore Overlords

ASIC design is all about juggling multiple competing requirements. You want to make the chip competitive by increasing its capabilities, by reducing its price, or both. Today we'll focus on the second half of that tradeoff, reducing the price.

Chip fabrication is a statistical game: of the parts coming off the fab, some percentage simply do not work. The vendor runs test vectors against the chips, and throws away the ones which fail. The fraction which pass is called the yield, and it is the primary factor determining the cost of the chip. If the yield is bad, so that you have to fab a whole bunch of chips to get one that actually works, you have to charge more for that one working chip.

To illustrate why most chips coming out of the fab do not work, I'd like to walk through part of manufacturing a chip. This information is from 1995 or so, when I was last seriously involved in a chip design, and describes a 0.8 micron process. So it is completely old and busted, but is sufficient for our purposes here.

Begin by placing the silicon wafer in a nitrogen atmosphere. You deposit a photo-resist on the wafer, basically a goo which hardens when exposed to ultraviolet light. You place a shadow mask in front of a light source; the regions exposed to light will harden while those under the shadow mask remain soft. You then chemically etch off the soft regions of the photo-resist, leaving exposed silicon where they were. The hardened regions of photo-resist stay put.

Next you heat the wafer to 400 degrees and pipe phosphorus into the nitrogen atmosphere. As the P atoms heat they begin moving faster and bouncing off the walls of the chamber. Some of them move fast enough that when they strike the surface of the wafer they break the Si crystal lattice and embed themselves in the silicon. If they strike the hardened photo-resist, they embed themselves in the resist; very, very few are moving fast enough to crash all the way through the photo-resist into the silicon underneath.

Electron Microscopy image of a silicon die

Next you use a different chemical process to strip off the hardened photo-resist. You are left with a wafer which has phosphorus embedded in the places you wanted. Now you heat the wafer even higher, hot enough that the silicon atoms can move around more freely; they move back into position and reform the crystal lattice, burying the phosphorus atoms embedded within. This is called annealing. Phosphorus is a donor, so now you have the n+ regions of the transistors.

You repeat this process with aluminum ions, an acceptor, to get the p- regions. Now you have transistors. Next you connect the transistors together with traces of aluminum (which I won't go into here). You cut the wafer to separate the die, and place each die in a package. You connect bonding wires from the edge of the die to the pins of the chip. And voila, you're done.


It should be apparent that this is a probabilistic process. Sometimes, based purely on random chance, not enough phosphorus atoms embed themselves into the silicon and your transistors don't turn on. Sometimes too much phosphorus embeds and your transistors won't turn off. Sometimes the Si lattice is too badly damaged and the annealing is ineffective. Sometimes the metal doesn't line up with the vias. Sometimes a dust particle lands on the chip and you deposit metal on top of the dust mote. Sometimes the bonding doesn't line up with the pads. Etc etc.

This is why the larger a chip grows, the more expensive it becomes. It's not because raw silicon wafers are particularly costly, it's that the probability of there being a defect somewhere grows ever greater as the die becomes larger. The bigger the chip, the lower the yield of functional parts.
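The relationship between die size and yield can be sketched with the simple Poisson yield model, which assumes defects land independently and that any defect kills the die. The defect density, wafer cost and wafer area below are made-up numbers chosen only to show the trend, and edge loss on the round wafer is ignored.

```python
import math

def poisson_yield(die_area_cm2, defects_per_cm2):
    """Probability that a die collects zero defects, if defects land
    independently at the given average density (Poisson model)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

def cost_per_good_die(wafer_cost, wafer_area_cm2, die_area_cm2, defects_per_cm2):
    """Spread the wafer cost over only the dies which work."""
    dies_per_wafer = wafer_area_cm2 / die_area_cm2      # ignores edge loss
    good_dies = dies_per_wafer * poisson_yield(die_area_cm2, defects_per_cm2)
    return wafer_cost / good_dies

DEFECTS = 0.5        # defects per square cm, an assumed figure
WAFER_COST = 1000.0  # arbitrary currency units, assumed
WAFER_AREA = 314.0   # roughly a 200 mm wafer

for area in (0.5, 1.0, 2.0, 4.0):
    y = poisson_yield(area, DEFECTS)
    c = cost_per_good_die(WAFER_COST, WAFER_AREA, area, DEFECTS)
    print(f"die {area:3.1f} cm^2: yield {y:5.1%}, cost per good die {c:7.2f}")
```

Doubling the die area does more than double the cost of a good die: you get half as many candidates per wafer, and a smaller fraction of them work.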

Intel Nehalem chip with SRAM highlighted

For at least 15 years that I know of, chip designs have improved their yield using redundancy. The earliest such efforts were done in on-chip memory: if your chip is supposed to include N banks of SRAM, put N+1 banks on the die connected with wires in the topmost layer which can be cut using a laser. Because the SRAM occupies a large percentage of the total chip area, statistically it is likely that defects will be within the SRAM. You can then cut out the defective bank, and turn a defective chip into one that can be used. More recent silicon processes use fuses blown by the test fixture instead of lasers.


 
Massively Multicore Processors

Yesterday NVidia announced Fermi, a beast of a chip with 512 Cuda GPU cores. They are arranged in 16 blocks of 32 cores each. At this kind of scale, I suspect it makes sense to include extra cores in the design to improve the yield. For example, perhaps each block actually has 33 cores in the silicon so that a defective core can be tolerated.

In order to avoid having weird performance variations in the product, the extra resources are generally inaccessible if not used for yield improvement. That is, even though the chip might have 528 cores physically present, no more than 512 could ever be used.
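A back-of-the-envelope calculation shows why one spare per block is attractive. The sketch below assumes each core independently works with probability 0.999, a number invented for illustration, and ignores defects anywhere outside the cores. The 33rd core per block is the speculation from above, not a known detail of Fermi.

```python
from math import comb

def block_ok(p_core, cores_present, cores_needed):
    """Probability that at least cores_needed of cores_present cores work,
    assuming each core fails independently (binomial model)."""
    return sum(
        comb(cores_present, k) * p_core**k * (1 - p_core) ** (cores_present - k)
        for k in range(cores_needed, cores_present + 1)
    )

def chip_ok(p_core, blocks, cores_present, cores_needed):
    """Every block must deliver its full complement of working cores."""
    return block_ok(p_core, cores_present, cores_needed) ** blocks

P_CORE = 0.999   # assumed probability that a single core is defect free

no_spares = chip_ok(P_CORE, blocks=16, cores_present=32, cores_needed=32)
one_spare = chip_ok(P_CORE, blocks=16, cores_present=33, cores_needed=32)
print(f"16 x 32 cores, no spares:       {no_spares:.1%} of chips usable")
print(f"16 x 33 cores, 1 spare a block: {one_spare:.1%} of chips usable")
```

Under these assumptions roughly 60% of chips are usable without spares and about 99% with them, for a 3% increase in core area.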

NVidia Fermi GPU

Tuesday, September 15, 2009

Soft Errors Are Hard Problems

"Soft Error" is a euphemism in the semiconductor industry for "the silicon did the wrong thing." Soft errors can occur when a circuit is infused with a sudden burst of energy from an external source, for example when it is hit by a high energy subatomic particle or by radiation.

Alpha particle strike - two protons plus two neutrons, emitted when a heavy radioactive element decays into a lighter element. Alpha particles are so large that the chip packaging will normally block them, so they are only a problem when something inside the package undergoes radioactive decay.

Cosmic ray strike - a high energy neutron (or other particle) originating from the Sun or from beyond the solar system. These particles are gradually absorbed by the Earth's atmosphere, so they are more of a problem in orbit and at high altitude. Cosmic rays can directly impact the silicon, or can hit a nearby atom and throw off neutrons which in turn cause a soft error.

Beta particle strike - an electron, emitted when a neutron is converted into a proton + electron + antineutrino. Beta particles rarely hold enough energy to affect current silicon technology; alpha and cosmic ray strikes are more of a problem.

A DRAM bit

Soft errors are usually discussed in the context of DRAM, where the problem was initially noticed. A DRAM cell consists of a capacitor to store the bit, with a transistor to gate access to it. A capacitor is an energy storage circuit: it stores charge. A particle strike on the capacitor will impart a large amount of energy, which can spontaneously change it to a 1 or, more rarely, overload its capacity such that the energy quickly escapes into the substrate and leaves the bit as a 0.

Some quick searching will turn up a few facts about soft errors:

  • Soft errors were first noted in the 1970s.
  • The primary cause was the use of slightly radioactive isotopes in the chip packaging, such as lead (Pb-212).
  • Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
  • Soft errors are now very rare and mostly caused by cosmic rays.

We'll come back to these later.


 
Beyond DRAM A 6 transistor SRAM bit

Soft errors are not confined to DRAM alone. Any circuit will glitch if hit by a sufficiently energetic particle - whether DRAM, SRAM, or a logic element. DRAM began to be affected by soft errors when the energy stored in the capacitor shrank to be on the order of the energy induced by an alpha particle. SRAM was not initially affected because its cells are actively driven via 6 transistors, whose energy level is considerably higher. Nonetheless as silicon feature sizes have shrunk it is now quite possible for SRAM to suffer a soft error. Soft errors in logic elements are somewhat less noticeable in that they will correct themselves on the next clock cycle, while an error in a storage element will persist until it is rewritten.

Intel Nehalem with SRAM highlighted

Modern CPUs include a great deal of SRAM on the die, comprising the caches, TLBs, reorder buffers, and numerous other uses. The image shown here is Intel's quad core Nehalem die, with the SRAM areas highlighted and logic deemphasized (both based on my best guesses). SRAM is a significant fraction of the die. Many, many other ASIC and CPU designs contain similar or higher fractions of SRAM.

What impact can a bitflip in the SRAM have? Consider that there is just one bit difference between the following two instruction opcodes. A bitflip can make the software come up with results which should be impossible.

ADD R1,1
ADD R1,32769
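The arithmetic is easy to check in Python. The sketch assumes a hypothetical encoding where the immediate operand occupies a 16 bit field, so the flipped bit is bit 15 of that field.

```python
def flip_bit(value, bit):
    """Invert a single bit, as a particle strike on one SRAM cell would."""
    return value ^ (1 << bit)

def bits_different(a, b):
    """Count the bit positions in which two values differ."""
    return bin(a ^ b).count("1")

immediate = 1
corrupted = flip_bit(immediate, 15)
print(f"{immediate:016b} -> {corrupted:016b} = {corrupted}")
print("bits changed:", bits_different(immediate, corrupted))
```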

Even more bizarrely a bitflip could change some random instruction into a memory reference, such as a load or store. As the register being dereferenced would likely not contain a valid pointer, the process would segfault for inexplicable reasons.

To prevent this problem Intel and all major CPU vendors protect their caches and other on-chip memories using Error Correcting Codes, but many ASIC designs do not. They might implement parity, but on-chip ASIC memories commonly have no error checking at all. A soft error will simply corrupt whatever was in the SRAM, which will only be noticed if it causes the ASIC to misbehave in some perceptible way.
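The difference between parity and ECC is easiest to see in code. Below is a sketch of the textbook Hamming(7,4) code, which protects four data bits with three check bits and can locate and repair any single flipped bit. It is shown only to illustrate the principle: real cache ECC uses wider codes, and nothing here is taken from any vendor's design.

```python
def parity(bits):
    """A single parity bit: it can tell you something flipped, but not what."""
    return sum(bits) % 2

def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]     # d[0] is the least significant bit
    code = [0] * 8                                # index 0 unused, so index == position
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]         # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]         # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]         # check bit covering positions 4,5,6,7
    return code[1:]

def hamming74_decode(bits):
    """Return (nibble, position corrected); the position is 0 if no error was seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos       # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1       # the syndrome is the position of the bad bit
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome

word = hamming74_encode(0b1011)
struck = list(word)
struck[4] ^= 1                    # soft error in position 5 (list index 4)

print("parity notices a change:", parity(word) != parity(struck))
print("ECC result:", hamming74_decode(struck))
```

Parity alone leaves the hardware with a choice between crashing and carrying on with bad data. The extra check bits are what let ECC silently repair the word instead.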

What happens if a particle strike causes the hardware to misbehave, or to get the wrong answer? Usually, we blame the software. It must be some weird bug.


 
Back To The Future

Returning to the earlier list of common facts about soft errors, let's focus on the last two.

  • Materials in chip packaging are now carefully screened to substantially eliminate radioactivity.
  • Soft errors are now very rare and mostly caused by cosmic rays.

There are many different materials used in a finished chip, beyond the silicon die and the gold wires connecting to its pins. The chip package is plastic or ceramic, which is composed of a host of different elements including boron. Solder bonds the wires to the pins. Solder used to be mostly lead, later tin, and might now be a polymer. There are heat spreading compounds and shock absorption goo, which are often organic polymers. Some of these materials are naturally slightly radioactive. For example, in nature Lead (Pb-208) contains traces of Polonium (Po-212) which will emit alpha particles and decay into Pb-208. Similarly boron-10 is more prone to fission than boron-11 - or so they tell me, I've no idea why.

Modern chip packaging uses purified versions of these materials, processed to reduce the level of undesirable isotopes and leave inert material behind. This is an expensive process. Each new generation of silicon imposes more stringent requirements, making the materials even more costly. It is crucial that the correct materials be used... and this is where human error can creep in.

Using the wrong packaging materials increases alpha emissions to the point where the silicon will experience an unacceptable soft error rate. Yet to control costs it is also important to not overshoot the alpha emission requirements by using a more expensive material than necessary. So each chip design may use a different mix of materials depending on its process technology, die size, and the amount of SRAM it contains. The manufacturer will have checks in place to ensure the correct materials are used, but mistakes can occur. Sometimes a batch of chips is produced which suffers an unusually high soft error rate. When this happens the manufacturer is generally loath to admit it, and will quietly replace the chips with a corrected batch. One recent case where the problem was too widespread to cover up was with the cache of the Ultrasparc-II CPU from Sun Microsystems, see point #5 of an Actel FAQ on the topic for more details.


 
The Moral of the Story

The impossible triangle, a tribar

If you are working on low level software for a chip and run into bizarre errors, you should suspect a software problem first. Soft errors really are rare, and the manufacturing screwups described above are very uncommon. If the problem is repeatable, even if it has only happened twice, it is not a soft error. Particle strikes are too random for that. However if you keep running into different symptoms where you think "but that is impossible..." you should consider the possibility that the problem really is impossible and was introduced by a bitflip. You should start checking whether the problems are confined to a particular batch of parts, or only produced in a certain range of dates. If so, it could be that batch of parts has a problem with the packaging materials.


 
Other resources

While researching this post I came across a few additional sources of information which I found fascinating, but did not have a good place to link to them. They are presented here for your edification and bemusement.

  • An analysis of the high soft error rate of the ASC Q supercomputer at LANL. These errors were cosmic ray induced, not due to a badly packaged batch of chips. The part of the system in question did not implement ECC to correct errors, only parity to detect them and crash.
  • Cypress Semiconductor published a book to explain soft errors to their customers.
  • Fujitsu has a simulator to predict soft error rates, called NISES. The linked PDF is mostly in Japanese, but the images are fascinating and very illustrative.

This post was many years in the making.

Monday, August 24, 2009

Plummeting Down the Chasm

Crossing the Chasm book cover

Crossing the Chasm is a seminal book in technology marketing, whose ideas quickly spread through the industry. Originally written in 1991 by Geoffrey Moore, it showed a new take on the technology adoption lifecycle. The lifecycle starts with tech enthusiasts willing to buy an immature product and runs through the majority buyers who make up the bulk of the market, finally trailing off when market saturation is reached. It had been commonly depicted as a bell curve:


Technology Adoption Life Cycle

Moore's key observation is that while the bell curve implies there is a smooth transition from early market to majority, in reality the buyers in the early market are fundamentally different from the majority that comes later. Technology companies who don't appreciate this gap will stumble and often fail once they saturate the small pool of early buyers. Moore referred to this gap as "the chasm."

Technology Adoption Life Cycle

Early adopters are visionaries. They are willing to look at an immature product and figure out how to use it in their operations to get a competitive advantage. They will sponsor integration work within their IT departments, and generate long lists of product feedback to better fit their needs. They are fundamentally different from the majority buyers in that they will look at an interesting product and figure out what problem it can solve. The majority market comes from the opposite direction with a problem to solve, looking for a solution. Product planning and marketing which works in the early part of a product lifecycle will fail utterly later in the game.


 
Yon Chasm Approacheth

It is quite possible to get stuck in the early market, continuously trying to meet the needs of early adopters and never enjoying the big sales of the majority market. Having spent the last several years in this predicament, I'll offer my version of the lifecycle chart. What it lacks in precision, it makes up for in snark.


Rope bridge with gaps

As a development engineer it can be difficult to tell how well the sales cycle is working as one rarely gets direct visibility, but the indirect evidence is plentiful.

  • If nearly every deal comes with a list of new product features to be implemented, you have not crossed the chasm.
  • If, in every medium to large deal, it is unclear whether the cost of winning the business exceeds the revenue it would bring, you have not crossed the chasm.
  • If every deal is "high touch," requiring multiple visits by a salesperson and sales engineer and possibly a consultation with the development team, you have not crossed the chasm.
  • If your product cannot be sold via a web site but instead always requires an evaluation period and report, you have not crossed the chasm.
  • If every customer is using a different subset of the product functionality, you have not crossed the chasm.

This is important: if the company does not realize that the real problem is in the approach to the market, all of these things will be blamed on the product. The reasoning will be that "if we just pound out a few more of these deals, we'll have finally implemented everything that everybody wants and sales will take off." There may even be an element of truth in this sentiment, if the product shipped early before its natural feature set was complete. However, if the company has not crossed the chasm, the fundamental problem is elsewhere.

You are not getting a list of feature requirements because the product is incomplete, but because you are selling to the type of customer who generates lists of requirements.

You are getting that list of requirements because you are still selling into the visionary and early adopter segments of the market, the people who are willing to think about how best to integrate the product into their operations. If you were selling into the mainstream market there would be no list of requirements because the mainstream won't do an extensive integration on their own. The product either meets their needs or it doesn't, and you'll either get the sale or you won't. In the mainstream there will be no back and forth of what the product could do to win the business. At most, the mainstream buyer might tell you why you lost the business.

Don't get stuck wandering around in the chasm. Trust me, it sucks.

Tuesday, July 14, 2009

DRY and the DMV

The Pragmatic Programmer is one of the best books available concerning the development of quality software. It is structured as a series of tips, with illustrative examples and the occasional horror story. One of the first tips is the DRY principle:

DRY - Don't Repeat Yourself
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

DRY is often misinterpreted to mean simply that code should not be duplicated, but it is somewhat more subtle: don't duplicate state. If you have multiple different places in the code which keep state about an aspect of the system, and all places have to have the same content at all times for the system to work properly, then you have made maintenance of the system harder than it needs to be. You'll have to debug cases where the representations fall out of sync, and all such places must be updated at the same time when code changes are made. The Pragmatic Programmers extend the DRY principle outside of the code itself to include database schemas, documentation, and build systems. Everything should have one authoritative source.
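To make the "don't duplicate state" reading concrete, here is a minimal sketch in Python. The class and field names are hypothetical, invented purely for illustration. The first class keeps a running total alongside the list it is computed from, so there are two representations that must agree; the second derives the total on demand, so there is nothing to fall out of sync.

```python
from dataclasses import dataclass, field


@dataclass
class DuplicatedOrder:
    """Violates DRY: `total` is a second copy of what `prices` already says."""
    prices: list = field(default_factory=list)
    total: int = 0

    def add(self, price):
        self.prices.append(price)
        self.total += price  # every mutation must remember to update both


@dataclass
class Order:
    """One authoritative representation: the total is derived on demand."""
    prices: list = field(default_factory=list)

    def add(self, price):
        self.prices.append(price)

    @property
    def total(self):
        return sum(self.prices)


dup = DuplicatedOrder()
dup.add(10)
dup.prices.append(5)  # a code path that forgot about `total`
dup_in_sync = dup.total == sum(dup.prices)  # the two copies have diverged

order = Order()
order.add(10)
order.prices.append(5)  # same "careless" code path
order_in_sync = order.total == sum(order.prices)  # consistent by construction
```

The derived version costs a recomputation on every read. That is usually a fair price, and if it is not, the cache can at least be hidden behind a single place that owns the update.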

This brings us to the Department of Motor Vehicles, though the Gentle Reader might at first not see the connection. I received a form in the mail to renew my driver's license, which I promptly signed and sent back. The new license arrived in due course, and things were fine until a few weeks later when I noticed the address was incorrect. The old license is correct; the new one is wrong.

sample CA drivers license

I've no idea whether the address was correct on the renewal form; I did not check it. Apparently I should have, but I didn't bother - it hadn't changed. At some stage of the renewal process, a single digit was altered in a subtle way.

Why was it even possible for the address to be changed in the renewal process? Here we can only speculate. The DMV does need a procedure to update an address as part of a license renewal, because sometimes people supply a new one on the form. I'll speculate that the DMV, either via OCR or manual typing, re-enters the address in all cases and not just if the form supplied a change. This procedure depends on the original address being faithfully reproduced in cases where it wasn't supposed to change. In my case, due to either an OCR glitch or a typing error, a digit changed, resulting in the new license being printed with an incorrect address.

I believe this is an example of the consequences of a violation of the DRY principle. The same state - my address - exists in two places: in the DMV database and on the form. Those two pieces of state are supposed to be the same, indeed must be the same for the process to work correctly, but errors can easily occur which allow their contents to get out of sync.

A corollary lesson in this situation: if state isn't supposed to change, don't change it. If the form does not indicate a change of address, the authoritative state is in the database and the form contents should be ignored.
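That corollary can be sketched in a few lines of Python. Everything here is hypothetical: the `RenewalForm` fields, the `renew` function, and the dictionary standing in for the DMV database are invented for illustration, and the DMV's real process is unknown to me. The point is only that the address echoed on the form is never written back; the database changes solely when the form carries an explicit change request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RenewalForm:
    license_number: str
    printed_address: str               # echoed on the form, possibly mangled by OCR
    new_address: Optional[str] = None  # filled in only when the driver has moved


def renew(database: dict, form: RenewalForm) -> str:
    """Return the address to print on the renewed license."""
    if form.new_address is not None:
        # An explicit change request is the only thing allowed to write state.
        database[form.license_number] = form.new_address
    # form.printed_address is deliberately never read back into the database.
    return database[form.license_number]


db = {"D1234567": "123 Main St"}

# The echoed address picked up a bad digit, but no change was requested.
garbled = RenewalForm("D1234567", printed_address="128 Main St")
unchanged = renew(db, garbled)

# A genuine move: the driver filled in the change-of-address field.
moved = RenewalForm("D1234567", printed_address="123 Main St",
                    new_address="9 Elm Ave")
updated = renew(db, moved)
```

With this shape, a garbled echo of the old address is harmless, because nothing ever treats it as input.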


 
Aftereffects

I've already received a jury summons at the incorrect address, which the post office helpfully delivered to me anyway. Even after correcting the address I suspect I will receive a summons twice as often from now on. That will form the basis of a future blog post to illustrate the importance of duplicate suppression in databases, I suppose.

I can change my address back by submitting a form to the DMV, but issuing a new license with the correct address will be at my own cost. This is part of the price of modern life, I suppose.