Thursday, July 8, 2010

Virtual Instruction Sets: Opcode Arguments

[Image: It's a virtual CPU fan, get it?]

Virtual machine architectures are a fascinating topic, and one that I plan to occasionally explore in this space. Not virtual machines in the sense of VMware or Xen, but rather the runtime environment for a programming language like Java or Python. This time we'll focus on the structure of the instruction set, in particular on how operands are passed and stored. Why are these low-level details important?

  • Traditional compilers emit instruction sequences without knowing anything about the specific CPU model, system configuration, or input data to be processed.
  • The compiler can optimize for a specific CPU pipeline, and may even produce multiple binaries for different CPUs, but as a practical matter it cannot produce a large number of variations due to the sheer size of the final binary image.
  • Profile-driven compilation can optimize for representative data supplied during the build phase, but representative data is always a guess and a compromise. As a practical matter, it's also difficult to use profile-driven optimization for many applications, such as GUIs.
  • Only a JIT inside a virtual machine has the luxury of knowing the specific CPU and system configuration, and of having profiling information from the current input data.

The hardware CPU architectures we use now have evolved in lockstep with compiler technology, and mostly C/C++ compilers at that. They have enormous instruction and data caches (I$ and D$) because the compiler cannot predict very much about what it will execute next. The hardware has extensive branch prediction logic and history tracking because compilers emit an average of one branch every seven instructions.

Virtual machines change everything: by profiling the running code they can produce instructions tailored to the specific workload, resulting in long sequences of very predictable opcodes without branches or conditionals. This has the potential to change hardware architectures, once we pass a tipping point where most of the workload runs within a VM. I suspect this tipping point will be reached in mobile devices well before it impacts workstations, laptops, or servers.

This rosy prediction is by no means certain. The JITs in most current VMs compile a function the first time it is called. They can optimize for the CPU, and possibly even take memory size into account, but they don't use any profiling information. Thus the JIT can in principle benefit from compiling for the specific CPU pipeline on which it runs, though in practice even this isn't typically done. So far as I know, of the VMs discussed here only Mozilla's SpiderMonkey uses tracing to produce routines specifically optimized for the input data being processed.

We're going to examine seven virtual machines, focusing on how operands are passed: the JVM, the CLR, SpiderMonkey, LLVM, Parrot, V8, and Dalvik.


 
JVM & CLR

[Image: JVM argument stack]

The Java Virtual Machine and the Common Language Runtime used by .NET are certainly very different, but as virtual machines go they have a lot in common. Both are stack based: operands to an instruction are popped from the stack, and the result is pushed.
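
To make that concrete, here is a minimal sketch of a stack-machine interpreter loop in C. The opcode names and encoding are hypothetical, not the actual JVM or CLR instruction set; the point is that the add instruction names no operand locations at all.

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };   /* hypothetical opcodes */

    static void run(const int32_t *code) {
        int32_t stack[64];
        int sp = 0;                          /* next free stack slot */
        for (const int32_t *pc = code; ; pc++) {
            switch (*pc) {
            case OP_PUSH:                    /* operand follows the opcode */
                stack[sp++] = *++pc;
                break;
            case OP_ADD: {                   /* pop two operands, push result */
                int32_t b = stack[--sp];
                int32_t a = stack[--sp];
                stack[sp++] = a + b;
                break;
            }
            case OP_PRINT:
                printf("%d\n", stack[sp - 1]);
                break;
            case OP_HALT:
                return;
            }
        }
    }

    int main(void) {
        const int32_t program[] = { OP_PUSH, 2, OP_PUSH, 3,
                                    OP_ADD, OP_PRINT, OP_HALT };
        run(program);                        /* prints 5 */
        return 0;
    }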

Stack based virtual machines are relatively common, because they are conceptually very simple. Indeed many early microprocessors and microcontrollers were stack based, because the silicon technology of the day wouldn't allow a CPU with a generous number of registers on the die. In that sense virtual CPUs are following the same evolutionary path as hardware CPUs did several decades ago, starting with stack based machines and adding registers later.

Stack-based instruction sets tend to have very high code density, because their opcodes don't need to encode source and destination register numbers. When the JVM was developed in the early 1990s, processor caches were measured in the tens of kilobytes, so a densely packed bytecode was a big advantage: far more of it could fit in the hardware CPU's data cache.
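
A sketch of why that density advantage arises, using made-up encodings rather than real JVM or CLR bytes: the stack machine's add is a single opcode byte, while a three-operand register machine must spend additional bits naming a destination and two sources.

    /* Hypothetical encodings -- not any real instruction set.         */

    /* Stack machine: "add" pops two values and pushes the sum. There  */
    /* are no operand locations to encode, so one byte suffices.       */
    const unsigned char stack_add[] = { 0x60 };                  /* 1 byte  */

    /* Register machine: "add r3, r1, r2" names a destination and two  */
    /* sources; with 8-bit register fields that takes 4 bytes.         */
    const unsigned char reg_add[] = { 0x60, 0x03, 0x01, 0x02 };  /* 4 bytes */

The saving compounds across a long expression, because intermediate results on a stack machine flow implicitly from one instruction to the next.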


 
Spidermonkey (Firefox)

SpiderMonkey is the JavaScript engine in Firefox, and is a stack-based machine like the JVM and CLR. What I find most interesting about SpiderMonkey is that it tackled profile-driven JIT optimization first, via TraceMonkey in the latter part of 2008. A more conventional method-compiling JIT came later, via JaegerMonkey in early 2010. The virtue of doing things in this order is pretty compelling: tracing, when it works, can deliver spectacular gains. However tracing really only helps with loops, leaving lots of low-hanging fruit for a method-based JIT. Doing the method-based JIT first would have made it more difficult to get the profiling information which tracing needs; by doing TraceMonkey first, its instrumentation needs became part of the requirements for JaegerMonkey.


 
LLVM

The LLVM project is first and foremost a compiler toolchain, and the LLVM instruction set was designed as the intermediate representation between the language-specific frontend and more generic backend. It defines a register based virtual machine with an interesting twist: an infinite number of registers. In keeping with its role as a compiler intermediate representation, LLVM registers support static single assignment (SSA) form: a register holds exactly one value and is never reassigned, making it easy for subsequent passes to determine whether values are live or can be eliminated.
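
Here is a minimal sketch of what single assignment over unbounded registers looks like in practice. The three-address output below is illustrative pseudo-IR, not real LLVM IR: every result goes into a fresh register, so each register has exactly one definition.

    #include <stdio.h>

    static int next_reg = 1;   /* unbounded virtual register numbers */

    /* Emit a three-address instruction into a fresh register. */
    static int emit(const char *op, int lhs, int rhs) {
        int dst = next_reg++;
        printf("  %%r%d = %s %%r%d, %%r%d\n", dst, op, lhs, rhs);
        return dst;
    }

    int main(void) {
        /* Compile "(a + b) * (a - b)", with a in %r1 and b in %r2. */
        next_reg = 3;
        int sum  = emit("add", 1, 2);   /* %r3 = add %r1, %r2 */
        int diff = emit("sub", 1, 2);   /* %r4 = sub %r1, %r2 */
        emit("mul", sum, diff);         /* %r5 = mul %r3, %r4 */
        return 0;
    }

Because %r3 is never written again, a later pass can see at a glance where its value is born and where it dies.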


 
Parrot

Parrot is also a register based virtual machine. It defines four types of registers:

  1. Integers
  2. Numbers (i.e. floating point)
  3. Strings
  4. Polymorphic Containers (PMCs), which reference complex types and structures

Like LLVM, Parrot does not define a maximum number of registers: each function uses as many registers as it needs. Rather than re-using a register for a different purpose by spilling its old value to memory, a function simply names a new register number. The Parrot runtime handles the assignment of virtual machine registers to CPU registers.

So far as I can tell, integer registers are the width of the host CPU on which the VM is running. A Parrot bytecode might find itself using either 32- or 64-bit integer registers, determined at runtime rather than at compile time. This is fascinating if correct, though it seems like BigNum handling would be somewhat complicated by it.
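
A sketch of what a per-function register file with Parrot's four types might look like; the structure and field names are my own guesses for illustration, not Parrot's actual internals. Note the integer registers taking the host word width:

    #include <stdint.h>

    /* Hypothetical register file with Parrot's four register types. */
    struct reg_file {
        intptr_t  *int_regs;   /* host-word-width integers (32 or 64 bit) */
        double    *num_regs;   /* floating point "numbers"                */
        char     **str_regs;   /* strings                                 */
        void     **pmc_regs;   /* Polymorphic Containers (PMCs)           */
        unsigned   n_int, n_num, n_str, n_pmc;   /* sized per function    */
    };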


 
V8 (Chrome)

V8 is the JavaScript engine in the Chrome browser from Google. It's a bit of a misnomer to call V8 a virtual machine: it compiles the JavaScript source for a method directly to machine code the first time that method is executed. There is no intermediate bytecode, and no interpreter. This is an interesting design choice, but for the purposes of this article there isn't much to say about V8.


 
Dalvik (Android)

[Image: Dalvik virtual machine registers]

Dalvik is the virtual machine for Android application code. The Dalvik instruction set implements an interesting compromise: it is register based, but with a finite number of registers, as opposed to the theoretically infinite registers of LLVM or Parrot. Dalvik supports 65,536 registers, a vast number compared to hardware CPUs and presumably sufficient to implement SSA (if desired) in reasonably large functions.

Even more interestingly, not all Dalvik instructions can access all registers. Many Dalvik instructions dedicate 4 bits to each register number, requiring their operands to be stored in the first 16 registers. A few more instructions use an 8-bit register number, to access the first 256. There are also instructions to copy a value between any of the 65,536 registers and a low register, for a subsequent instruction to access.
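
A sketch of how this plays out in the encoding, based on my reading of the published Dalvik instruction formats (treat the specific opcodes as illustrative): a two-register instruction packs the opcode and two 4-bit register numbers into one 16-bit code unit, while a move-from-high-register instruction spends a second code unit on a full 16-bit register number.

    #include <stdint.h>
    #include <stdio.h>

    /* One 16-bit code unit, "12x"-style format:                   */
    /*   bits 0-7: opcode, bits 8-11: vA, bits 12-15: vB           */
    /* Only registers v0..v15 are reachable this way.              */
    static void decode_12x(uint16_t unit) {
        unsigned op = unit & 0xff;
        unsigned vA = (unit >> 8) & 0xf;
        unsigned vB = (unit >> 12) & 0xf;
        printf("op 0x%02x v%u, v%u\n", op, vA, vB);
    }

    /* "22x"-style format, e.g. move/from16 vAA, vBBBB: a second   */
    /* code unit carries a full 16-bit source register number      */
    /* (v0..v65535), copied into one of the first 256 registers.   */
    static void decode_22x(uint16_t unit0, uint16_t unit1) {
        unsigned op    = unit0 & 0xff;
        unsigned vAA   = (unit0 >> 8) & 0xff;
        unsigned vBBBB = unit1;
        printf("op 0x%02x v%u, v%u\n", op, vAA, vBBBB);
    }

    int main(void) {
        decode_12x(0x32b0);          /* op 0xb0, vA=2, vB=3  */
        decode_22x(0x0402, 0x0101);  /* copy v257 down to v4 */
        return 0;
    }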

It took me a while to understand the rationale for this choice, and I'm still not confident I fully get it. Clearly the Dalvik designers believe that keeping data in one of the high registers will be faster than explicitly storing it to memory, even if most of that vast register file ends up residing in RAM. Addressing data by register number instead of memory address should make it easier for the VM to dynamically remap Dalvik registers onto the real hardware registers. For example, if the VM can predict that virtual register 257 will likely be used in the near future, it can keep that register in a CPU register rather than immediately storing it to memory.


 
Other VMs

There are many, many more virtual machine implementations beyond the ones examined here. The Python, Smalltalk, and Lua programming languages each have their own VM instruction set and implementation. Erlang started with a VM called JAM, and later reimplemented its underpinnings in a new virtual machine called BEAM. Adobe Flash has a VM which has been open sourced and donated to the Mozilla project as Tamarin. Wikipedia lists brief descriptions of a number of current VMs.