Thursday, May 8, 2008

The High-Level CPU Response

This time, Gentle Readers, we will explore something completely different. It is still related to embedded systems, for a suitably generous definition of "embedded." Several weeks ago I read a posting by Yossi Kreinin regarding CPU architectures optimized for very high level languages, plus a followup on the same topic.

Yossi said:
Do you think high-level languages would run fast if the stock hardware weren't "brain-damaged"/"built to run C"/"a von Neumann machine (instead of some other wonderful thing)"? You do think so? I have a challenge for you. I bet you'll be interested.
...
My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?


Really, the challenge would not normally appeal to me:

  • I generally don't spend time on high-level languages. I make my living in the guts of the system writing assembly and C code, where Lisp/Haskell/Erlang/<ShinyNewLanguage>/etc do not play.
  • I like C. Really, I do. I like assembly coding too. Really, I do.
  • I think von Neumann architectures are fine.

However in the past few years I've worked on three different ASICs involving custom embedded CPUs for the datapath (more accurately: two CPUs and one glorified state machine), and Yossi's challenge is interesting in a contemporary problem space. So I'll give it a whirl. Be warned, Gentle Readers, that it will probably suck. You have been warned.


The instruction set

In 2008, I believe the instruction set is the wrong place to look for these performance gains. The overwhelming effort put into optimizing current CPU architectures represents an enormous barrier to entry for a new instruction set. Intel itself has not made much headway with Itanium, unable to compete successfully with its own x86 products. A decade ago RISC was hailed as the superior alternative, yet x86 now stands over the dead bodies of several of its erstwhile RISC competitors: Alpha and PA-RISC, for a start. Modern CPU instruction sets are instead pragmatically extended with new opcodes as needed, as witnessed by the sheer number of times Intel and AMD have added vector instructions to their x86 implementations.

Also (and crucially) CPU architectures so specialized as to only run code from a single programming language tend not to survive. There is a reason that Lisp Machines and the P-system CPUs no longer exist: they weren't terribly adaptable.

So I won't be suggesting a radically new instruction set.


Exploring the problem space

As I see it, there are three standout features of various high level languages which one might attempt to optimize for in the CPU design:

  1. dynamic typing: type checks are performed at run time, not at compile time as in C or Java
  2. stacklessness: languages like Erlang avoid per-thread stacks to support astonishing numbers of threads
  3. functional languages: functions have no side effects and there is no global state. The complete state a function operates on is passed to it, by value rather than by reference

For #1, Dynamic Typing: I don't think pushing type information down to be directly visible to the CPU is really productive. I'm quite likely wrong in this (there have certainly been a large number of processors over the years with tagged arithmetic operations), but I'm also not really qualified to comment further. So I won't.

For #2, Stacklessness: Many CPU designs have precious little hardware dedicated to "the stack." For example, in MIPS32 by convention two registers are used to store sp and fp, but they are really just normal registers and the compiler uses regular ADD/MOV/etc instructions to adjust them. Other CPU families may have more hardware supporting the concept of a stack, but I just don't think there is much to be gained by removing it. There is also a whole lot to be lost, if it is no longer possible to run stack-based languages like C.

For #3, Functional Languages: I believe this is the topic with the most traction, and it is where we'll focus for the rest of this writeup. Passing a large footprint of arguments by value implies a great deal of copying. Of course the runtimes for these languages make efforts to avoid unnecessary copies, but perhaps modifications to the CPU can help the runtime perform this function more efficiently.


Avoiding unnecessary copies

Though the programming model for a functional language may guarantee each function call receives its own independent copy of all arguments, in implementation the runtime will avoid copying large amounts of data unnecessarily. If a function only reads and never writes an argument, it can be allowed to reference the parent's copy of the argument. The crucial bit, of course, is ensuring that there is no code path which results in writing to the parent data.

If the functional language compiles down to machine instructions, then any possible code path which results in modification of an argument makes it unsafe to reference the parent data. Even if the language is implemented by a VM, there is overhead in checking each store bytecode at runtime to ensure its target is allowed.

In other problem spaces there is a well-understood technique for optimistically allowing a child a read-only reference to the parent's data: Copy On Write. The relevant pages of memory will be mapped read-only. If a process attempts to write to this page it will trap into the operating system, which will make a private copy of the page for the process to modify. If Copy on Write were available, the functional language runtime would have another option available.

The trouble with applying existing Copy on Write mechanisms to functional languages is the sheer frequency of mappings. The current uses of Copy on Write are infrequent: process creation, periodic (but rare) snapshots of data, etc. The overhead of calling into the Operating System to change page tables is acceptable for these uses. The overhead of calling into the OS to change page tables for every function call would likely be prohibitive.


Overview of the proposal

"There is no problem in computer programming which cannot be solved by an added level of indirection." -- Dr Maurice Wilkes
64-bit CPUs are now the norm in the general-purpose computing market, providing a truly staggering amount of virtual address space. If application code could manage a section of its own address space with very low overhead, it could create and destroy mappings as needed. For cases where it makes sense, the language runtime could set up a read-only alias mapping of the parent data and pass the alias address to the function. If the function attempts to modify its arguments there would be a trap, and a copy could be made.

Before delving further, let's examine when it makes sense to do this:

  • if the function cannot possibly write to its arguments, and the runtime can be certain of this, there is no reason to create an aliased mapping. Just pass in the parent's data.
  • if one or more of the arguments to the function are relatively small, just copy them. The brute force solution is cheaper than trying to finesse it.
  • if the function might modify its arguments and they are large, we have a candidate for an aliased argument.
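The three cases above amount to a simple per-argument decision in the runtime. A minimal sketch, with a purely hypothetical size threshold (a real runtime would tune this empirically):

```c
#include <stdbool.h>
#include <stddef.h>

enum pass_strategy {
    PASS_DIRECT,  /* share the parent's data outright */
    PASS_COPY,    /* brute-force copy */
    PASS_ALIAS    /* read-only alias mapping via the Aliasing TLB */
};

#define ALIAS_THRESHOLD (64 * 1024)  /* hypothetical cutoff, in bytes */

/* Decide how the runtime passes one argument, per the three cases above. */
enum pass_strategy choose_strategy(size_t size, bool may_write) {
    if (!may_write)
        return PASS_DIRECT;   /* provably read-only: no mapping needed */
    if (size < ALIAS_THRESHOLD)
        return PASS_COPY;     /* small: copying is cheaper than finesse */
    return PASS_ALIAS;        /* large and possibly written: alias it */
}
```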

Aliasing TLB

What would the hardware need to provide, the Gentle Reader might ask? Nothing much really, just a second MMU TLB. Simple. Ok, I kid, that's actually an enormous burden to place on the hardware, but I think it might be worthwhile. This is how I think it would work:

  1. I'll refer to the new TLB as the Aliasing TLB, which is looked up first. I'll refer to the existing MMU as the System TLB.
  2. The CPU has a pair of limit registers, giving an upper and lower bound of addresses to be looked up in the Aliasing TLB. Most processes would probably not use the Aliasing TLB at all, and the limit registers would be zero.
  3. Applications using this mechanism would be given a section of their address space to manage - an exabyte should do. Whenever there is a context switch, system code (i.e. the kernel) updates the limit registers for the new running process.
  4. Memory operations whose target is within the range of the limit registers would be looked up in the Aliasing TLB. We'll come back to the miss case later, for now assume an entry is found.
  5. The result of the Aliasing TLB lookup is a mapping to another virtual address, plus permission bits.

If the permissions allow the operation, the CPU now has a new virtual address to use as the target of the memory operation. This new virtual address (labelled Va in the diagram) proceeds down the "traditional" CPU MMU path, searching the System TLB to find the physical address. If the original target address was not within the range of the limit registers, then the unmodified virtual address goes to the System TLB to be translated to a physical address.
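The lookup sequence just described can be modeled in software. This is a toy sketch with invented types and a small fully associative table; real hardware would perform these steps in parallel in the pipeline, not in a loop:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLB_SLOTS  16

/* One entry: maps a virtual page to a target page, plus a writable bit.
 * For the Aliasing TLB the target is another virtual page; for the
 * System TLB it is a physical page. */
struct tlb_entry { uint64_t vpage, target_page; bool writable, valid; };
struct tlb { struct tlb_entry e[TLB_SLOTS]; };

static bool tlb_lookup(const struct tlb *t, uint64_t va, bool write,
                       uint64_t *out) {
    uint64_t vpage = va >> PAGE_SHIFT;
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (t->e[i].valid && t->e[i].vpage == vpage) {
            if (write && !t->e[i].writable)
                return false;  /* permission trap: time to copy */
            *out = (t->e[i].target_page << PAGE_SHIFT)
                 | (va & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;  /* TLB miss */
}

/* Consult the Aliasing TLB only for targets inside [lo, hi),
 * then always finish in the System TLB. */
bool translate(const struct tlb *aliasing, const struct tlb *system,
               uint64_t lo, uint64_t hi, uint64_t va, bool write,
               uint64_t *pa) {
    if (va >= lo && va < hi) {
        uint64_t aliased;
        if (!tlb_lookup(aliasing, va, write, &aliased))
            return false;  /* Aliasing TLB miss or write trap */
        va = aliased;      /* new virtual address, never re-aliased */
    }
    return tlb_lookup(system, va, write, pa);  /* virtual -> physical */
}

/* Tiny demo: alias vpage 0x200 -> vpage 0x300 read-only,
 * System TLB maps vpage 0x300 -> physical page 0x40. */
uint64_t demo_translate(void) {
    struct tlb alias = {{{0}}}, sys = {{{0}}};
    alias.e[0] = (struct tlb_entry){0x200, 0x300, false, true};
    sys.e[0]   = (struct tlb_entry){0x300, 0x40, true, true};
    uint64_t pa = 0;
    translate(&alias, &sys, 0x200000, 0x201000, 0x200123, false, &pa);
    return pa;
}
```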


Handling a miss

Everything is fine and dandy if the CPU finds a matching entry in the Aliasing TLB, but what if it doesn't? I'm very fond of software-managed TLBs like MIPS and Sparc v9, and had initially written up a section on how the CPU could directly vector over to a TLB miss handler in the application's address space. The handler would need to be locked into memory so it could never be paged out, and immediately switch to a pre-allocated stack specifically for TLB handling, and ... yeah, it was really complicated.

On further reflection, I don't think it needs to be that complicated. There are two characteristics of the Aliasing TLB which are different from the System TLB: entries are added to the Aliasing TLB immediately before they are used, and the entries only live for the span of one function call - generally speaking, a very short lifespan.

When the OS adds a new entry to the page tables, the entry is generally not loaded into the System TLB right away; it is faulted in the first time the address is touched. With the Aliasing TLB the opposite should be true: when a new entry is added it should be pushed into the hardware immediately, before resuming processing. We're expecting to hit that address right away, and eager insertion would dramatically reduce the number of Aliasing TLB misses. Perhaps the misses which remain can take a more normal path of trapping into the OS, which would perform an upcall to the application to service the Aliasing TLB miss.
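The eager-fill discipline and the miss-time refill path might look like this in the runtime. All names are invented, and `tlbwrite` stands in for a hypothetical instruction that loads one hardware TLB entry:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 16

struct alias_entry { uint64_t vpage, target_page; bool valid; };

static struct alias_entry sw_table[ENTRIES];  /* runtime's own mapping table */
static struct alias_entry hw_tlb[ENTRIES];    /* model of the hardware TLB */
static size_t next_victim;

/* Stand-in for a hypothetical instruction writing one hardware TLB entry. */
static void tlbwrite(size_t slot, struct alias_entry e) { hw_tlb[slot] = e; }

/* Create a mapping and push it into hardware immediately: the very next
 * instructions will touch this address, so lazy fill would only
 * guarantee a miss. */
void alias_map_eager(uint64_t vpage, uint64_t target_page) {
    struct alias_entry e = { vpage, target_page, true };
    size_t slot = next_victim++ % ENTRIES;  /* simple round-robin victim */
    sw_table[slot] = e;                     /* kept for refill after eviction */
    tlbwrite(slot, e);                      /* eager preload: no first miss */
}

/* A residual miss (entry evicted) traps to the OS, which upcalls the
 * runtime; the runtime refills from its software table. */
bool alias_refill(uint64_t vpage) {
    for (size_t i = 0; i < ENTRIES; i++) {
        if (sw_table[i].valid && sw_table[i].vpage == vpage) {
            tlbwrite(i, sw_table[i]);
            return true;
        }
    }
    return false;  /* no such mapping: a genuine fault */
}
```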


Page sizes

Function arguments can be of any size and alignment, and may not be aligned to any particular boundary. An earlier draft of this posting proposed segmented memory support, to handle arbitrary argument sizes. On further reflection, I don't think it needs to be that complicated. The Aliasing TLB mechanism only makes sense for relatively large arguments, and I think it is a reasonable tradeoff to require the runtime to pad such arguments to be an even multiple of a page size.

I do think it is important to support multiple page sizes, with the normal page size of the System TLB as a lower bound and truly enormous pages at the upper end. CPU MMUs have supported multiple page sizes for a long time, though Operating Systems generally only use one size for all memory pages. This is to avoid the complexity of fragmentation if one tried to actively page out memory with multiple page sizes in use. Some OSes do use large pages to map in things which can never be paged out, like the framebuffer.

Page replacement is not a concern for the Aliasing TLB: it maps one virtual address to another, so there is no concept of removing a page from memory. If we have a 128 megabyte argument to map, we might as well use a 128 megabyte page to map it; assembling a bunch of small mappings to cover a huge region is only beneficial in very specific cases. The software should take full advantage of multiple page sizes in the Aliasing TLB.
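With no paging pressure, the size-selection policy collapses to "smallest supported page that covers the whole argument in one entry." A sketch, with a hypothetical set of supported sizes (real hardware offers a fixed menu rather than arbitrary sizes like an exact 128 MB page):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Aliasing TLB page sizes, smallest to largest. The smallest
 * matches the System TLB's base page; the largest is truly enormous. */
static const uint64_t page_sizes[] = {
    4096ULL,        /* 4 KB  */
    2ULL << 20,     /* 2 MB  */
    1ULL << 30,     /* 1 GB  */
    512ULL << 30,   /* 512 GB */
};
#define NSIZES (sizeof page_sizes / sizeof page_sizes[0])

/* Pick the smallest page covering the whole argument: one entry per
 * argument, and since nothing is ever paged out, a huge page costs
 * nothing extra. */
uint64_t pick_page_size(uint64_t arg_bytes) {
    for (size_t i = 0; i < NSIZES; i++)
        if (arg_bytes <= page_sizes[i])
            return page_sizes[i];
    return page_sizes[NSIZES - 1];  /* larger still: use multiple entries */
}
```

The runtime would then pad the argument to that page size, as the tradeoff above requires.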


Miscellaneous notes and drawbacks
  • The most serious downside of the Aliasing TLB is the latency it adds to instruction processing. To hide that latency the CPU will need additional pipeline stages, possibly many of them. Nobody wants a repeat of the superpipelining of the Pentium 4; the additional pipelining may be a fatal drawback of this idea.
  • We'll need Address Space Identifiers (ASID) for the Aliasing TLB. The OS must populate the ASID of the currently running process in a CPU register, and any mappings created by the user code must inherit this ASID. If we allow application code to supply the ASID we'll destroy the security of the system: it could set up nearly arbitrary mappings in another process.
  • The map addresses in the Aliasing TLB are always looked up in the System TLB, never recursively looked up in the Aliasing TLB even if they fall within the specified range of addresses. I do not believe multiple levels of indirection add any capability to the mechanism, and the hardware complexity in the pipelining would be prohibitive.
  • I don't think it is reasonable to expect application code to deal with page coloring, so a CPU implementing an Aliasing TLB should use physically indexed caches (or some other solution which does not burden the software with coloring). This is probably not an issue; I believe page coloring has not been employed by any common processor for about ten years. A long time ago I worked on software for a Sparc processor which required it, and I still bear the scars.

The disclaimer

Would an Aliasing TLB really boost the performance of a functional language runtime, and by how much? I've no idea. That is the sort of question best answered by an analysis of the behavior of existing functional language runtimes. Analyzing instruction and memory traces is the seventh most boring task in all of computer science, which means it is very, very boring. Since I'm writing this for a blog post, I choose to skip the boring and tedious stuff in favor of emphatic handwaving.

I'm not sure I've actually addressed the challenge Yossi issued. He was pretty clearly seeking CPU instruction set ideas, and I haven't provided any. Also the challenge was issued in late January and subsequent comments indicate it is pretty much wrapped up, so I'm late to the party. Nonetheless it was interesting to write this up, so to me the time was well spent.