Wednesday, January 5, 2011

Overlays Not Yet Extinct

In 2010 both Ian Lance Taylor and Dave Miller wrote about STT_GNU_IFUNC, an ELF symbol type supported by the GNU compiler and linker. These symbols are functions, but do not appear at a fixed address. Instead STT_GNU_IFUNC is a function which returns an address to the function you actually want to call. STT_GNU_IFUNC solves a common problem in supporting platform variations which are similar enough to use the same binary, but different enough to benefit from specific optimizations. A good example is block copy operations like memcpy(). Depending on the CPU one might be able to use SIMD instructions like SSE or MMX, a block move engine in a cache controller, or one of a number of different unrolled loops optimized for specific CPU pipelines.

STT_GNU_IFUNC is the new hotness, but I'm going to describe a different technique for accomplishing similar functionality using the program loader, i.e. the code which loads program text into memory before execution. This might be a boot ROM environment like Das U-Boot, or in the case of a datapath CPU it might be a process running on a separate control CPU. My use of this technique was the latter case, where a deeply embedded CPU in the datapath had its code loaded into its memory by control software running elsewhere.

The key technique in this scheme uses overlays. Yes, overlays.


 
Set Wayback Machine to 1974

DEC PDP11/20 illustration of program overlays The typical minicomputer in the 1970s had tens to hundreds of kilobytes of memory. Though programs were dramatically smaller, available RAM was still a significant limitation and practices to maximize RAM efficiency were common. One common technique was overlays: multiple segments of code compiled at the same address to be loaded when needed and replaced later. As only one such code segment could be present in memory at a time, the different overlays could not reference each other and ideally would not need to be swapped very often. A common use was to put initialization-specific code in an overlay and replace it with other code once the init was done.

Note that this was not the same as virtual memory. Some minicomputers of the time had virtual addressing hardware, but it was not universal. Overlays were simply regions of memory which the application would overwrite. Modern CPUs and practices make this use of overlays more difficult, with program text generally mapped read-only and large instruction caches which need to be flushed of stale contents. Overlays as a practice are now practically unknown on any system with virtual memory, we rely on the VM system to kick unneeded code out of RAM.


 
Overlays Today

The GNU linker nonetheless still has support for overlays. A linker script can specify a group of ELF sections to appear at the same address. The linker provides no help in managing the overlay segments in memory, this is left entirely to the developer. We will use this linker support to provide multiple implementations of an API, tuned for different CPUs.

The first step is to implement the code for each CPU. In this example we'll use something trivial, a function which returns a different integer value for each platform. Two attributes are added to each implementation:

  • noinline - we want to choose which version of the function to load at runtime. This cannot work if the compiler inlines the first one it finds.
  • section - each implementation of the function goes in its own ELF section. This will be discussed later.
int foo_v0() __attribute__((noinline,section(".s_foo_v0")));
int foo_v0() {
  return 0;
}

int foo_v1() __attribute__((noinline,section(".s_foo_v1")));
int foo_v1() {
  return 1;
}

int foo_v2() __attribute__((noinline,section(".s_foo_v2")));
int foo_v2() {
  return 2;
}

illustration of multiple functions in a section Each ELF section can contain exactly one function which is to be called from elsewhere in the code. Defining multiple functions in one ELF section doesn't work with this technique: we rely on placing all versions of the function at the same address. If there are multiple functions they can be at different offsets, so code elsewhere in the program won't have the correct address to call. It would be possible to define multiple functions which are only ever called from other routines in the section, this is left as an exercise to the reader.

Similarly, though the implementations can differ substantially the function signature has to be exactly the same for all variants. They must return the same type and take the same arguments, even if that means some of the variants have an argument which they ignore.

If we examine the object file we can see the sections we defined (other sections omitted for brevity):

$ objdump --section-headers foo.o
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000031  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 s_foo_v1      0000000b  0000000000000000  0000000000000000  00000074  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  4 s_foo_v2      0000000b  0000000000000000  0000000000000000  0000007f  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  6 s_foo_v0      0000001d  0000000000000000  0000000000000000  0000009b  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

 
OVERLAY : NOCROSSREFS

The next step is the crucial one, linking the binary. We're going to use a custom linker script which tells ld to arrange those sections as an overlay. The linker can take only one script as input. To enable our overlay, we must also support everything else the linker normally does for binaries on this platform. If you're not already using a linker script, you need to retrieve the default script for your platform using ld --verbose and look for a SECTIONS block in which to add the handling. A snippet of my linker script is shown here, with the added text bolded.

SECTIONS
{
  ... bunch of stuff ...

  .text : {
    ...
  }

  PROVIDE (foo = .);
  OVERLAY : NOCROSSREFS
  {
    .foo_v0 { *(.s_foo_v0) }
    .foo_v1 { *(.s_foo_v1) }
    .foo_v2 { *(.s_foo_v2) }
  }

  .fini : {
    ...
  }

We've defined an overlay with three ELF sections as members. NOCROSSREFS means the linker will flag an error if one of the overlay sections references a symbol in one of the other sections.

This script is passed to the linker using a -T argument. If not using a separate linking step, pass "-Wl,-Tld.script" to gcc instead. If we disassemble the resulting binary we see all three routines are linked at the same address:

$ objdump -d ./a.out
00000000004006e0 :
  4006e0: 31 c0                 xor    %eax,%eax
  4006e2: c3                    retq   

00000000004006e0 :
  4006e0: b8 01 00 00 00        mov    $0x1,%eax
  4006e5: c3                    retq   

00000000004006e0 :
  4006e0: b8 02 00 00 00        mov    $0x2,%eax
  4006e5: c3                    retq   


$ nm --numeric-sort ./a.out
0000000000400594 T main
00000000004006e0 A foo
00000000004006e0 T foo_v0
00000000004006e0 T foo_v1
00000000004006e0 T foo_v2

 
Commence Handwaving

illustration of multiple variants of foo()At this point I have to merely describe what would happen next, as I don't have sample code to show. The target CPU has some mechanism to load its code into memory. It might bootstrap itself using a boot loader, or it might be loaded by an external supervisor CPU. This loader would need to decide which of the overlay sections to load. It might be as simple as a naming convention, for example having platform-specific ELF sections end in "_v#" and loading only those appropriate for the platform.

What we end up with is the platform independent code calling a symbol named foo, at 0x4006e0. That code is not concerned with what will be found at that address. That it contains different code depending on platform has no impact on the callers.


 
Downsides

There are several downsides to this technique.

  1. It is cumbersome to support many such functions. Each one requires a new OVERLAY block in the linker script, and a different set of __attributes__ in the code. My recommendation is to only do this where performance is really crucial. For the typical init code or uncommon API, a switch (platform) will be fine.
  2. The debugger has no idea what is going on. If you ask gdb to disassemble this routine it will show the correct instructions, but any source line numbers it prints will be wrong.
  3. Source code management tools also have no idea what is going on. Asking for the definition of foo() will either fail, or turn up the wrong code. The developer has to know which version of code will be used on a given platform.

Nonetheless this technique proved useful to me in the past, and I hope it will be useful to someone in the future.


 
Closing Thoughts

STT_GNU_IFUNC is a more general solution to this sort of problem, and far less cumbersome to support. There is slightly more overhead to STT_GNU_IFUNC as it involves an extra call to retrieve the address of the function to call, but I suspect even this could someday be alleviated by dynamically rewriting the PLT (Procedure Linkage Table) with the resulting address. If I recall correctly the Solaris linker does rewrite the PLT, it seems a viable technique.