Monday, January 31, 2011

Updating the Zodiac

Noting the impact of date corrections in stoking interest, Astrologers have announced further changes to the Zodiac for 2012. Instead of just 12 signs roughly corresponding to a month each there will be 8,766 signs, one for each hour of the year. The Hubble space telescope has provided ample stellar data to make this possible. Additionally to keep the Zodiac fresh, periodically old signs will be retired and new ones introduced.

Companies wishing to sponsor their logo as a new Zodiacal symbol should contact the International Union of Astrology and Tarot. One, five, and twenty year plans are available.

Hubble starfield image superimposed with decepticon, tron recognizer, wizard hat

Thursday, January 27, 2011

C++ POD Member Handling

I always mess up the initialization of plain old data fields in C++. Always. Maybe by writing it up, I'll finally get it right.

Plain Old Data is essentially anything a regular C compiler could compile. That is:

  • integer and floating point numbers (including bool, though it isn't in C)
  • enums
  • pointers, including pointers to objects and pointers to functions
  • some aggregate data structures (structs, unions, and classes)

A struct, union, or class is treated as plain old data if it has only the default constructor and destructor, has no protected or private member variables, does not inherit from a base class, and has no virtual functions. I suspect most C++ programmers have an intuitive feel for when a class behaves like an object and when its just a collection of data. That intuition is pretty good in this case.

The default constructor for plain old data leaves it uninitialized. An explicit constructor sets it to zero.

class Foo {
  Foo() {}
  int a_;
Result: a_ is uninitialized.
class Foo {
  Foo() : a_() {}
  int a_;
Result: the member corresponding to a_ is zeroed.
Were it a structure, the entire thing would be zero.

People are often confused by the first point, that member POD fields are left uninitialized unless specifically listed in the initializer list. This is not the same as for member objects, which call the default constructor. Making this even more confusing, when a process starts up any pages it gets from the OS will be zeroed out to prevent information leakage. So if you look at the first few objects allocated, there is a better than average chance that all of the member variables will be zero. Unfortunately once the process has run for a while and dirtied some of its own pages, it will start getting objects where the POD variables contain junk.

Struct initializers

Putting a POD struct in the initializer list results in zeroing the struct. If the struct needs to contain non-zero data, C++0x adds a useful capability:

struct bar {
  int y;
  int z;

class Foo {
  Foo() : b_({1, 2}) {}
  struct bar b_;

Recent versions of gcc implement this handling, though a warning will be issued unless the -std=c++0x or -std=gnu++0x command line flag is given.

Monday, January 24, 2011

Automotive Easter Eggs

Automotive software has some great opportunities for easter eggs.

OFF ON CAN Richen fuel mixture 20%
VOL ↑ VOL ↑ VOL ↑ VOL ↓ HAZARD Tune audio to Sirius headbanger channel
HORN Frickin' lasers!

Steering wheel control buttons

Saturday, January 22, 2011

IPv4 Addresses Almost Depleted

According to both and, the IANA is within days of allocating its last block of IPv4 addresses to the regional internet registries. In fact it might have already happened, though there has been no announcement of it yet.

This means the global pool of IPv4 addresses has been exhausted, and all remaining IPv4 addresses have been allocated to a particular region of the planet. The regional authorities will start to run out in 2011 and 2012.

IPv6. It's on.

Thursday, January 13, 2011

Microsoft and ARM

In July 2010 Microsoft signed an architecture license with ARM. In January 2011 Microsoft announced that Windows 8 will run on ARM CPUs. So the license was purchased to support the Windows development effort, right?

I really don't think so.

Porting Windows to a new processor architecture is a massive undertaking. To its credit, Microsoft has maintained the discipline to keep the OS from becoming too entangled with the underlying platform. At various points in its history Windows NT has run on Alpha, MIPS, PowerPC, and Itanium. All but Itanium are long discontinued, and given the enormous codebase and inertia adding a new instruction set would take quite a bit of effort. If Microsoft needed an architectural license to proceed, they needed it more than 6 months before a public demonstration of the result.

Additionally an architectural license is not required to port software to ARM, not even for software as extensive as NT. ARM will provide the necessary support via other, less spectacular arrangements. An architectural license allows the licensor to develop their own implementation of the instruction set, either completely independently or as a substantial modification to a core supplied by ARM. Most producers of ARM chips don't have an architectural license; they don't need one to add peripherals and coprocessors around an unmodified ARM core.

Microsoft is engaging with ARM on multiple fronts, and as it involves Windows (and Office) it would be at the CxO level. I think the ARM license is for Xbox, not for Windows Mobile and not for the NT port. In other products Microsoft relies on hardware partners, which would seriously complicate an effort to introduce a custom CPU. Xbox is the one place where Microsoft produces its own platform in volumes large enough to warrant custom ASIC development; they rely on contract manufacturers to build it, but the design and finished product is unique to Microsoft.

XCPU from Xbox360-S

Developing a customized ARM processor isn't easy, but it isn't unapproachably difficult either. The current Xbox relies on a PowerPC processor from IBM, but PPC is increasingly being relegated to the very low and very high ends of the market. Embedded controllers don't have the needed processing power, while supercomputer CPUs are too expensive and too hot. Xbox has already changed CPUs once, from an x86 in the original to the current PPC. Microsoft has to be weighing the alternatives of completely funding a suitable PowerPC core design, or switching to a different architecture with more presence in the midrange. Nowadays that means x86 or ARM. I think they've chosen ARM, and previously speculated what a CPU designed specifically for Xbox might look like.

Wednesday, January 5, 2011

Overlays Not Yet Extinct

In 2010 both Ian Lance Taylor and Dave Miller wrote about STT_GNU_IFUNC, an ELF symbol type supported by the GNU compiler and linker. These symbols are functions, but do not appear at a fixed address. Instead STT_GNU_IFUNC is a function which returns an address to the function you actually want to call. STT_GNU_IFUNC solves a common problem in supporting platform variations which are similar enough to use the same binary, but different enough to benefit from specific optimizations. A good example is block copy operations like memcpy(). Depending on the CPU one might be able to use SIMD instructions like SSE or MMX, a block move engine in a cache controller, or one of a number of different unrolled loops optimized for specific CPU pipelines.

STT_GNU_IFUNC is the new hotness, but I'm going to describe a different technique for accomplishing similar functionality using the program loader, i.e. the code which loads program text into memory before execution. This might be a boot ROM environment like Das U-Boot, or in the case of a datapath CPU it might be a process running on a separate control CPU. My use of this technique was the latter case, where a deeply embedded CPU in the datapath had its code loaded into its memory by control software running elsewhere.

The key technique in this scheme uses overlays. Yes, overlays.

Set Wayback Machine to 1974

DEC PDP11/20 illustration of program overlays The typical minicomputer in the 1970s had tens to hundreds of kilobytes of memory. Though programs were dramatically smaller, available RAM was still a significant limitation and practices to maximize RAM efficiency were common. One common technique was overlays: multiple segments of code compiled at the same address to be loaded when needed and replaced later. As only one such code segment could be present in memory at a time, the different overlays could not reference each other and ideally would not need to be swapped very often. A common use was to put initialization-specific code in an overlay and replace it with other code once the init was done.

Note that this was not the same as virtual memory. Some minicomputers of the time had virtual addressing hardware, but it was not universal. Overlays were simply regions of memory which the application would overwrite. Modern CPUs and practices make this use of overlays more difficult, with program text generally mapped read-only and large instruction caches which need to be flushed of stale contents. Overlays as a practice are now practically unknown on any system with virtual memory, we rely on the VM system to kick unneeded code out of RAM.

Overlays Today

The GNU linker nonetheless still has support for overlays. A linker script can specify a group of ELF sections to appear at the same address. The linker provides no help in managing the overlay segments in memory, this is left entirely to the developer. We will use this linker support to provide multiple implementations of an API, tuned for different CPUs.

The first step is to implement the code for each CPU. In this example we'll use something trivial, a function which returns a different integer value for each platform. Two attributes are added to each implementation:

  • noinline - we want to choose which version of the function to load at runtime. This cannot work if the compiler inlines the first one it finds.
  • section - each implementation of the function goes in its own ELF section. This will be discussed later.
int foo_v0() __attribute__((noinline,section(".s_foo_v0")));
int foo_v0() {
  return 0;

int foo_v1() __attribute__((noinline,section(".s_foo_v1")));
int foo_v1() {
  return 1;

int foo_v2() __attribute__((noinline,section(".s_foo_v2")));
int foo_v2() {
  return 2;

illustration of multiple functions in a section Each ELF section can contain exactly one function which is to be called from elsewhere in the code. Defining multiple functions in one ELF section doesn't work with this technique: we rely on placing all versions of the function at the same address. If there are multiple functions they can be at different offsets, so code elsewhere in the program won't have the correct address to call. It would be possible to define multiple functions which are only ever called from other routines in the section, this is left as an exercise to the reader.

Similarly, though the implementations can differ substantially the function signature has to be exactly the same for all variants. They must return the same type and take the same arguments, even if that means some of the variants have an argument which they ignore.

If we examine the object file we can see the sections we defined (other sections omitted for brevity):

$ objdump --section-headers foo.o
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000031  0000000000000000  0000000000000000  00000040  2**2
  3 s_foo_v1      0000000b  0000000000000000  0000000000000000  00000074  2**0
  4 s_foo_v2      0000000b  0000000000000000  0000000000000000  0000007f  2**0
  6 s_foo_v0      0000001d  0000000000000000  0000000000000000  0000009b  2**0


The next step is the crucial one, linking the binary. We're going to use a custom linker script which tells ld to arrange those sections as an overlay. The linker can take only one script as input. To enable our overlay, we must also support everything else the linker normally does for binaries on this platform. If you're not already using a linker script, you need to retrieve the default script for your platform using ld --verbose and look for a SECTIONS block in which to add the handling. A snippet of my linker script is shown here, with the added text bolded.

  ... bunch of stuff ...

  .text : {

  PROVIDE (foo = .);
    .foo_v0 { *(.s_foo_v0) }
    .foo_v1 { *(.s_foo_v1) }
    .foo_v2 { *(.s_foo_v2) }

  .fini : {

We've defined an overlay with three ELF sections as members. NOCROSSREFS means the linker will flag an error if one of the overlay sections references a symbol in one of the other sections.

This script is passed to the linker using a -T argument. If not using a separate linking step, pass "-Wl,-Tld.script" to gcc instead. If we disassemble the resulting binary we see all three routines are linked at the same address:

$ objdump -d ./a.out
00000000004006e0 :
  4006e0: 31 c0                 xor    %eax,%eax
  4006e2: c3                    retq   

00000000004006e0 :
  4006e0: b8 01 00 00 00        mov    $0x1,%eax
  4006e5: c3                    retq   

00000000004006e0 :
  4006e0: b8 02 00 00 00        mov    $0x2,%eax
  4006e5: c3                    retq   

$ nm --numeric-sort ./a.out
0000000000400594 T main
00000000004006e0 A foo
00000000004006e0 T foo_v0
00000000004006e0 T foo_v1
00000000004006e0 T foo_v2

Commence Handwaving

illustration of multiple variants of foo()At this point I have to merely describe what would happen next, as I don't have sample code to show. The target CPU has some mechanism to load its code into memory. It might bootstrap itself using a boot loader, or it might be loaded by an external supervisor CPU. This loader would need to decide which of the overlay sections to load. It might be as simple as a naming convention, for example having platform-specific ELF sections end in "_v#" and loading only those appropriate for the platform.

What we end up with is the platform independent code calling a symbol named foo, at 0x4006e0. That code is not concerned with what will be found at that address. That it contains different code depending on platform has no impact on the callers.


There are several downsides to this technique.

  1. It is cumbersome to support many such functions. Each one requires a new OVERLAY block in the linker script, and a different set of __attributes__ in the code. My recommendation is to only do this where performance is really crucial. For the typical init code or uncommon API, a switch (platform) will be fine.
  2. The debugger has no idea what is going on. If you ask gdb to disassemble this routine it will show the correct instructions, but any source line numbers it prints will be wrong.
  3. Source code management tools also have no idea what is going on. Asking for the definition of foo() will either fail, or turn up the wrong code. The developer has to know which version of code will be used on a given platform.

Nonetheless this technique proved useful to me in the past, and I hope it will be useful to someone in the future.

Closing Thoughts

STT_GNU_IFUNC is a more general solution to this sort of problem, and far less cumbersome to support. There is slightly more overhead to STT_GNU_IFUNC as it involves an extra call to retrieve the address of the function to call, but I suspect even this could someday be alleviated by dynamically rewriting the PLT (Procedure Linkage Table) with the resulting address. If I recall correctly the Solaris linker does rewrite the PLT, it seems a viable technique.