Sunday, April 20, 2008

The Secret Life of Volatile

The C99 specification says the following about the volatile keyword (section 6.7.3):

An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.
 
A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be "optimized out" by an implementation or reordered except as permitted by the rules for evaluating expressions.

Now perhaps that explanation is sufficient for some of you to have the complete picture of when to use the volatile declaration. If so, you can stop reading now. You probably don't need to bother with any of my other articles, either.

Instead of trying to understand what volatile means by starting from the C language level, I find it more illustrative to look at the differences in the assembly code gcc generates. The gcc developers are the type of people who can read that specification and understand its ramifications. By looking at what the compiler produces, we can examine some of the problem areas which the volatile declaration can resolve.


 

Volatile Effect #1: Memory References

Let's start with a trivial example: a function which will busy-wait until a flag variable becomes non-zero. The flag variable would be set by another thread, or an interrupt or signal handler, to allow this function to break out of the loop.

int flag = 0;
void foo(void) {
    while (flag) {
    }
}

The resulting MIPS assembly is:

00: lw v0,0(gp) load address of flag into register v0
04: lw v0,0(v0) load value of flag into register v0
08: bnez v0,8>-< branch if nonzero, to 0x8 (to itself!)
0c: nop No-op, in branch delay slot
10: jr ra jump to return address (return to caller)
14: nop No-op, in branch delay slot

The instructions shown above will load from the flag variable exactly once, into a register. It will then loop until contents of that register become non-zero ... which cannot possibly happen because nothing else is ever loaded into the register. This seems silly, but the compiler is doing exactly what it has been told. In general when there is a loop, you want the loop condition to be stored in a register rather than fetched from memory. Otherwise every "for (i = 0; i < SOME_BIG_NUMBER; i++)" would result in useless memory references to repeatedly fetch the value of i.

This example has another interesting tidbit which will be important in some of the later examples: the branch delay slot. In order to run at higher clock frequencies, processor designs begin working on the next instruction before the previous instruction has completely finished. This is called pipelining. Branch instructions are particularly difficult to pipeline: it takes a while to figure out whether the branch will be taken or not, to figure out what the next instruction will be. Thus MIPS (and many other architectures) have a concept of a branch delay slot: the instruction immediately following the branch will always be executed, regardless of whether the branch is taken or not taken.

Now we'll examine the effect of adding a volatile modifier to the flag:

volatile int flag = 0;
void foo(void)
{
    while (flag) {
    }
}

The resulting MIPS assembly is:

00: lw v0,0(gp)<-+ load address of flag into register v0
04: lw v0,0(v0)  | load value of flag into register v0
08: bnez v0,0--+ branch if nonzero, to 0x0
0c: nop No-op, in branch delay slot
10: jr ra jump to return address (return to caller)
14: nop No-op, in branch delay slot

Now the loop branches back to load the value of flag from memory into the register for each iteration through the loop. If the value of flag is changed by some external entity, the loop will properly terminate. Actually, the loop not only reloads the value of flag from memory, it also reloads the address of flag into the register. Loading the address again is not necessary, I suspect it is the result of something specific to the MIPS backend.

These examples illustrate another interesting tidbit about branch delay slots: it is frequently difficult to find a suitable instruction for them. Not only does it have to be an instruction which will be executed whether the branch is taken or not, its result cannot be involved in deciding the branch as the branch instruction has already executed. In all delay slots in these examples the compiler had to use a NO-OP because it had no useful work to do.


 

Volatile Effect #2: Memory Access Ordering

Lets look at another example, this time involving a pointer to a hardware device:

typedef struct some_hw_regs_s {
    unsigned int words[4];
} some_hw_regs_t;

void foo(void)
{
    some_hw_regs_t  *regs = some_address;

    regs->words[2] = 0x0000dead;
    regs->words[0] = 0x0000f00d;
}

and the resulting MIPS assembly is:

00: li v0,0xf00d load 0xf00d into a staging register
04: li v1,0xdead load 0xdead into a staging register
08: sw v0,0(a0)  store 0xf00d to regs->words[0]
0c: jr ra  return to caller
10: sw v1,8(a0)  store 0xdead to regs->words[2], in branch delay slot

The problem here is more subtle: in the C code we set words[2] first. In the generated assembly, words[0] is set first. The compiler reversed the order of the memory operations. Why did it do so? I have no idea, it may be that storing to incrementing memory addresses is faster on some architectures.

When operating on data structures in RAM, changing the order of writes is generally harmless. Even with multiple threads running on independent processors there should be some form of locking to prevent other threads from seeing the structure in an inconsistent state. However if the structure is actually a memory-mapped hardware device, the order of accesses likely does matter. If the hardware registers are written in the wrong order the device may not function or, even worse, perform the wrong operation.

Now we'll examine the effect of adding the volatile modifier:

typedef struct some_hw_regs_s {
    unsigned int words[4];
} some_hw_regs_t;

void foo(void)
{
    volatile some_hw_regs_t *regs = some_address;
    regs->words[2] = 0x0000dead;
    regs->words[0] = 0x0000f00d;
}

The resulting MIPS assembly is:

00: li v0,0xdead load 0xdead into a staging register
04: li v1,0xf00d load 0xf00d into a staging register
08: sw v0,8(a0)  store 0xdead to regs->words[2]
0c: jr ra return to caller
10: sw v1,0(a0)  store 0xf00d to regs->words[0], in branch delay slot

Now words[2] is written first, exactly as intended.


 

Volatile Effect #3: Write coalescing

One final example, again involving a pointer to a hardware device:

typedef struct some_hw_regs_s {
    unsigned int words[4];
} some_hw_regs_t;

void foo(void)
{
    some_hw_regs_t *regs = some_address;

    regs->words[2] = 0x0000dead;
    regs->words[2] = 0x0000beef;
    regs->words[2] = 0x0000f00d;
}

We're writing three different values in quick succession to the same memory location. This would be a weird thing to do when writing to memory, but when writing to a memory-mapped hardware device it is quite common. Many hardware devices have a command register, which the software writes to for every operation it wishes to initiate.

00: li v0,0xf00d load 0xf00d into register
04: jr ra return to caller
08: sw v0,8(v1) store 0xf00d to memory, in branch delay slot

There is only one store instruction, for 0xf00d. The compiler determined that the first two writes were useless as they are overwritten by the final one, so it optimized them away. This would be fine if writing to memory, but would cause great consternation in a hardware device driver.

As you can probably anticipate by now, adding the volatile modifier to the pointer will resolve the issue:

typedef struct some_hw_regs_s {
    unsigned int words[4];
} some_hw_regs_t;

void foo(void)
{
    volatile some_hw_regs_t *regs = some_address;
    regs->words[2] = 0x0000dead;
    regs->words[2] = 0x0000beef;
    regs->words[2] = 0x0000f00d;
}

The resulting MIPS assembly is:

00: li v0,0xdead load 0xdead into register
04: sw v0,8(a0) store 0xf00d to regs->words[2]
08: li v1,0xbeef load 0xbeef into register
0c: li v0,0xf00d load 0xf00d into another register
10: sw v1,8(a0) store 0xbeef to regs->words[2]
14: jr ra return to caller
18: sw v0,8(a0) store 0xf00d to regs->words[2], in branch delay slot

With a volatile modifier all three values will be written to regs->words[2], in the order they appear in the C code.


 
The Limits of Volatile

A volatile modifier tells the compiler not to optimize away reads or writes to the variable, but there are other aspects of the system you need to be aware of when accessing hardware devices. For example, the CPU data cache should be bypassed when reading from a hardware device, as stale data cached from a previous read is generally not useful. Marking a pointer as volatile does not automatically make it non-cacheable. The compiler has no control over whether a particular address is cacheable or not, this must be specified when creating the memory mapping. For Linux systems, the kernel driver which exports the mmap() entry point has to create a noncacheable mapping.

Similarly, most modern system designs contain write buffering between the CPU and its memory bus. The CPU can perform a write and continue to subsequent instructions before the write has actually made it all the way to its destination. Adding the volatile modifier does not automatically insert the appropriate memory barrier instructions, these must be handled manually by the developer.


 
Update

Commenting on reddit, dmh2000 suggested two other articles on this topic for your edification and bemusement. I found them both interesting, so I'll pass them on here as well:

I also updated the formatting of this article, because it looked terrible in the RSS feed. I'm still figuring out this newfangled CSS thing all the youngsters seem to like.

Update 2/2009: Commenting on a different article, Tom Ziomek pointed out an excellent paper to be published in the Proceedings of the Eighth ACM and IEEE International Conference on Embedded Software. Extensive testing of the volatile keyword against a number of different C compilers shows that very often, incorrect code is generated. Proceed with caution.