Coding Relic: gdb lies to you

Because I'm a terrible programmer, I spend a lot of time in gdb. gdb is a fantastic tool that has (so far) kept me from getting fired, but it has its limitations. Today, Gentle Reader, we will talk about one of the particularly vexing limitations.

Not Always Correct, but Never Uncertain
Have you ever opened a core and immediately spotted the problem: "Oh, foo() was called with a bad argument. That isn't a valid widget ID." So you dig around in the data structures which were used to call foo(), but you cannot see how it could possibly have been called with the argument gdb is showing?

You're right, the program did not call foo with a bizarre argument. gdb is displaying incorrect information. Lets take a simple example:

int testgdb()
{
    return foo(123);
}

int foo(int a)
{
    return bar(456);
}

int bar (int a)
{
    *((int *)NULL) = 0;  /* Deliberately crash */
    return a;
}

foo is called with a constant argument of 123. foo calls bar(456), and bar deliberately segfaults. Now lets examine the backtrace:

(gdb) bt
#0  0x80045a1c in bar (a=456) at foo.c:4
#1  0x80045a30 in foo (a=456) at foo.c:10
#2  0x80045a4c in testgdb () at foo.c:15

Weird, huh? How did foo get called with an argument of 456? The answer, of course, is that it didn't. It was called with 123 as its argument, and gdb is displaying incorrect information.

More MIPS Assembly, Please
Because it is one of my favorite techniques, lets disassemble the object code to see why this happens. This was compiled for MIPS32 using gcc -O2, but with inlining disabled (because we need to have function calls if we're going to talk about function arguments).

We'll start with testgdb():

<testgdb>:
addiu sp,sp,-24	push a stack frame
sw ra,16(sp)	store the return address to the stack
jal <foo>	jump to foo()
li a0,123	pass 123 as the first argument to foo (in delay slot)
lw ra,16(sp)	load return address back from the stack
jr ra	return to the caller
addiu sp,sp,24	pop the stack frame

MIPS passes arguments in CPU registers, and this is crucial to understanding why gdb sometimes displays the arguments incorrectly. Every CPU architecture has a set of "calling conventions" of how the compiler should pass arguments to functions. A long time ago in architectures like VAX and 680x0, arguments would be pushed on the stack. Since then CPUs have gotten dramatically faster while memory speed has improved less rapidly, so modern calling conventions pass arguments in registers whenever possible. The original calling conventions for x86 used the stack but the more recent "fastcall" passes two arguments in registers, while x86_64 uses registers for up to four arguments.

In the assembly example above the argument to foo is loaded into register a0, using a "load immediate" instruction. This instructions happens to be in a delay slot: with MIPS the instruction immediately following a branch is always executed.

<foo>:
addiu sp,sp,-24	push a stack frame
sw ra,16(sp)	save return address to the stack
jal <bar>	jump to bar()
li a0,456	pass 456 as first argument to bar (in delay slot)
lw ra,16(sp)	after returning from bar(), load return address
jr ra	return to the caller
addiu sp,sp,24	pop the stack frame

foo() pushes a stack frame and loads 456 into a0 in order to call bar(). The crucial thing to notice is that the current value of a0 was not saved anywhere before overwriting it with 456. The compiler determined that the original value was no longer required for any further computation, and did not need to be saved. So the "123" value stored in a0 has been irrevocably lost. foo next calls bar, and crashes.

In the gdb backtrace the function arguments will be displayed from the stack if they are available. Otherwise, gdb displays the values of the argument registers and hopes for the best. The arguments in the backtrace are mostly correct because gdb can often pull them from the stack. When compiled -O0 functions always save their argument registers to the stack, so it is only optimized code which can make debugging difficult in this way.

Et tu, gdb?

What can one do about this? Mainly be aware that what the debugger shows is not always correct. I believe this is good practice in general, always be a little bit suspicious of the tools.

Personally I'd find it more helpful if gdb were to display "<argument unknown>" rather than an incorrect value, if it can determine that the argument register has been re-used. I've tried digging through the gdb sources to see if this could be improved, but got lost in mips-tdep.c (reference first paragraph about me being a terrible programmer). Information about register re-use is typically not included in the compiler's debugging information anyway, so gdb may not have sufficient information to know when a register still holds the correct argument value.

Wednesday, July 2, 2008

gdb lies to you