Coding Relic: C Programming


Tuesday, July 12, 2011

CodeSynthesis XSD Data Binding

Nowadays I make a habit of writing up how to use particular tools or techniques for anything which might be useful to reference later. Many techniques I worked on before starting this practice are now lost to me, locked away in proprietary source code at some previous employer.

This post concerns data binding from XML schemas in C++, generating classes rather than manipulating the underlying XML. As it's written for Future Me, it might not be so interesting to those who are not Future Me.

Consider the simple XML schema shown below. I aspire to be the Evil Overlord, and am working on the HR system to keep track of my innumerable minions.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="minion">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="rank" type="xs:string"/>
      <xs:element name="serial" type="xs:positiveInteger"/>
    </xs:sequence>
    <xs:attribute name="loyalty" type="xs:float" use="required"/>
  </xs:complexType>
</xs:element>

</xs:schema>

It would be possible to parse documents created from this schema manually, using something like libexpat or Xerces. Unfortunately as the schema becomes large, the likelihood of mistakes in this manual process becomes overwhelming.

I chose instead to work with CodeSynthesis XSD to generate classes from the schema, based mainly on the Free/Libre Open Source Software Exception in their license. This project will eventually be released under an Apache-style license, and all other data binding solutions I found for C++ were either GPL or a commercial license.

Parsing from XML

The generated code provides a number of function prototypes to parse XML from various sources, including iostreams.

std::istringstream agent_smith(
  "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>"
  "<minion xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
  "xsi:noNamespaceSchemaLocation=\"schema.xsd\" loyalty=\"0.2\">"
  "<name>Agent Smith</name>"
  "<rank>Member of Minion Staff</rank>"
  "<serial>2</serial>"
  "</minion>");
std::auto_ptr<minion> m(NULL);

try {
  m = minion_(agent_smith);
} catch (const xml_schema::exception& e) {
  std::cerr << e << std::endl;
  return;
}

The minion object now contains data members with proper C++ types for each XML node and attribute.

std::cout << "Name: " << m->name() << std::endl
          << "Loyalty: " << m->loyalty() << std::endl
          << "Rank: " << m->rank() << std::endl
          << "Serial number: " << m->serial() << std::endl;

Serialization to XML

Methods to serialize an object to XML are not generated by default; the --generate-serialization flag has to be passed to xsdcxx. This emits another series of minion_ functions, which take an output stream and the object to serialize.

int main() {
  minion m("Salacious Crumb", "Senior Lackey", 1, 0.9);
  minion_(std::cout, m);
}

This sends the XML to stdout.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<minion loyalty="0.9">
  <name>Salacious Crumb</name>
  <rank>Senior Lackey</rank>
  <serial>1</serial>
</minion>

CodeSynthesis XSD relies on Xerces-C++ to provide the lower layer XML handling, so all of the functionality of that library is also available to the application.

That's enough for now. See you later, Future Me.

Thursday, February 3, 2011

Listing Processes with libproc

I recently had to work on functionality to look through /proc/<pid> for information about processes, which would have entailed an annoying amount of file schlepping and string parsing. Fortunately there is procps, a very nice library which makes /proc access work very much like directory access via opendir(). It normalizes the procfs implementations of a number of OSes like Linux and Solaris, so you work with a common data structure and don't have to maintain a bunch of parsing code. However it doesn't abstract /proc very much: you still need to know what it is and what information is in each /proc file to make good use of the facility.

You start with a call to openproc(), which creates a PROCTAB* structure to iterate through running processes.

#include <proc/readproc.h>

int main(int argc, char** argv) {
  PROCTAB* proc = openproc(PROC_FILLMEM | PROC_FILLSTAT | PROC_FILLSTATUS);

A flag argument to openproc() tells it what kind of information you want. The library will skip processing files in /proc if it can.

PROC_FILLMEM	read /proc/<pid>/statm
PROC_FILLCOM	allocate and populate `cmdline'
PROC_FILLENV	allocate and populate `environ'
PROC_FILLUSR	look up user id number, fill in user name
PROC_FILLGRP	look up group id number, fill in group name
PROC_FILLSTATUS	read /proc/<pid>/status
PROC_FILLSTAT	read /proc/<pid>/stat
PROC_FILLWCHAN	look up the WCHAN (wait channel) function name
PROC_FILLARG	handled identically to PROC_FILLCOM
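For a sense of what PROC_FILLSTAT spares you, here is a rough sketch of parsing one /proc/<pid>/stat line by hand. StatFields and parse_stat_line() are names I made up for illustration, not anything procps exports, and the field positions assume the Linux stat layout: pid, command name in parentheses, state, ppid, and so on, with utime and stime as the 14th and 15th fields.

```cpp
#include <cstdio>
#include <string>

// Fields pulled out of one /proc/<pid>/stat line. The struct and function
// names are invented for this example; they are not part of procps.
struct StatFields {
  int pid;
  std::string comm;
  char state;
  int ppid;
  unsigned long long utime;  // clock ticks spent in user mode
  unsigned long long stime;  // clock ticks spent in kernel mode
};

// Returns true and fills *out if the line parses. The command name is
// wrapped in parentheses and may itself contain spaces or ')', so we split
// on the LAST ')' rather than on whitespace.
bool parse_stat_line(const std::string& line, StatFields* out) {
  std::string::size_type open = line.find('(');
  std::string::size_type close = line.rfind(')');
  if (open == std::string::npos || close == std::string::npos ||
      close < open) {
    return false;
  }
  if (std::sscanf(line.c_str(), "%d", &out->pid) != 1) {
    return false;
  }
  out->comm = line.substr(open + 1, close - open - 1);

  // After the ")": state ppid, then nine fields we skip (pgrp, session,
  // tty_nr, tpgid, flags, minflt, cminflt, majflt, cmajflt), then
  // utime and stime.
  int n = std::sscanf(line.c_str() + close + 1,
                      " %c %d %*s %*s %*s %*s %*s %*s %*s %*s %*s %llu %llu",
                      &out->state, &out->ppid, &out->utime, &out->stime);
  return n == 4;
}
```

Multiply this by statm, status, and cmdline, each with its own quirks, and letting the library do it starts to look very attractive.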

The PROCTAB is repeatedly passed to readproc(), which populates a proc_t for each running process.

proc_t proc_info;
memset(&proc_info, 0, sizeof(proc_info));
while (readproc(proc, &proc_info) != NULL) {
  printf("%20s:\t%5ld\t%5llu\t%5llu\n",
         proc_info.cmd, proc_info.resident,
         proc_info.utime, proc_info.stime);
}

When done, call closeproc() to release resources.

closeproc(proc);

Some sample output from my system:

  process:  pages utime   stime
   xinetd:    139       1       0
     sshd:    866      10      21
     bash:   1377      28      16
ssh-agent:    208      11       3
  portmap:    158       1       4
rpc.statd:    208       1       0

proc_t

The proc_t contains a great deal of information about the process.

typedef struct proc_t {
  int
    tid,         // (special)     task id, the POSIX thread ID (see also: tgid)
    ppid;        // stat,status   pid of parent process

  unsigned long long
    utime,       // stat          user-mode CPU time accumulated by process
    stime,       // stat          kernel-mode CPU time accumulated by process
    cutime,      // stat          cumulative utime of process and reaped children
    cstime,      // stat          cumulative stime of process and reaped children
    start_time;  // stat          start time of process -- seconds since 1-1-70

  long
    priority,    // stat          kernel scheduling priority
    nice,        // stat          standard unix nice level of process
    rss,         // stat          resident set size from /proc/#/stat (pages)
  ...etc...

The proc_t also contains this maddening little member variable:

  unsigned pcpu; // stat          %CPU usage (is not filled in by readproc)

Instantaneous CPU percentage is commonly desired, but is not tracked by the kernel and is therefore not available anywhere procps can read. Tracking a percentage has to be implemented in the application by taking a snapshot, waiting a little while, and taking another snapshot to learn the utime+stime spent during the interval. This is the reason why top shows all CPU percentages as 0.0% when it starts, and corrects them on the next interval. procps provides a convenient place to store the CPU percentage, but does not implement it in the library.
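The snapshot arithmetic can be sketched as follows. CpuSample and cpu_percent() are hypothetical helpers of my own, not part of procps, and I'm assuming utime and stime are counted in clock ticks, with the tick rate coming from sysconf(_SC_CLK_TCK).

```cpp
#include <cassert>

// One reading of utime and stime for a process, in clock ticks.
// Hypothetical helper type for this sketch; not part of procps.
struct CpuSample {
  unsigned long long utime;
  unsigned long long stime;
};

// Percentage of one CPU used between two samples taken interval_seconds
// apart. ticks_per_second is assumed to be what sysconf(_SC_CLK_TCK)
// reports on the system where the samples were taken.
double cpu_percent(const CpuSample& before, const CpuSample& after,
                   double interval_seconds, long ticks_per_second) {
  assert(interval_seconds > 0.0 && ticks_per_second > 0);
  unsigned long long used =
      (after.utime + after.stime) - (before.utime + before.stime);
  double seconds_on_cpu = static_cast<double>(used) / ticks_per_second;
  return 100.0 * seconds_on_cpu / interval_seconds;
}
```

With only one snapshot in hand, before and after are the same sample and the result is 0.0, which is all a tool can report until the second interval comes around.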

Thursday, January 27, 2011

C++ POD Member Handling

I always mess up the initialization of plain old data fields in C++. Always. Maybe by writing it up, I'll finally get it right.

Plain Old Data is essentially anything a regular C compiler could compile. That is:

integer and floating point numbers (including bool, though it isn't in C)
enums
pointers, including pointers to objects and pointers to functions
some aggregate data structures (structs, unions, and classes)

A struct, union, or class is treated as plain old data if it has only the default constructor and destructor, has no protected or private member variables, does not inherit from a base class, and has no virtual functions. I suspect most C++ programmers have an intuitive feel for when a class behaves like an object and when it's just a collection of data. That intuition is pretty good in this case.

Leaving a POD member out of the constructor's initializer list leaves it uninitialized. Listing it there with empty parentheses value-initializes it, which sets it to zero.

Code	Result
class Foo { public: Foo() {} int a_; };	a_ is uninitialized.
class Foo { public: Foo() : a_() {} int a_; };	a_ is zeroed. Were it a structure, the entire thing would be zero.

People are often confused by the first point, that member POD fields are left uninitialized unless specifically listed in the initializer list. This is not the same as for member objects, which call the default constructor. Making this even more confusing, when a process starts up any pages it gets from the OS will be zeroed out to prevent information leakage. So if you look at the first few objects allocated, there is a better than average chance that all of the member variables will be zero. Unfortunately once the process has run for a while and dirtied some of its own pages, it will start getting objects where the POD variables contain junk.
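A small sketch for Future Me to convince himself of this. It constructs the object on top of memory deliberately filled with junk, standing in for a page the process has already dirtied. Uninit, Zeroed, and zeroed_member_on_dirty_memory() are names made up for the example. Only the zeroed case is actually read back, because reading the uninitialized member would be undefined behavior.

```cpp
#include <cstring>
#include <new>

// a_ is not in the initializer list: it is left uninitialized, so
// reading it before assigning to it is undefined behavior.
class Uninit { public: Uninit() {} int a_; };

// a_() value-initializes the member, which zeroes it.
class Zeroed { public: Zeroed() : a_() {} int a_; };

// Build a Zeroed on top of memory filled with 0xAB bytes and report
// what its member holds afterwards.
int zeroed_member_on_dirty_memory() {
  void* raw = ::operator new(sizeof(Zeroed));
  std::memset(raw, 0xAB, sizeof(Zeroed));
  Zeroed* z = new (raw) Zeroed;  // placement new runs the constructor
  int value = z->a_;
  z->~Zeroed();
  ::operator delete(raw);
  return value;
}
```

The function returns 0 no matter what was in the memory beforehand. Swap in Uninit and all bets are off.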

Struct initializers

Putting a POD struct in the initializer list results in zeroing the struct. If the struct needs to contain non-zero data, C++0x adds a useful capability:

struct bar {
  int y;
  int z;
};

class Foo {
 public:
  Foo() : b_({1, 2}) {}
  struct bar b_;
};

Recent versions of gcc implement this handling, though a warning will be issued unless the -std=c++0x or -std=gnu++0x command line flag is given.
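C++0x has since been standardized as C++11. A sketch of two spellings I would expect a C++11 compiler to accept: braces directly in the initializer list, and a default member initializer in the class body. FooBraced and FooDefault are names invented for this example.

```cpp
struct bar {
  int y;
  int z;
};

// Braces directly on the member in the initializer list.
class FooBraced {
 public:
  FooBraced() : b_{1, 2} {}
  bar b_;
};

// A default member initializer, applied by any constructor which does
// not list b_ itself.
class FooDefault {
 public:
  bar b_ = {3, 4};
};
```

The second form has the advantage that adding another constructor later cannot forget to initialize the struct.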

Wednesday, January 5, 2011

Overlays Not Yet Extinct

In 2010 both Ian Lance Taylor and Dave Miller wrote about STT_GNU_IFUNC, an ELF symbol type supported by the GNU compiler and linker. These symbols are functions, but do not appear at a fixed address. Instead STT_GNU_IFUNC is a function which returns an address to the function you actually want to call. STT_GNU_IFUNC solves a common problem in supporting platform variations which are similar enough to use the same binary, but different enough to benefit from specific optimizations. A good example is block copy operations like memcpy(). Depending on the CPU one might be able to use SIMD instructions like SSE or MMX, a block move engine in a cache controller, or one of a number of different unrolled loops optimized for specific CPU pipelines.
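The idea of a function which returns the address of the function you actually want can be sketched in portable C++ with a plain function pointer. To be clear, this is an analogy and not the real STT_GNU_IFUNC mechanism, where the dynamic linker runs the resolver for you. copy_fn, copy_bytewise(), copy_libc(), and resolve_copy() are all invented for the illustration, and the bool argument stands in for whatever CPU feature test a real resolver would perform.

```cpp
#include <cstddef>
#include <cstring>

typedef void* (*copy_fn)(void* dst, const void* src, std::size_t n);

// Stand-in for a plain implementation which works on every CPU.
void* copy_bytewise(void* dst, const void* src, std::size_t n) {
  unsigned char* d = static_cast<unsigned char*>(dst);
  const unsigned char* s = static_cast<const unsigned char*>(src);
  for (std::size_t i = 0; i < n; ++i) {
    d[i] = s[i];
  }
  return dst;
}

// Stand-in for a variant tuned to a specific CPU.
void* copy_libc(void* dst, const void* src, std::size_t n) {
  return std::memcpy(dst, src, n);
}

// The resolver: called once, it returns the address of the
// implementation which callers should use from then on.
copy_fn resolve_copy(bool cpu_has_fast_path) {
  return cpu_has_fast_path ? copy_libc : copy_bytewise;
}
```

A caller would stash the result of resolve_copy() at startup and call through the pointer afterwards.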

STT_GNU_IFUNC is the new hotness, but I'm going to describe a different technique for accomplishing similar functionality using the program loader, i.e. the code which loads program text into memory before execution. This might be a boot ROM environment like Das U-Boot, or in the case of a datapath CPU it might be a process running on a separate control CPU. My use of this technique was the latter case, where a deeply embedded CPU in the datapath had its code loaded into its memory by control software running elsewhere.

The key technique in this scheme uses overlays. Yes, overlays.

Set Wayback Machine to 1974

The typical minicomputer in the 1970s had tens to hundreds of kilobytes of memory. Though programs were dramatically smaller, available RAM was still a significant limitation and practices to maximize RAM efficiency were common. One common technique was overlays: multiple segments of code compiled at the same address to be loaded when needed and replaced later. As only one such code segment could be present in memory at a time, the different overlays could not reference each other and ideally would not need to be swapped very often. A common use was to put initialization-specific code in an overlay and replace it with other code once the init was done.

Note that this was not the same as virtual memory. Some minicomputers of the time had virtual addressing hardware, but it was not universal. Overlays were simply regions of memory which the application would overwrite. Modern CPUs and practices make this use of overlays more difficult, with program text generally mapped read-only and large instruction caches which need to be flushed of stale contents. Overlays as a practice are now practically unknown on any system with virtual memory; we rely on the VM system to kick unneeded code out of RAM.

Overlays Today

The GNU linker nonetheless still has support for overlays. A linker script can specify a group of ELF sections to appear at the same address. The linker provides no help in managing the overlay segments in memory; this is left entirely to the developer. We will use this linker support to provide multiple implementations of an API, tuned for different CPUs.

The first step is to implement the code for each CPU. In this example we'll use something trivial, a function which returns a different integer value for each platform. Two attributes are added to each implementation:

noinline - we want to choose which version of the function to load at runtime. This cannot work if the compiler inlines the first one it finds.
section - each implementation of the function goes in its own ELF section. This will be discussed later.

int foo_v0() __attribute__((noinline,section(".s_foo_v0")));
int foo_v0() {
  return 0;
}

int foo_v1() __attribute__((noinline,section(".s_foo_v1")));
int foo_v1() {
  return 1;
}

int foo_v2() __attribute__((noinline,section(".s_foo_v2")));
int foo_v2() {
  return 2;
}

Each ELF section can contain exactly one function which is to be called from elsewhere in the code. Defining multiple functions in one ELF section doesn't work with this technique: we rely on placing all versions of the function at the same address. If there are multiple functions they can be at different offsets, so code elsewhere in the program won't have the correct address to call. It would be possible to define multiple functions which are only ever called from other routines in the section; this is left as an exercise to the reader.

Similarly, though the implementations can differ substantially the function signature has to be exactly the same for all variants. They must return the same type and take the same arguments, even if that means some of the variants have an argument which they ignore.

If we examine the object file we can see the sections we defined (other sections omitted for brevity):

$ objdump --section-headers foo.o
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000031  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 s_foo_v1      0000000b  0000000000000000  0000000000000000  00000074  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  4 s_foo_v2      0000000b  0000000000000000  0000000000000000  0000007f  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  6 s_foo_v0      0000001d  0000000000000000  0000000000000000  0000009b  2**0
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

OVERLAY : NOCROSSREFS

The next step is the crucial one, linking the binary. We're going to use a custom linker script which tells ld to arrange those sections as an overlay. The linker can take only one script as input. To enable our overlay, we must also support everything else the linker normally does for binaries on this platform. If you're not already using a linker script, you need to retrieve the default script for your platform using ld --verbose and look for a SECTIONS block in which to add the handling. A snippet of my linker script is shown here; the added text is the PROVIDE line and the OVERLAY block which follows it.

SECTIONS
{
  ... bunch of stuff ...

  .text : {
    ...
  }

  PROVIDE (foo = .);
  OVERLAY : NOCROSSREFS
  {
    .foo_v0 { *(.s_foo_v0) }
    .foo_v1 { *(.s_foo_v1) }
    .foo_v2 { *(.s_foo_v2) }
  }

  .fini : {
    ...
  }

We've defined an overlay with three ELF sections as members. NOCROSSREFS means the linker will flag an error if one of the overlay sections references a symbol in one of the other sections.

This script is passed to the linker using a -T argument. If not using a separate linking step, pass "-Wl,-Tld.script" to gcc instead. If we disassemble the resulting binary we see all three routines are linked at the same address:

$ objdump -d ./a.out
00000000004006e0 <foo_v0>:
  4006e0: 31 c0                 xor    %eax,%eax
  4006e2: c3                    retq

00000000004006e0 <foo_v1>:
  4006e0: b8 01 00 00 00        mov    $0x1,%eax
  4006e5: c3                    retq

00000000004006e0 <foo_v2>:
  4006e0: b8 02 00 00 00        mov    $0x2,%eax
  4006e5: c3                    retq


$ nm --numeric-sort ./a.out
0000000000400594 T main
00000000004006e0 A foo
00000000004006e0 T foo_v0
00000000004006e0 T foo_v1
00000000004006e0 T foo_v2

Commence Handwaving

At this point I have to merely describe what would happen next, as I don't have sample code to show. The target CPU has some mechanism to load its code into memory. It might bootstrap itself using a boot loader, or it might be loaded by an external supervisor CPU. This loader would need to decide which of the overlay sections to load. It might be as simple as a naming convention, for example having platform-specific ELF sections end in "_v#" and loading only those appropriate for the platform.
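The naming convention check itself is easy enough to sketch, even if the rest of the loader is handwaving. should_load_section() is a hypothetical helper, and the rule that any section without a "_v" plus digits suffix is platform independent is my own assumption for the example.

```cpp
#include <cstdlib>
#include <string>

// Decide whether a loader running for `platform` should load the ELF
// section called `name`. Sections which do not end in "_v<digits>" are
// treated as platform independent and are always loaded.
// Hypothetical helper; the convention is an assumption of this sketch.
bool should_load_section(const std::string& name, int platform) {
  std::string::size_type pos = name.rfind("_v");
  if (pos == std::string::npos || pos + 2 >= name.size()) {
    return true;  // no version suffix at all
  }
  std::string digits = name.substr(pos + 2);
  if (digits.find_first_not_of("0123456789") != std::string::npos) {
    return true;  // "_v" was part of an ordinary name
  }
  return std::atoi(digits.c_str()) == platform;
}
```

For platform 1 this would accept .text and .foo_v1 while skipping .foo_v0 and .foo_v2.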

What we end up with is the platform independent code calling a symbol named foo, at 0x4006e0. That code is not concerned with what will be found at that address. That it contains different code depending on platform has no impact on the callers.

Downsides

There are several downsides to this technique.

It is cumbersome to support many such functions. Each one requires a new OVERLAY block in the linker script, and a different set of __attribute__ declarations in the code. My recommendation is to only do this where performance is really crucial. For the typical init code or uncommon API, a switch (platform) will be fine.
The debugger has no idea what is going on. If you ask gdb to disassemble this routine it will show the correct instructions, but any source line numbers it prints will be wrong.
Source code management tools also have no idea what is going on. Asking for the definition of foo() will either fail, or turn up the wrong code. The developer has to know which version of code will be used on a given platform.

Nonetheless this technique proved useful to me in the past, and I hope it will be useful to someone in the future.

Closing Thoughts

STT_GNU_IFUNC is a more general solution to this sort of problem, and far less cumbersome to support. There is slightly more overhead to STT_GNU_IFUNC as it involves an extra call to retrieve the address of the function to call, but I suspect even this could someday be alleviated by dynamically rewriting the PLT (Procedure Linkage Table) with the resulting address. If I recall correctly the Solaris linker does rewrite the PLT, so it seems a viable technique.

Thursday, December 16, 2010

Code Snippet: libarchive

libarchive is a library to handle tar, zip, cpio, pax, and many other archive formats. It uses a "walk through the archive" programming model, generally eschewing random access. Diving straight into it, we'll open a tar archive and list the files therein.

#include <archive.h>
#include <archive_entry.h>

archive.h contains the APIs for working with archives, while archive_entry.h deals with files within the archive.

struct archive* archive = archive_read_new();
assert(archive != NULL);

archive_read_new() allocates the data structure to read an archive. It is only allocated in memory, and does not open a file on disk or tape. Later we'll open the file and associate it with the data structure.

if ((archive_read_support_compression_all(archive) != ARCHIVE_OK) ||
    (archive_read_support_format_all(archive) != ARCHIVE_OK)) {
  archive_read_finish(archive);
  // Error handling
}

There are a series of APIs like archive_read_support_compression_bzip2() or archive_read_support_format_tar() which can restrict the set of allowed formats, but here we set both the compression filter and format to anything libarchive supports. libarchive relies on external libraries for some things, such as libz for gzip, so the choices when building libarchive will restrict the formats it can support.

if (archive_read_open_filename(archive, "foobar.tgz", 8192) != ARCHIVE_OK) {
  archive_read_finish(archive);
  // Error handling
}

Here we've asked libarchive to open a file by name. There are also archive_read_open_FILE() and archive_read_open_fd() APIs to pass in a FILE* or file descriptor, respectively.

"8192" is the block size, which is used for a few archive formats like tar. Nonetheless libarchive does a good job of determining the real block size if it is incorrect. There is mention of removing the block size parameter in a future version of the library and relying solely on inferring it from the file.

struct archive_entry *entry;
while (archive_read_next_header(archive, &entry) == ARCHIVE_OK) {
  printf("file = %s\n", archive_entry_pathname(entry));
}

This is the main point of the routine: iterate through the entries in the file printing filenames, skipping over the data in between. Many archive formats lack a complete table of contents, instead allowing appends to extend the archive ad hoc. archive_read_next_header() will often have to seek through the file to find the next entry. If the file is located on a remote filesystem, this can be slow.

  archive_read_finish(archive);

When we're done, archive_read_finish() frees the resources allocated by archive_read_new().
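To see why listing an archive means walking through it, here is a toy sketch of the tar layout: a 512-byte header, the file data padded out to a multiple of 512 bytes, then the next header, with no index anywhere. This is not how libarchive is implemented, just an illustration of the format. append_entry() and list_names() are invented for the example, and the headers it builds carry only a name and a size, which a real tar reader would reject for lack of a checksum.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

const std::size_t kBlock = 512;  // tar stores everything in 512-byte blocks

// Append one file to an in-memory tar image. Only the name (offset 0,
// 100 bytes) and size (offset 124, 12 bytes of octal text) are filled in.
void append_entry(std::string* tar, const std::string& name,
                  const std::string& data) {
  char header[kBlock];
  std::memset(header, 0, sizeof(header));
  std::size_t name_len = (name.size() < 99) ? name.size() : 99;
  std::memcpy(header, name.data(), name_len);
  std::snprintf(header + 124, 12, "%011lo",
                static_cast<unsigned long>(data.size()));
  tar->append(header, kBlock);
  tar->append(data);
  tar->append((kBlock - data.size() % kBlock) % kBlock, '\0');  // padding
}

// Walk the image header by header, skipping over the data in between.
// A header whose name starts with a NUL byte marks the end of the archive.
std::vector<std::string> list_names(const std::string& tar) {
  std::vector<std::string> names;
  std::size_t pos = 0;
  while (pos + kBlock <= tar.size() && tar[pos] != '\0') {
    const char* header = tar.data() + pos;
    const void* nul = std::memchr(header, '\0', 100);
    std::size_t len =
        nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - header)
            : 100;
    names.push_back(std::string(header, len));
    unsigned long size =
        std::strtoul(std::string(header + 124, 12).c_str(), NULL, 8);
    std::size_t data_blocks = (size + kBlock - 1) / kBlock;
    pos += kBlock * (1 + data_blocks);  // this header plus its data
  }
  return names;
}
```

The only way to find entry N is to read the size out of entries 0 through N-1, which is exactly the seeking that makes listing slow on a remote filesystem.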

Reading File Contents

To extract a file from the archive you first iterate through archive_read_next_header() until you find one with the filename you want. I'll skip the code which does this as it is identical to that shown above, and start from the point where *entry points to the file we want.

size_t total = archive_entry_size(entry);
char buf[MY_BUF_SIZE];
size_t len_to_read = (total < sizeof(buf)) ? total : sizeof(buf);
ssize_t size = archive_read_data(archive, buf, len_to_read);
if (size <= 0) {
  // Error handling
}

archive_read_data() reads the content of *entry into a buffer. There are several variations such as archive_read_data_block(), which additionally reports the offset of each block it returns, and archive_read_extract(), which reads data and writes it to a file on disk.

Writing Files

Writing to an archive uses a similar set of APIs as reading.

  struct archive* archive = archive_write_new();
  assert(archive != NULL);

archive_write_new() allocates the data structure to track an archive. It does not create anything on disk.

if ((archive_write_set_compression_gzip(archive) != ARCHIVE_OK) ||
    (archive_write_set_format_ustar(archive) != ARCHIVE_OK) ||
    (archive_write_open_filename(archive, "foobar.tgz") != ARCHIVE_OK)) {
  // Error handling
}

Where the read APIs allow "all" as a choice, writing an entry requires you to pick a format. Here I've chosen a tar.gz, and written it to foobar.tgz.

struct archive_entry* entry = archive_entry_new();
assert(entry != NULL);

struct timespec ts;
// Not wrapped in assert(): the call would vanish when built with NDEBUG.
if (clock_gettime(CLOCK_REALTIME, &ts) != 0) {
  // Error handling
}

archive_entry_set_pathname(entry, filename);
archive_entry_set_size(entry, contents_len);
archive_entry_set_filetype(entry, AE_IFREG);
archive_entry_set_perm(entry, 0444);
archive_entry_set_atime(entry, ts.tv_sec, ts.tv_nsec);
archive_entry_set_birthtime(entry, ts.tv_sec, ts.tv_nsec);
archive_entry_set_ctime(entry, ts.tv_sec, ts.tv_nsec);
archive_entry_set_mtime(entry, ts.tv_sec, ts.tv_nsec);

Here we create the metadata for a file in the archive, populating it with permissions and timestamps. Not all archive formats support all of these timestamps, but it seems a good idea to populate them in case a different format is chosen later.

int rc = archive_write_header(archive, entry);
archive_entry_free(entry);
entry = NULL;
if (ARCHIVE_OK != rc) {
  // Error handling
}

Once the metadata has been written to the archive, the archive_entry is no longer needed.

size_t written = archive_write_data(archive, contents, contents_len);
if (written != contents_len) {
  // Error handling
}

archive_write_finish(archive);

Finally, we write the data. contents is a pointer to a buffer in memory, contents_len is its length in bytes. archive_write_data() can be called multiple times, each will append its contents at the end of the last. There is no random access API with an offset parameter.

Closing Thoughts

libarchive APIs are designed to allow use with either disk or tape. There are no APIs to overwrite bytes in the middle of a file, because tape drives cannot do that without corrupting adjacent data. There is an alternate set of APIs designed for disk in archive_read_disk and archive_write_disk, though I see relatively little difference in them other than accessing the uid/gid of the archive itself.

I hope you find this useful.

Tuesday, November 23, 2010

Code Snippet: ctemplate

Content management systems like Django typically do not embed HTML strings directly in code. They separate the presentation of the data out from the code which assembles the data by using templates. Here is an example Django template taken from a small App Engine project of mine:

<div class="resultsSectionItems">
{% for comment in friend.comments|slice:":29" %}
  <div><a href="http://friendfeed.com/e/{{ comment.entryObj.entryId }}" ... etc
  <span class="commentText">{{ comment.commentText }}</span>
  </div>

Each "{%" block is a template command. This template iterates through comments, creating links.

Templates maintain a separation of responsibilities. The code prepares data structures populated with the data to display. The template iterates over those structures, generating and formatting output. Templates are widespread within content management systems, but they can also be useful in embedded systems work. Some examples:

Presenting common system data to CLI, embedded web server, and SNMP backends.
Allowing an OEM to customize the output to include their logos and branding, without having to change code.
Easier support for multiple languages, as most text should be in templates not code. Templates also tend to compress well, lowering the footprint of internationalization.
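To make the split between code and presentation concrete, here is a toy sketch of the simplest thing a template engine does: replacing {{NAME}} markers with values the code has prepared. expand() is invented for this example and handles nothing beyond plain variables. Dropping unknown names silently is just a choice I made for the sketch.

```cpp
#include <map>
#include <string>

// Replace each {{NAME}} in tpl with values[NAME]. Names missing from
// the map expand to nothing. Toy code for illustration only.
std::string expand(const std::string& tpl,
                   const std::map<std::string, std::string>& values) {
  std::string out;
  std::string::size_type pos = 0;
  while (pos < tpl.size()) {
    std::string::size_type open = tpl.find("{{", pos);
    std::string::size_type close =
        (open == std::string::npos) ? std::string::npos
                                    : tpl.find("}}", open + 2);
    if (close == std::string::npos) {
      out.append(tpl, pos, std::string::npos);  // no more complete markers
      break;
    }
    out.append(tpl, pos, open - pos);  // literal text before the marker
    std::string key = tpl.substr(open + 2, close - open - 2);
    std::map<std::string, std::string>::const_iterator it = values.find(key);
    if (it != values.end()) {
      out += it->second;
    }
    pos = close + 2;
  }
  return out;
}
```

The code only ever fills in the map. Changing the wording, the markup, or the language of the output means editing the template text, not recompiling.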

Most CMSes are written in Python/Ruby/Java/Perl or other high level languages. There is an open source C++ templating package by Craig Silverstein at Google called ctemplate. Here is an example which produces a portion of the Apache httpd.conf file based on internal configuration data:

# This file is autogenerated from configuration. Changes will be lost
# after the next config change.

{{#DIR}}<Directory {{PATH}}>
{{#OPTIONS}}  Options {{#OPT}}{{VAL}} {{/OPT}}{{/OPTIONS}}
  Order {{ORDER}}
</Directory>
{{/DIR}}

The code which populates variables in the dictionaries, namely the calls to AddSectionDictionary(), SetValue(), and SetValueAndShowSection() in the example below, is the key point of using ctemplate.

#include <assert.h>
#include <ctemplate/template.h>
#include <iostream>
#include <list>

void apache_example() {
  // Apache <Directory> blocks to create
  struct ApacheDir {
    const char* path;
    std::list<const char*> options;
    bool deny;
  } apache_dirs[] = {
    {"/var/www", {"FollowSymLinks"}, false},
    {"/SecretFeature", {"ExecCGI", "-Indexes"}, true},
    {"/Tetris", {}, false}
  };

  ctemplate::TemplateDictionary dict("APACHE_EXAMPLE");
  int num_dirs = sizeof(apache_dirs) / sizeof(apache_dirs[0]);
  for (int i = 0; i < num_dirs; ++i) {
    struct ApacheDir* entry = &apache_dirs[i];
    ctemplate::TemplateDictionary* sub_dict = dict.AddSectionDictionary("DIR");

    assert(entry->path != NULL);
    sub_dict->SetValue("PATH", entry->path);

    std::list<const char*>::const_iterator li;
    for (li = entry->options.begin(); li != entry->options.end(); ++li) {
      sub_dict->SetValueAndShowSection("OPT", *li, "OPTIONS");
    }
    sub_dict->SetValue("ORDER", (entry->deny ? "deny,allow" : "allow,deny"));
  }

  std::string output;
  ctemplate::ExpandTemplate("apache.tpl", ctemplate::DO_NOT_STRIP,
                            &dict, &output);
  std::cout << output << std::endl;
}

The example shows some interesting features beyond simple variable substitution. The OPTIONS section is only displayed if there are options present, by using SetValueAndShowSection() in the code. The output of running this code is:

# This file is autogenerated from configuration. Changes will be lost
# after the next config change.

<Directory /var/www>
  Options FollowSymLinks 
  Order allow,deny
</Directory>

<Directory /SecretFeature>
  Options ExecCGI -Indexes 
  Order deny,allow
</Directory>

<Directory /Tetris>

  Order allow,deny
</Directory>

Like many other templating systems, ctemplate can apply modifiers to expanded variables. The builtin modifiers mostly concern escaping of HTML, XML, or JSON to avoid common security issues like cross site scripting. It is possible to supply additional variable modifiers by subclassing ctemplate::TemplateModifier. The App Engine example at the top of this article pipes variables through a slice statement to truncate strings to a specific length. We can create equivalent functionality for ctemplate by subclassing the TemplateModifier. The Modify() method is the key part of the implementation.

class MaxlenModifier : public ctemplate::TemplateModifier {
  virtual void Modify(const char* in, size_t inlen,
                      const ctemplate::PerExpandData* per_expand_data,
                      ctemplate::ExpandEmitter* outbuf,
                      const std::string& arg) const {
    unsigned int maxlen;
    if ((sscanf(arg.c_str(), "=%u", &maxlen) == 1) && (maxlen <= inlen)) {
      outbuf->Emit(in, maxlen);
    } else {
      // in is not guaranteed to be NUL-terminated, so pass the length.
      outbuf->Emit(in, inlen);
    }
  }
};

void modifier_example() {
  MaxlenModifier* maxlen = new MaxlenModifier();
  if (!(ctemplate::AddModifier("x-maxlen=", maxlen))) {
    printf("AddModifier failed\n");
    exit(1);
  }

  ctemplate::TemplateDictionary dict("MAXLEN_TEST");
  dict.SetValue("LONGSTRING", "0123456789abcdefghijklmnopqrstuvwxyz");
  std::string output;
  ctemplate::ExpandTemplate("maxlen.tpl", ctemplate::DO_NOT_STRIP,
                            &dict, &output);
  std::cout << output << std::endl;
}

Our custom modifier is instantiated in the template using x-maxlen=N. Prefixing custom modifiers with "x-" is very strongly encouraged in the ctemplate documentation.

The original string: {{LONGSTRING}}
A maxlen=10  string: {{LONGSTRING:x-maxlen=10}}
A maxlen=20  string: {{LONGSTRING:x-maxlen=20}}
A maxlen=80  string: {{LONGSTRING:x-maxlen=80}}

Here is the output, with the long string truncated to various lengths:

The original string: 0123456789abcdefghijklmnopqrstuvwxyz
A maxlen=10  string: 0123456789
A maxlen=20  string: 0123456789abcdefghij
A maxlen=80  string: 0123456789abcdefghijklmnopqrstuvwxyz
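The truncation rule inside Modify() can be tried out without ctemplate by lifting it into a plain function. apply_maxlen() is a name I made up for this sketch. It takes the same "=N" argument string the modifier receives and returns what the modifier would emit.

```cpp
#include <cstdio>
#include <string>

// Same rule as MaxlenModifier::Modify(): if arg parses as "=N" and the
// input is at least N characters long, keep the first N characters.
// Otherwise pass the input through untouched.
std::string apply_maxlen(const std::string& in, const std::string& arg) {
  unsigned int maxlen;
  if (std::sscanf(arg.c_str(), "=%u", &maxlen) == 1 && maxlen <= in.size()) {
    return in.substr(0, maxlen);
  }
  return in;
}
```

Keeping the logic in a free function like this also makes it easy to unit test, with Modify() reduced to a thin wrapper around it.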

I've found ctemplate to be quite useful, and I hope others do as well.

Thursday, November 18, 2010

Code Snippet: getifaddrs

A few months ago I posted a description of how to use SIOCGIFCONF to retrieve information about interfaces. SIOCGIFCONF is somewhat clunky in that you use an ioctl to find out how many interfaces are present, allocate enough memory to retrieve them all, and then issue another ioctl to actually get the information. To handle the vanishingly small chance that more interfaces will be added during the time you spend allocating memory, a fudge factor of 2x is added to the memory allocation. Because, you know, it's not likely the number of interfaces would double.

That was all very silly, and as it turns out in Linux there is a much better API for retrieving information about interfaces: getifaddrs(). The call handles memory allocation so you don't have to pass in a buffer of sufficient size, though you do have to call freeifaddrs() afterwards to release the memory. getifaddrs() allows each protocol family in the kernel to export information about an interface. The caller has to check the address family of each returned interface to know how to interpret it. For example, AF_INET/AF_INET6 contain the interface address, while AF_PACKET has statistics. Example code for these three families is shown here.

#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/if_link.h>

int main(int argc, char *argv[]) {
  struct ifaddrs *ifaddr;

  if (getifaddrs(&ifaddr) == -1) {
    perror("getifaddrs");
    exit(1);
  }

  struct ifaddrs *ifa = ifaddr;
  for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr != NULL) {
      int family = ifa->ifa_addr->sa_family;
      if (family == AF_INET || family == AF_INET6) {
        char ip_addr[NI_MAXHOST];
        int s = getnameinfo(ifa->ifa_addr,
                            ((family == AF_INET) ? sizeof(struct sockaddr_in) :
                                                   sizeof(struct sockaddr_in6)),
                            ip_addr, sizeof(ip_addr), NULL, 0, NI_NUMERICHOST);
        if (s != 0) {
          printf("getnameinfo() failed: %s\n", gai_strerror(s));
          exit(1);
        } else {
          printf("%-7s: %s\n", ifa->ifa_name, ip_addr);
        }
      } else if (family == AF_PACKET) {
        struct rtnl_link_stats *stats = ifa->ifa_data;
        printf("%-7s:\n"
               "\ttx_packets = %12u, rx_packets = %12u\n"
               "\ttx_bytes   = %12u, rx_bytes   = %12u\n",
               ifa->ifa_name,
               stats->tx_packets, stats->rx_packets,
               stats->tx_bytes, stats->rx_bytes);
      } else {
        printf("%-7s: family=%d\n", ifa->ifa_name, family);
      }
    }
  }

  freeifaddrs(ifaddr);
  exit(0);
}

On my system the output is as follows (though I've obscured the addresses):

lo     :
        tx_packets =     16714641, rx_packets =     16714641
        tx_bytes   =   1943837629, rx_bytes   =   1943837629
eth0   :
        tx_packets =    102862634, rx_packets =    118537985
        tx_bytes   =   3472339330, rx_bytes   =    698859563
gre0   :
        tx_packets =            0, rx_packets =            0
        tx_bytes   =            0, rx_bytes   =            0
lo     : 127.0.0.1
eth0   : 10.0.0.1
lo     : ::1
eth0   : 1111:1111:1111:1111:a800:1ff:fe00:1111
eth0   : fe80::a800:1ff:fe00:1111%eth0

Thursday, October 21, 2010

Code Snippet: getmntent and statfs

A system which stays up for weeks or months at a time needs to monitor various facets of its operation to alert an operator if something unusual occurs. One of the things which should be monitored is disk space, as a full filesystem tends to expose lots of strange and wonderful failure modes. I suspect such monitoring is commonly implemented by invoking popen("df -k") and parsing the output. An alternative is to use the same calls which df uses: getmntent and statfs.

setmntent and getmntent parse a file listing mounted filesystems, generally /etc/mtab on Linux systems. The getmntent_r variant shown below is a glibc-specific extension which is thread safe, requiring that a block of memory be provided in which to store string parameters like the mount point.

#include <mntent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

int main(void) {
  FILE* mtab = setmntent("/etc/mtab", "r");
  struct mntent* m;
  struct mntent mnt;
  char strings[4096];

  if (mtab == NULL) {
    perror("setmntent");
    return 1;
  }

  /* getmntent_r fills in mnt, keeping its strings in our buffer */
  while ((m = getmntent_r(mtab, &mnt, strings, sizeof(strings)))) {
    struct statfs fs;
    if ((mnt.mnt_dir != NULL) && (statfs(mnt.mnt_dir, &fs) == 0)) {
      /* widen before multiplying so 32 bit systems don't overflow */
      unsigned long long int size = (unsigned long long)fs.f_blocks * fs.f_bsize;
      unsigned long long int free = (unsigned long long)fs.f_bfree * fs.f_bsize;
      unsigned long long int avail = (unsigned long long)fs.f_bavail * fs.f_bsize;
      printf("%s %s size=%llu free=%llu avail=%llu\n",
             mnt.mnt_fsname, mnt.mnt_dir, size, free, avail);
    }
  }

  endmntent(mtab);
  return 0;
}

This code likely fails when there are stacked filesystems, where multiple filesystems are mounted one atop another on the same directory. This is done for union mounts where a read-only filesystem like squashfs has a read-write filesystem mounted atop it as an overlay. statfs will retrieve only the topmost filesystem at that mount point. I don't have a solution for this; if anyone can provide one in the comments I'll add it as an update here.
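For monitoring, the raw byte counts usually get boiled down to a percentage which can be compared against a threshold. Here is a minimal sketch of that step. fs_percent_used() is a made-up helper name, not part of any library, and it assumes that f_bavail (the space available to unprivileged processes) is the number an operator cares about.

```c
#include <sys/vfs.h>

/* Hypothetical helper: percentage (0-100) of the filesystem holding `path`
 * which is in use, from the point of view of an unprivileged process.
 * Returns -1 if statfs() fails or the filesystem reports no usable blocks
 * (as pseudo filesystems like /proc tend to do). */
int fs_percent_used(const char *path) {
  struct statfs fs;
  unsigned long long used, usable;

  if (statfs(path, &fs) != 0) return -1;
  if (fs.f_bfree > fs.f_blocks) return -1;   /* nonsensical counts */

  /* Blocks currently in use, and blocks a non-root user could ever fill. */
  used = (unsigned long long)fs.f_blocks - fs.f_bfree;
  usable = used + fs.f_bavail;
  if (usable == 0) return -1;

  /* Round up so a nearly full filesystem never reports less than it uses. */
  return (int)((used * 100 + usable - 1) / usable);
}
```

A monitoring loop could call this for each mnt.mnt_dir returned by getmntent_r and raise an alert when the result crosses, say, 90.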

Wednesday, September 22, 2010

GCC Function Instrumentation

One of gcc's more obscure features is -finstrument-functions. It was implemented by Cygnus Solutions, presumably as part of a contract for somebody-or-other to deliver something-or-other. When enabled, the compiler will emit calls to __cyg_profile_func_enter() and __cyg_profile_func_exit() at the top and bottom of every function.

Let's examine a simple example which prints the function addresses at entry and exit.

#include <stdio.h>

void __cyg_profile_func_enter(void *this_fn, void *call_site)
                              __attribute__((no_instrument_function));
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  printf("ENTER: %p, from %p\n", this_fn, call_site);
} /* __cyg_profile_func_enter */

void __cyg_profile_func_exit(void *this_fn, void *call_site)
                             __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  printf("EXIT:  %p, from %p\n", this_fn, call_site);
} /* __cyg_profile_func_exit */

int foo() {
  return 2;
}

int bar() {
  return 1;
}

int main(int argc, char** argv) {
  printf("foo=%d bar=%d\n", foo(), bar());
}

The __cyg_profile_func_enter and exit functions are passed two parameters: the address of the function being entered/exited, and the address from which it was called. Note the use of the no_instrument_function attribute. If not present, then __cyg_profile_func_enter would be instrumented like any other function. Every call would result in calling the instrumentation again, which results in another call, etc etc until it blows the stack. Previously I've used -finstrument-functions to construct a profiler for a CPU whose interrupt structure was not suitable for a sample-based profiler. All of the routines implementing the profiler were labelled no_instrument_function.

Next we'll examine the output, with just enough of the disassembled binary to make sense of it.

$ cc t.c -finstrument-functions
$ ./a.out
ENTER: 0x4005d0 @ 0x2b59e0d471c4 (calling main)
ENTER: 0x40059d @ 0x40060c       (calling bar)
EXIT:  0x40059d @ 0x40060c       (returning from bar)
ENTER: 0x40056a @ 0x400618       (calling foo)
EXIT:  0x40056a @ 0x400618       (returning from foo)
foo=2 bar=1
EXIT:  0x4005d0 @ 0x2b59e0d471c4 (returning from main)

000000000040056a <foo>:
  40056a: push   %rbp
  ...

000000000040059d <bar>:
  40059d: push   %rbp
  ...

00000000004005d0 <main>:
  4005d0: push   %rbp
  ...
  400607: callq  40059d <bar>
  40060c: mov    %eax,%ebx
  ...
  400613: callq  40056a <foo>
  400618: mov    %eax,%esi
  ...

There are a few interesting things to note in the output.

Though main calls printf, we don't see a call to printf in the output. Function instrumentation is implemented during compilation, and we didn't compile printf; we linked to an existing library. We'll only see the instrumentation for functions compiled with -finstrument-functions.
The call_site is the instruction after the one which vectors over to run the function.
The call_site which called main() looks strange. It is not in the TEXT segment of the binary; it is way up at some weird address. This is address space layout randomization in action: every run of this binary has a different address calling main. I don't know exactly what that routine is, but presumably it is part of the trampoline when the kernel begins executing a new process.

This instrumentation facility is not often used. The aforementioned call graph profiler is the only time I've used it. Nonetheless I hope you find it interesting.

Update 7/2011: In the comments Frank Denis notes that on OSX the functions are profile_func_enter() and profile_func_exit().
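As a small variation on the hooks above, here is a sketch of a call tracer which indents by nesting depth so the call tree is visible at a glance. This is only an illustration of what one might build on the facility, not anything supplied by gcc: trace_depth is a made-up name, and the single global counter assumes a single-threaded program.

```c
#include <stdio.h>

/* Current nesting level of instrumented calls (single-threaded sketch). */
int trace_depth = 0;

void __cyg_profile_func_enter(void *this_fn, void *call_site)
                              __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
                             __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  (void)call_site;
  /* "%*s" with an empty string prints 2*trace_depth spaces of indent. */
  fprintf(stderr, "%*s-> %p\n", 2 * trace_depth, "", this_fn);
  trace_depth++;
}

void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  (void)call_site;
  trace_depth--;
  fprintf(stderr, "%*s<- %p\n", 2 * trace_depth, "", this_fn);
}
```

Link this into a program compiled with -finstrument-functions and every instrumented function will log its entry and exit, with nested calls indented beneath their caller.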

Thursday, September 16, 2010

Code Snippet: D_NOTIFY and inotify

D_NOTIFY is a facility in the Linux 2.4 kernel to monitor a directory for changes. It will send the monitoring application a signal when files in the directory are added, removed, or modified. It will be triggered if a new subdirectory is added, but does not trigger when files in that subdirectory are modified. D_NOTIFY sends a signal as notification. By default it will send SIGIO, though this can be changed.

In Linux 2.6 the far superior inotify interface was added. Where dnotify sends signals, inotify uses a file descriptor suitable for adding to select() or poll(). If your application will always run on 2.6, you should use inotify. If you need to support older kernels, dnotify still works in 2.6.

Code snippets using D_NOTIFY and inotify are provided below. Both examples monitor the current working directory for addition, removal, or modification of files.

D_NOTIFY

Suitable for 2.4 and 2.6 kernels.

#define _GNU_SOURCE  /* for F_SETSIG, F_NOTIFY and the DN_* flags */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

volatile sig_atomic_t modified = 0;

static void handler(int signum, siginfo_t* si, void* data) {
  modified = 1;
}

#define MYSIG (SIGRTMIN+3)

int main(int argc, char** argv) {
  struct sigaction act;
  int fd;

  act.sa_sigaction = handler;
  sigemptyset(&act.sa_mask);
  act.sa_flags = SA_SIGINFO;
  sigaction(MYSIG, &act, NULL);

  fd = open(".", O_RDONLY);
  /* The default signal is SIGIO, but we use MYSIG instead */
  fcntl(fd, F_SETSIG, MYSIG);
  fcntl(fd, F_NOTIFY, DN_MODIFY | DN_CREATE | DN_DELETE | DN_MULTISHOT);

  while (1) {
    pause();

    if (modified) {
      printf("Directory modified!\n");
      modified = 0;
    }
  }
}

inotify

Suitable for 2.6 kernels, with a much nicer (non-signals based) API.

#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(int argc, char** argv) {
  int fd, watchdir, rlen;
  /* there is a variable length filename in the inotify_event, need to leave room for it. */
  char buf[sizeof(struct inotify_event) + 256];

  if ((fd = inotify_init()) < 0) {
    perror("inotify_init failed");
    exit(1);
  }

  if ((watchdir = inotify_add_watch (fd, ".",
                   IN_MODIFY | IN_CREATE | IN_DELETE)) < 0) {
    perror("inotify_add_watch failed");
    exit(2);
  }

  while ((rlen = read(fd, buf, sizeof(buf))) > 0) {
    struct inotify_event* event = (struct inotify_event*) buf;
    /* can examine event-> mask to determine what happened */
    printf("Directory modified!\n");
  }
}
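One read() on an inotify descriptor can return several events packed back to back, each followed by a variable length name, so treating the buffer as a single event can miss some of them. Below is a minimal sketch of walking the whole buffer. handle_inotify_events() is a hypothetical name, not part of the inotify API, and the three-way classification is only for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>

/* Walk every event packed into buf[0..len), as returned by one read() on an
 * inotify descriptor. Prints what happened and returns the event count. */
int handle_inotify_events(const char *buf, size_t len) {
  size_t off = 0;
  int count = 0;

  while (off + sizeof(struct inotify_event) <= len) {
    struct inotify_event ev;
    const char *name = "";
    const char *what = "other";

    /* Copy the fixed-size header out, avoiding unaligned access into buf. */
    memcpy(&ev, buf + off, sizeof(ev));
    if (off + sizeof(ev) + ev.len > len) break;   /* truncated event */
    if (ev.len > 0) name = buf + off + sizeof(ev);

    if (ev.mask & IN_CREATE) what = "created";
    else if (ev.mask & IN_DELETE) what = "deleted";
    else if (ev.mask & IN_MODIFY) what = "modified";

    printf("%s: %s\n", what, name);
    off += sizeof(ev) + ev.len;   /* ev.len includes the name's padding */
    count++;
  }
  return count;
}
```

In the loop above, the body of the while would become a call to handle_inotify_events(buf, rlen).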

Thursday, June 24, 2010

Virtual Trouble

After many years of working in plain C, I'm back to writing C++. I feel like an unfrozen caveman, confused by the flashing lights of the big city. Here is something I ran into recently.

#include <stdio.h>

class BaseClass {
 public:
  BaseClass() { InitName(); }
  virtual void InitName() { name_ = "BaseClass"; }
  const char *name_;
};

class SubClass : public BaseClass {
 public:
  virtual void InitName() { name_ = "SubClass"; }
};

int main(int argc, char** argv) {
  BaseClass base;
  SubClass sub;

  printf("BaseClass name_ = %s\n", base.name_);
  printf("SubClass  name_ = %s\n", sub.name_);
}

A base class provides a virtual InitName() method, and calls it from the constructor. A subclass overrides InitName(), yet the overridden method is not called during construction. The BaseClass InitName() is used instead.

$ ./a.out
BaseClass name_ = BaseClass
SubClass  name_ = BaseClass

Why?

A Maze of Twisty Little Passages

Objects are constructed from the most basic class first. When the BaseClass() constructor runs, the SubClass methods and member variables have not yet been initialized. The object is a BaseClass object at that point. When BaseClass::BaseClass() returns, the object will be remarked as a SubClass object, and only then will its overridden methods actually do anything. Destructors work similarly. The outermost derived class is destroyed first, and by the time BaseClass::~BaseClass() runs the object will be of BaseClass type. Any virtual methods called from ~BaseClass() will call the BaseClass definition.

Scott Meyers' Effective C++, 3rd Edition devotes a chapter to this topic, with considerably more detail. That chapter happens to be available online in an excerpt by the publisher.

For my specific issue, my object already had an Init() method to be called after object construction. It was straightforward to move the functionality from the constructor to Init(), with some checks to make it do something sensible if the caller neglects to call Init().

Thursday, April 15, 2010

Code Snippet: hash_map

hash_map is not part of the current C++ STL, but is universally adopted as an extension. Using hash_map is simple when using built-in primitives like int or char*. Attempting to use user-defined classes as a key is somewhat more difficult, as the following example demonstrates:

class example1 {
 public:
  example1() {};
  ~example1() {};

  uint8_t name_[8];
};

typedef __gnu_cxx::hash_map<example1,int> Example1HashType;
Example1HashType hash_map1;

int main(int argc, char **argv) {
  example1 e1;
  hash_map1.insert(std::make_pair(e1, 1));
}

When compiled with gcc 4.2 this results in:

/usr/include/c++/4.2/bits/stl_function.h:200: error: no match for ‘operator==’ in ‘__x == __y’

/usr/include/c++/4.2/ext/hashtable.h:595: error: no match for call to ‘(const __gnu_cxx::hash<example1>) (const example1&)’

hash_map requires two things: a hash function, and an ability to compare equality. To use a class as a key you need to provide a hash<> template specialized for the class in question. You also need to implement an equality operator.

#include <stdint.h>
#include <string.h>
#include <ext/hash_map>

class example1 {
 public:
  example1() {};
  ~example1() {};

  bool operator==(const example1 &other) const {
    return (memcmp(name_, other.name_, sizeof(name_)) == 0);
  };

  uint8_t name_[8];
};

namespace __gnu_cxx {
template<> struct hash<example1> {
  size_t operator()(const example1& k) const {
    size_t hashval = 0;
    for (int i = 0; i < sizeof(k.name_); ++i) {
      hashval = 5 * hashval + k.name_[i];
    }
    return hashval;
  }
};
}  // namespace __gnu_cxx

typedef __gnu_cxx::hash_map<example1,int> Example1HashType;

int main(int argc, char **argv) {
  example1 e1;
  Example1HashType hash_map1;

  hash_map1.insert(std::make_pair(e1, 1));
}

This works, but what if we cannot add an operator method? For example, perhaps the class is in a library which we cannot modify, or is created by a code generator. Consider this case where the key is a simple struct with no member functions. This code fails to compile, for lack of "__x == __y".

struct example2 {
  uint8_t name_[8];
};

namespace __gnu_cxx {
template<> struct hash<example2> {
  size_t operator()(const example2& k) const {
    size_t hashval = 0;
    for (int i = 0; i < sizeof(k.name_); ++i) {
      hashval = 5 * hashval + k.name_[i];
    }
    return hashval;
  }
};
}  // namespace __gnu_cxx

typedef __gnu_cxx::hash_map<example2,int> Example2HashType;
Example2HashType hash_map2;

int main(int argc, char **argv) {
  example2 e2;
  hash_map2.insert(std::make_pair(e2, 1));
}

As this is C++, the solution will of course involve more templates. hash_map does not directly invoke "x == y," it uses an equal_to<> template. The "__x == __y" compiler error is from the template. We can provide an equal_to<> specialization instead.

struct example2 {
  uint8_t name_[8];
};

namespace std {
template<> struct equal_to<example2> {
  bool operator()(const example2& x, const example2& y) const {
    return (memcmp(x.name_, y.name_, sizeof(x.name_)) == 0);
  }
};
}  // namespace std

namespace __gnu_cxx {
template<> struct hash<example2> {
  size_t operator()(const example2& k) const {
    size_t hashval = 0;
    for (int i = 0; i < sizeof(k.name_); ++i) {
      hashval = 5 * hashval + k.name_[i];
    }
    return hashval;
  }
};
}  // namespace __gnu_cxx

typedef __gnu_cxx::hash_map<example2,int> Example2HashType;
Example2HashType hash_map2;

int main(int argc, char **argv) {
  example2 e2;
  hash_map2.insert(std::make_pair(e2, 1));
}

Extending hash_map to custom key types makes it useful in far more situations.

Wednesday, March 10, 2010

Code Snippet: SIOCGIFCONF

A little while ago in this space we discussed SO_BINDTODEVICE, the socket option to control which physical interface will be used for packet ingress/egress. Recently in the comments of that post a question was posed: if you know the IP address of the interface, how do you programmatically find its name?

If there is a direct way to pass in an IP address and get back the interface name, I don't know it. The mechanism I know of is to retrieve the interface list from the kernel and walk through it until you find the IP address you're looking for. The code snippet below demonstrates the technique: the first use of SIOCGIFCONF determines the amount of memory we need, the second retrieves the interface list.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main()
{
   struct ifreq *ifr;
   struct ifconf ifc;
   int s, i;
   int numif;

   // find number of interfaces.
   memset(&ifc, 0, sizeof(ifc));
   ifc.ifc_ifcu.ifcu_req = NULL;
   ifc.ifc_len = 0;

   if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
     perror("socket");
     exit(1);
   }

   if (ioctl(s, SIOCGIFCONF, &ifc) < 0) {
     perror("ioctl");
     exit(2);
   }

   if ((ifr = malloc(ifc.ifc_len)) == NULL) {
     perror("malloc");
     exit(3);
   }
   ifc.ifc_ifcu.ifcu_req = ifr;

   if (ioctl(s, SIOCGIFCONF, &ifc) < 0) {
     perror("ioctl2");
     exit(4);
   }
   close(s);

   numif = ifc.ifc_len / sizeof(struct ifreq);
   for (i = 0; i < numif; i++) {
     struct ifreq *r = &ifr[i];
     struct sockaddr_in *sin = (struct sockaddr_in *)&r->ifr_addr;

     printf("%-8s : %s\n", r->ifr_name, inet_ntoa(sin->sin_addr));
   }

   free(ifr);
   exit(0);
}

Updates: Mike Ditto points out that the number of interfaces can change between the first call to SIOCGIFCONF and the second, as some workloads result in frequent netdev creation. He advises "ifc.ifc_len = ifc.ifc_len * 2;" before calling malloc. Michael Reed notes that unistd.h is required. It worked for me without it, but only because one of the other includes was pulling it in.

Tuesday, November 17, 2009

24.855134809027 Days

There have been issues with the autofocus on the Motorola Droid phone, which suddenly resolved themselves this morning and led to speculation of a stealth update. There is a fascinating comment in the Engadget forums by Dan Morrill (and noted in a tweet from Matt Cutts):

There's a rounding-error bug in the camera driver's autofocus routine (which uses a timestamp) that causes autofocus to behave poorly on a 24.5-day cycle. That is, it'll work for 24.5 days, then have poor performance for 24.5 days, then work again.

I suspect it is exactly 24 days, 20 hours, 31 minutes, 23 seconds, and 647 milliseconds, the amount of time for a millisecond quantity to overflow a signed 32 bit integer. This is a relatively common programming error, and one which can slip through a compressed QA schedule. In the case of the Droid, the camera was working fine while the QA team tested it and then stopped working slightly after the product shipped.
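The arithmetic is easy to check. Here is a small sketch which splits a millisecond count into days, hours, minutes, seconds, and milliseconds; split_millis() and struct uptime are names invented for this example.

```c
#include <stdint.h>

struct uptime {
  unsigned days, hours, minutes, seconds, millis;
};

/* Split a millisecond count into days/hours/minutes/seconds/milliseconds. */
struct uptime split_millis(uint64_t ms) {
  struct uptime u;

  u.millis  = (unsigned)(ms % 1000);  ms /= 1000;   /* ms is now seconds */
  u.seconds = (unsigned)(ms % 60);    ms /= 60;     /* now minutes */
  u.minutes = (unsigned)(ms % 60);    ms /= 60;     /* now hours */
  u.hours   = (unsigned)(ms % 24);    ms /= 24;     /* now days */
  u.days    = (unsigned)ms;
  return u;
}
```

Calling split_millis(INT32_MAX), which is 2147483647 milliseconds, yields 24 days, 20 hours, 31 minutes, 23 seconds, and 647 milliseconds. One millisecond later, a signed 32 bit counter goes negative.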

Thursday, October 8, 2009

Code Snippet: SO_BINDTODEVICE

In a system with multiple network interfaces, can you constrain a packet to go out one specific interface? If you answered "bind() the socket to an address," you should read on.

Why might one need to strictly control where packets can be routed? The best use case I know is when ethernet is used as a control plane inside a product. Packets intended to go to another card within the chassis must not, under any circumstances, leave the chassis. You don't want bugs or misconfiguration to result in leaking control traffic.

The bind() system call is frequently misunderstood. It is used to bind to a particular IP address. Only packets destined to that IP address will be received, and any transmitted packets will carry that IP address as their source. bind() does not control anything about the routing of transmitted packets. So for example, if you bound to the IP address of eth0 but you send a packet to a destination where the kernel's best route goes out eth1, it will happily send the packet out eth1 with the source IP address of eth0. This is perfectly valid for TCP/IP, where packets can traverse unrelated networks on their way to the destination.

In Linux, to control the physical topology of communication you use the SO_BINDTODEVICE socket option.

#include <netinet/in.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    int s;
    struct ifreq ifr;

    if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket");
        exit(1);
    }

    /* only the interface name in the ifreq is examined */
    memset(&ifr, 0, sizeof(ifr));
    snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), "eth0");
    if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE,
                (void *)&ifr, sizeof(ifr)) < 0) {
        perror("SO_BINDTODEVICE");
        exit(2);
    }

    /* ... use the socket ... */
    return 0;
}

SO_BINDTODEVICE forces packets on the socket to only egress the bound interface, regardless of what the IP routing table would normally choose. Similarly, only packets which ingress the bound interface will be received on the socket; packets from other interfaces will not be delivered.

There is no particular interaction between bind() and SO_BINDTODEVICE. It is certainly possible to bind to the IP address of the interface to which one will also SO_BINDTODEVICE, as this will ensure that the packets carry the desired source IP address. It is also permissible, albeit weird, to bind to the IP address of one interface but SO_BINDTODEVICE a different interface. It is unlikely that any ingress packets will carry the proper combination of destination IP address and ingress interface, but for very special use cases it could be done.

Monday, September 28, 2009

49.710269618056 Days

Western Digital recently corrected a firmware issue in certain models of VelociRaptor where the drive would erroneously report an error to the host after 49 days of operation. Somewhat inconveniently for RAID arrays, if all drives powered on at the same time they would all report an error at the same time.

Informed speculation: the drive reports an error after exactly 49 days, 17 hours, 2 minutes, 47 seconds, and 294.999 milliseconds of operation. That is the moment where a millisecond timer overflows an unsigned 32 bit integer.
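Western Digital's firmware is not public, so the following is only an illustration of the general class of bug, with made-up function names. A comparison of raw counter values breaks when the counter wraps past zero, while subtracting first and interpreting the difference as signed tolerates the wrap.

```c
#include <stdint.h>

/* Milliseconds between two readings of a free-running 32 bit counter.
 * Unsigned subtraction is modulo 2^32, so this survives one wrap. */
uint32_t elapsed_ms(uint32_t earlier, uint32_t now) {
  return now - earlier;
}

/* Naive check: gives the wrong answer once `now` has wrapped past zero. */
int deadline_passed_naive(uint32_t now, uint32_t deadline) {
  return now >= deadline;
}

/* Wrap-tolerant check: valid while the two values are less than
 * 2^31 ms (roughly 24.8 days) apart. */
int deadline_passed(uint32_t now, uint32_t deadline) {
  return (int32_t)(now - deadline) >= 0;
}
```

With a deadline of 0xFFFFFFF0 and a counter which has since wrapped around to 5, the naive check still claims the deadline is in the future, while the wrap-tolerant one reports that it passed 21 milliseconds ago.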

Tuesday, June 30, 2009

Post-mortem debugging: core files

When one has a core file, one runs gdb. It's simply the way things were Meant To Be, right? Yet gdb isn't the right tool for the job in all cases. If you're dealing with a corrupted heap, gdb is not very helpful. You can see the portion of the heap which caused the process to fault (most likely in malloc or free), but identifying the junk in memory is an exercise in puzzling it out and looking for patterns. It is often useful in such cases to search the rest of the process address space for pointers into the corrupted area of the heap. Though gdb can be used to search for patterns in memory, it isn't very good at it. For example consider the following macro:

define searchmem
    set $start = (char *) $arg0
    set $end = (char *) $arg1
    set $pattern = (unsigned int) $arg2
    set $p = $start
    while $p < $end
        if (*(unsigned int *) $p) == $pattern
            printf "pattern 0x%x found at 0x%x\n", $pattern, $p
        end
        set $p++
    end
end

document searchmem
    search between $arg0 and $arg1 for pattern $arg2
end

This macro can look only for 32 bit numbers, not any sort of regular expression, and it is very, very slow. We'd really like to run grep, but if gdb provides a way to run grep over the core contents I haven't found it. Instead, Gentle Reader, we'll write a utility to output the core file to text so that grep or any other Unix tool can be used. This would be a job for od or hexdump except for two things:

We'd like to see the addresses of the data being dumped.
The core might be in the wrong endianness, such as a MIPS-BE core file on an x86 host.

Instead of using a generic binary dump like od we'll construct a tool specifically for core files, but first we need to understand their contents.

Process Address Space

Linux and modern Unix-ish operating systems dynamically map shared libraries into the process address space. These libraries are not packed tightly up against one another. For alignment and page protection reasons, there are gaps between them.

Each library generally consists of multiple segments:

instructions, called the TEXT segment.
uninitialized data to be zero filled, called BSS
initialized data (i.e. variables initialized to a non-zero value), called DATA

You can see the memory segments for a running process in /proc/<pid>/maps. Here is an example:

# cat /proc/1261/maps
0fc73000-0fc80000 r-xp 00000000 01:00 713        /lib/libA.so.1
0fc80000-0fc83000 ---p 0000d000 01:00 713        /lib/libA.so.1
0fc83000-0fc92000 rwxp 00000000 01:00 713        /lib/libA.so.1
...deleted...
10000000-1000c000 r-xp 00000000 01:00 1190       /bin/myapp
1001b000-1001c000 rwxp 0000b000 01:00 1190       /bin/myapp
...deleted...

libA's callable functions are in the TEXT segment, mapped at addresses 0x0fc73000 through 0x0fc80000. Note the permission bits on these pages: read-only plus executable. Write permission is denied, making it safe to share the same physical pages of RAM amongst multiple processes.

The libA BSS segment extends from 0x0fc80000 through 0x0fc83000. These pages are mapped with no permissions at all, even a read access will trigger a page fault. The kernel will supply a zero-filled page on the first fault, and mark its permissions as read+write.

The libA DATA segment is at 0x0fc83000 through 0x0fc92000. This region is populated with the initialized data, so it has read+write permissions already. No page fault will be triggered on access to these pages.

When the process dies, the core file needs to preserve these scattered memory areas. Core files from a Linux process are written in ELF format, which I will stubbornly continue to refer to as the Extensible Linking Format. ELF defines data sections with associated virtual addresses, allowing it to describe data scattered across a process address space. Let's examine the output of "readelf -l" on the core from this process:

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz   Flg Align
  NOTE           0x0003f4 0x00000000 0x00000000 0x006f4 0x00000      0
  LOAD           0x001000 0x0fc73000 0x00000000 0x00000 0x0d000  R E 0x1000
  LOAD           0x001000 0x0fc80000 0x00000000 0x00000 0x03000      0x1000
  LOAD           0x001000 0x0fc83000 0x00000000 0x0f000 0x0f000  RWE 0x1000
  ..etc...

The NOTE section contains various global information about the dead process, including its name and the contents of the CPU registers at the time it died. If you use "objdump -h" to list the core sections you'll see two additional sections: reg0/2 and reg0. These sections don't actually exist in the file, objdump breaks out the register contents from the NOTE section into their own pseudo-sections.

The LOAD sections contain the data from the process address space. The first one starts at a virtual address of 0xfc73000. That is the TEXT segment for libA, which we looked at earlier. Note the FileSiz is 0: to save space, the kernel skips dumping the contents of non-writable pages. gdb would fetch these sections from the executable file instead. The second LOAD section contains the libA BSS, and similarly its FileSiz is zero: this process died before it ever referenced anything in the BSS of libA. None of the pages had been faulted in and therefore none were writable.

The third LOAD section is the interesting one. This is the initialized DATA section. Because these pages were writable, they were saved to the core file and so the FileSiz is 0x0f000 (which matches the size of the segment from the /proc/<pid>/maps file, above).

BFD

libbfd is one of several libraries available for working with ELF files. We'll use libbfd to process the core file, dumping it as hex words to a text file.

According to the history I can find, libbfd was created at Cygnus Support to deal with the myriad binary formats that sprang up in the 1980s and 90s. In response to how difficult it would be to encapsulate the various formats in a single library David Henkel-Wallace reportedly responded "big f---ing deal," and thus libbfd was christened. The name has since been clarified as "binary file descriptor."

    #include <bfd.h>

    bfd        *abfd;
    asection   *sect;
    char       *corefilename;
    enum bfd_endian endian;

    if ((abfd = bfd_openr(corefilename, NULL)) == NULL) {
        /* ... error handling ... */
    }

First we open the file using bfd_openr(). abfd is the handle used to access the file; all other libbfd APIs take it as an argument.

    if (!bfd_check_format (abfd, bfd_core)) {
        printf("%s does not appear to be a core file.\n", corefilename);
        /* ... error handling ... */
    }

We check that the file is actually a core and not some other type of file. bfd_core is part of an enumerated type; other types which libbfd can check for include bfd_object for ELF programs or .o files, and bfd_archive for ar-style archives.

    endian = abfd->xvec->byteorder;

libbfd handles the endianness of fields in the program and section headers, returning the result in the CPU's native byte order regardless of the endianness of the ELF file. It doesn't do anything about the data within those sections. Since we're going to dump the data in the core file, we need to know the endianness.

    for (sect = abfd->sections; sect != NULL; sect = sect->next) {
        bfd_vma        vma  = bfd_section_vma (abfd, sect);
        bfd_size_type  size = bfd_section_size(abfd, sect);
        const char    *name = bfd_section_name(abfd, sect);

We loop over each ELF section, pulling out the virtual address and size of the data it holds. The size will be zero in many cases, as shown above for the libA TEXT and BSS sections.

        if (!bfd_get_section_contents(abfd, sect, buf, (file_ptr)0, size)) {
            fprintf(stderr, "Could not read section %s\n", name);
            exit(5);
        }

We fetch the contents of each section, ready to print them to hex.

With a modicum of string manipulation we have a hex dump of the core contents:

0fc84710:  615f6669  6e616c69  7a65005f  4a765f52    a_finalize._Jv_R
0fc84720:  65676973  74657243  6c617373  6573006c    egisterClasses.l
0fc84730:  6962632e  736f2e36  0061646c  65723332    ibc.so.6.adler32
0fc84740:  00636f6d  70726573  73320064  65666c61    .compress2.defla

libbfd licensing

libbfd is GPL. It is not LGPL, which allows dynamic linking to proprietary code, but the full GPL. Any use of libbfd encumbers the rest of the software linked to it with the GPL and obligates you to provide the source code to anyone to whom you provide a binary. It's important to consider what this means: if you're developing tools for internal use by developers, the GPL is not onerous. Indeed, the source code of the utility will likely be checked into the version control system that all developers can access anyway.

More later...

If you need to incorporate ELF file support into a proprietary software tool, then libbfd is not usable. I intended to provide the same core2hex example using FreeBSD's libelf, but this article is already quite long so I'm going to defer it until next time.

Friday, May 15, 2009

Pre-mortem Backtracing

A backtrace is often the first step in debugging a problem. Generating a backtrace is generally thought of as a function of the debugger, on a core file after a process has died. However it is sometimes quite useful to generate a live backtrace while a process runs. For example, crashing the process in the field may not be acceptable if a problem is survivable. Logging a backtrace and other information can provide enough to locate the root cause, without having to trigger any customer downtime.

gcc backtrace support

The simplest way to get a crude backtrace is the __builtin_return_address() builtin supplied by gcc. You provide the frame number you want to retrieve, and get the return address for that stack frame:

void do_backtrace2()
{
    void *pc0 = __builtin_return_address(0);
    void *pc1 = __builtin_return_address(1);
    void *pc2 = __builtin_return_address(2);
    void *pc3 = __builtin_return_address(3);

    printf("Frame 0: PC=%p\n", pc0);
    printf("Frame 1: PC=%p\n", pc1);
    printf("Frame 2: PC=%p\n", pc2);
    printf("Frame 3: PC=%p\n", pc3);
}

This code will produce the following output:

Frame 0: PC=0x80483ca
Frame 1: PC=0x80483e1
Frame 2: PC=0x62079d
Frame 3: PC=0x80482b9

__builtin_return_address() has significant limitations. It constructs code at compile time to walk back through the stack frames, which means you cannot use a variable in a loop; you can only use integer constants like the 0, 1, 2, 3 shown above. Also, on some architectures, including my beloved MIPS, only __builtin_return_address(0) works. MIPS has no frame pointer, making it difficult to walk back up the stack, whereas frame 0 can use the return address register directly.
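Since frame 0 is the one level that works everywhere, a small logging macro built on it is often all that is needed. The following is a minimal sketch, not a definitive implementation: LOG_CALLER and expensive_operation are hypothetical names chosen for illustration, and the noinline attribute is an assumption made so that the function keeps a real stack frame of its own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical macro: expands inside the instrumented function, so
 * frame 0 is that function's own return address, i.e. its call site. */
#define LOG_CALLER() \
    fprintf(stderr, "%s called from %p\n", __func__, \
            __builtin_return_address(0))

/* Hypothetical function being instrumented. noinline keeps it a real
 * call; if it were inlined, frame 0 would belong to the caller. */
__attribute__((noinline))
void *expensive_operation(void)
{
    void *pc = __builtin_return_address(0);  /* constant argument only */

    assert(pc != NULL);  /* sanity check on the captured address */
    LOG_CALLER();
    return pc;
}
```

The logged address can be turned back into a file and line after the fact with a tool such as addr2line, provided the binary was built with -g.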

glibc's backtrace()

glibc contains a simple backtrace function, which is somewhat more powerful than __builtin_return_address(). The backtrace() call populates an array with the program counter of each calling function, while a separate backtrace_symbols() call can look up the symbolic names for each address:

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define BACKTRACE_SIZ   64
void do_backtrace()
{
    void    *array[BACKTRACE_SIZ];
    int      size, i;   /* backtrace() returns an int */
    char   **strings;

    size = backtrace(array, BACKTRACE_SIZ);
    strings = backtrace_symbols(array, size);
    if (strings == NULL) {
        perror("backtrace_symbols");
        return;
    }

    for (i = 0; i < size; i++) {
        printf("%p : %s\n", array[i], strings[i]);
    }

    free(strings);  // malloced by backtrace_symbols
}

The output shows the backtrace with the address of each function call site:

# gcc -g -o backtrace ./backtrace.c
# ./backtrace 
0x8048422 : ./backtrace(backtrace_symbols+0xd6) [0x8048422]
0x80484be : ./backtrace(backtrace_symbols+0x172) [0x80484be]
0x80484d5 : ./backtrace(backtrace_symbols+0x189) [0x80484d5]
0x071479d : /lib/tls/libc.so.6(__libc_start_main+0xed) [0x71479d]
0x804837d : ./backtrace(backtrace_symbols+0x31) [0x804837d]

To get useful symbolic names, the -rdynamic option must be passed to the linker:

# gcc -g -rdynamic -o backtrace ./backtrace.c
# ./backtrace 
0x804874a : ./backtrace(do_backtrace+0x1a) [0x804874a]
0x80487e6 : ./backtrace(foo1+0xb) [0x80487e6]
0x80487fd : ./backtrace(main+0x15) [0x80487fd]
0x012679d : /lib/tls/libc.so.6(__libc_start_main+0xed) [0x12679d]
0x80486a5 : ./backtrace(backtrace_symbols+0x31) [0x80486a5]

There is also a backtrace_symbols_fd() function, which nicely prints the output to a file descriptor without having to malloc a table of strings. If that's all you're trying to do, it is a simpler API.
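A minimal sketch of the backtrace_symbols_fd() variant, assuming glibc's execinfo.h is available. The function name dump_backtrace_fd and its return value (the number of frames captured) are choices made for this illustration only.

```c
#include <assert.h>
#include <execinfo.h>
#include <unistd.h>

#define BACKTRACE_SIZ   64

/* Hypothetical helper: writes the current backtrace to fd, one frame
 * per line, and returns how many frames were captured. */
int dump_backtrace_fd(int fd)
{
    void *array[BACKTRACE_SIZ];
    int   size;

    assert(fd >= 0);  /* caller must pass an open descriptor */

    size = backtrace(array, BACKTRACE_SIZ);
    /* No table of strings to allocate or free in this variant. */
    backtrace_symbols_fd(array, size, fd);
    return size;
}
```

Calling dump_backtrace_fd(STDERR_FILENO) produces lines in the same format as the backtrace_symbols() example, and -rdynamic is still needed for useful symbol names.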

As an aside: notice how the address of __libc_start_main varies in the examples above, 0x62079d versus 0x71479d versus 0x12679d? That is address space randomization in action. libc is mapped at a randomized base address every time a binary is started. The offset of __libc_start_main within the page is a constant 0x79d, but the upper bits of the address will vary from one run to the next. This helps prevent buffer overflow and other code injection attacks, by randomizing the address of library routines.

libunwind

libunwind is a library for extracting call chain information. It supports many different CPU architectures. Here is an example of using libunwind to accomplish a result similar to glibc's backtrace() function:

#define UNW_LOCAL_ONLY  /* local unwinding only; must precede the include */
#include <libunwind.h>
#include <stdio.h>

void do_backtrace2()
{
    unw_cursor_t    cursor;
    unw_context_t   context;

    unw_getcontext(&context);
    unw_init_local(&cursor, &context);

    while (unw_step(&cursor) > 0) {
        unw_word_t  offset = 0, pc = 0;
        char        fname[64];

        unw_get_reg(&cursor, UNW_REG_IP, &pc);

        fname[0] = '\0';
        (void) unw_get_proc_name(&cursor, fname, sizeof(fname), &offset);

        /* unw_word_t is an integer type, so cast to match the formats */
        printf("%p : (%s+0x%lx) [%p]\n",
               (void *)pc, fname, (unsigned long)offset, (void *)pc);
    }
}

The output:

0x80486b3 : (foo+0xb) [0x80486b3]
0x80486ca : (main+0x15) [0x80486ca]
0x016379d : (__libc_start_main+0xed) [0x16379d]
0x80484c9 : (_start+0x21) [0x80484c9]

That is quite a bit more code to get a simple backtrace, but libunwind offers more capability to examine the program state at each frame. For example, one can print the saved register values:

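#define UNW_LOCAL_ONLY  /* local unwinding only; must precede the include */
#include <libunwind.h>
#include <stdio.h>

void do_backtrace2()
{
    unw_cursor_t    cursor;
    unw_context_t   context;

    unw_getcontext(&context);
    unw_init_local(&cursor, &context);
    while (unw_step(&cursor) > 0) {
        unw_word_t  offset = 0;
        unw_word_t  pc = 0, eax = 0, ebx = 0, ecx = 0, edx = 0;
        char        fname[64];

        /* The UNW_X86_* register names are specific to 32-bit x86 */
        unw_get_reg(&cursor, UNW_REG_IP,  &pc);
        unw_get_reg(&cursor, UNW_X86_EAX, &eax);
        unw_get_reg(&cursor, UNW_X86_EDX, &edx);
        unw_get_reg(&cursor, UNW_X86_ECX, &ecx);
        unw_get_reg(&cursor, UNW_X86_EBX, &ebx);

        fname[0] = '\0';
        (void) unw_get_proc_name(&cursor, fname, sizeof(fname), &offset);
        /* unw_word_t is an integer type, so cast to match the formats */
        printf("%p : (%s+0x%lx) [%p]\n",
               (void *)pc, fname, (unsigned long)offset, (void *)pc);
        printf("\tEAX=0x%08lx EDX=0x%08lx ECX=0x%08lx EBX=0x%08lx\n",
               (unsigned long)eax, (unsigned long)edx,
               (unsigned long)ecx, (unsigned long)ebx);
    }
}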

The output:

0x80486b3 : (foo1+0xb) [0x80486b3]
 EAX=0x00000000 EDX=0x00b548b0 ECX=0x00000000 EBX=0x00000000
0x80486ca : (main+0x15) [0x80486ca]
 EAX=0x00000000 EDX=0x00b548b0 ECX=0x00000000 EBX=0x00000000
0x044879d : (__libc_start_main+0xed) [0x44879d]
 EAX=0x00000000 EDX=0x003368b0 ECX=0x00000000 EBX=0x00000000
0x80484c9 : (_start+0x21) [0x80484c9]
 EAX=0x00000000 EDX=0x00b548b0 ECX=0x00000000 EBX=0x00000000

When would this be useful? Given the relative costs, it is not unusual to have an embedded CPU with considerably more RAM than flash. In the event of a crash there may not be enough flash to save a full core file. Having the process deliberately catch SIGSEGV and dump its own backtrace with register values means you'd at least have something to work with even if there is no core file.

Conclusion

Over time I think I have used __builtin_return_address(0) more often than any of the other techniques. Whether constructing simple performance instrumentation or logging problems from the field, knowing the caller has often been sufficient. For more extensive backtrace functionality I end up using libunwind. The backtrace() function in glibc always seems to be too heavy for the simple stuff yet not sufficient for the complex stuff.

Thursday, February 26, 2009

Inadvisable Externing

The Principle of the Conservation of Software Quality postulates that for every best practice there is an equal and opposite worst practice which will be easier to implement. We're here today to talk about one of those balancing factors: declaration of externs within the C code.

Assume for a moment that foo() is defined somewhere within the codebase and, for reasons unclear, there is no header file declaring foo(). Perhaps foo() was not deemed important enough, or had been static when originally written and only later opened up to the rest of the module. However we got here, if one has adopted the best practice of using -Wall -Werror when compiling, then calling foo() will require a declaration or result in a compilation error. One is then faced with several choices:

Add foo() to an existing header file
Make a new header file for foo() and any similar routines
Declare foo() as an extern in the C file where it will be called.

Perhaps the Gentle Reader is unfamiliar with that last option. If so, I salute you: your unfamiliarity with it speaks well of you. To summarize: "extern" means the function is not provided within this compilation unit and will be supplied later when linking. It is normally used in header files, but is also valid within C code:

do {
    extern int foo(int a, int b);
    c = foo(a, b);
} while (c != 1);

The benefit of a function declaration in a header file is that it can be included in multiple places, in the file which implements the function and any file which wants to use the function. If the implementation of the function changes such that it no longer matches the declaration, an error will result. If the header changes such that the callers no longer match, an error will also result. Declaring an extern within a C file accomplishes none of these things, because the compiler does not get to see both the declaration and the implementation of foo at the same time.

Let's examine what will happen if a third argument is added to foo() but the caller blissfully relies on an existing extern declaration with two arguments. We'll use one of my favorite techniques, disassembling a MIPS binary to see how it works. We'll use a ridiculously simple example for clarity:

int foo(int a, int b, int c)
{
    return (a + b + c);
}

<foo>:
addu v0,a0,a1	tmp = a + b
addu v0,v0,a2	tmp = tmp + c
jr ra	return to caller

foo() expects arguments in registers a0, a1, and a2, and sums them together. Now let's look at the code generated by an extern declaration with only two arguments:

extern int foo(int a, int b); /* Worst Practice */
int main()
{
    return foo(1, 2);
}

<main>:
li a0,1	load 1 into the first arg register
li a1,2	load 2 into the second arg register
lw t9,0(gp)	load address of foo()
jalr t9	jump to foo()

Not surprisingly, only registers a0 and a1 are loaded with values. The important point to note is that register a2 is not touched at all. One might have intuitively assumed that the third argument would be zero, but this is not the case. The third argument will be whatever garbage happens to be in register a2.

The third argument could be most anything, and might vary depending on the call chain or the data being processed. We end up with a sporadic and difficult to diagnose bug, hidden in a way that the compiler cannot help, and all because we didn't want to bother with a header file. This happened at a previous employer of mine, where a particular process would reliably crash at just one customer site because their particular environment ended up with a non-NULL value in the argument register. It was, of course, a crucial customer.

The moral of this article is that header files are your friend.
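For contrast, here is a minimal sketch of the header-based arrangement, collapsed into a single listing so it can be compiled as-is. The file names in the comments and the call_foo() wrapper are hypothetical. In a real tree the declaration would live in foo.h and both foo.c and the caller would include it.

```c
#include <assert.h>

/* --- foo.h (hypothetical): the single shared declaration --- */
int foo(int a, int b, int c);

/* --- foo.c: includes foo.h, so a definition that drifts away from
 * the declaration is rejected as having conflicting types --- */
int foo(int a, int b, int c)
{
    return a + b + c;
}

/* --- caller.c: also includes foo.h, so writing foo(1, 2) here is
 * rejected for having too few arguments --- */
int call_foo(void)
{
    int sum = foo(1, 2, 3);

    assert(sum == 6);
    return sum;
}
```

Because the compiler sees the same prototype next to both the definition and the call, the stale two-argument call from the MIPS example could not have been built.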

In Other News

Our family became larger in January, and the resulting lack of free time means postings will be less frequent. Previously I'd aimed for two postings per month, but now I think I'll be lucky to manage just one. I wrote several articles in advance, thinking that would be sufficient, but alas it was not enough. The Gentle Reader might have noticed one of those prepared postings appear twice. I accidentally marked the Blogroll article for 1/2008, and blogger.com happily sent it out immediately with a publication date one year prior. It also went out with the corrected date. I seem unable to expunge the mistaken early posting; that article still shows up twice in Google Reader. Oh well.

Wednesday, January 7, 2009

Variable Scoping with gcc

Has this ever happened to you? You allocate some sort of resource which needs to be released before returning from the function. You can put the release at the end of the routine for the normal case, but there are a number of error checks which can cause it to return early. You consider calling the reclaim routine inside each error handler, but you're concerned that someone maintaining the code in the future will forget to do so. Instead you put all of the cleanup code at the end of the function, and have each error check use a goto.

Then of course someone argues very strenuously that goto is evil incarnate and must never be used, one thing leads to another, and then you have to find somewhere to hide the body ... wait, nevermind that last bit.

Consider this instead:

int foo()
{
    int fd LOCAL_SCOPE_FD = open("/path/to/file", O_RDONLY);

    if (error1()) {
        return -1;
    }

    return 0;
}

Though it might appear that I'm suggesting the file descriptor be leaked in order to simplify the code, that is not the case. The magic happens in LOCAL_SCOPE_FD:

#define LOCAL_SCOPE_FD __attribute__((cleanup(local_fd_close)))

void local_fd_close(int *fd)
{
    if (*fd >= 0) close(*fd);
}

__attribute__((cleanup)) is a gcc extension. When an automatic variable goes out of scope, the function indicated by the cleanup attribute will be called. If the scope is exited unusually, such as via longjmp() or by calling exit(), the cleanup function does not get called, but normal return statements or falling off the end of the block do work.

It also works within inner blocks. For example:

int foo()
{
    if (do_something) {
        int fd LOCAL_SCOPE_FD = open("/path/to/file", O_RDONLY);
    }
    /* local_fd_close will be called here. */

    ... more code ...
}

Cleanup can only be applied to automatic variables, i.e. variables on the stack declared within a function. It cannot be used with global or static variables. The cleanup attribute can be used with any variable type, not just integers. The cleanup function receives a pointer to the automatic variable being cleaned up.
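To illustrate the point about other variable types, here is a minimal sketch that applies the same attribute to a heap pointer. The names local_free, LOCAL_SCOPE_FREE, copy_len and the frees_seen counter are all hypothetical. The counter exists only to make the cleanup call observable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

int frees_seen;  /* demonstration only: counts cleanup invocations */

/* The cleanup function receives a pointer to the variable, so for a
 * char * variable it takes a char **. */
void local_free(char **p)
{
    free(*p);  /* free(NULL) is harmless if malloc failed */
    frees_seen++;
}

#define LOCAL_SCOPE_FREE __attribute__((cleanup(local_free)))

/* Copies s into a temporary buffer and returns its length. The buffer
 * is released on every return path without an explicit free(). */
int copy_len(const char *s)
{
    assert(s != NULL);

    char *buf LOCAL_SCOPE_FREE = malloc(strlen(s) + 1);

    if (buf == NULL) {
        return -1;
    }
    strcpy(buf, s);
    return (int)strlen(buf);  /* evaluated before the cleanup runs */
}
```

The return expression is computed first and the cleanup runs afterwards, so reading buf in the return statement is safe.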

I assume that if the Gentle Reader is reading this article, there are good reasons to use C in your problem space. Needless to say, if you want automatic resource reclamation a language other than C would provide more capabilities. Garbage collection and object finalizers are powerful constructs, given a problem space where they are appropriate.

Acknowledgements

Many thanks to Matt Peters for pointing out the __attribute__((cleanup)) capability to me.

Ian Lance Taylor recently wrote on a similar topic, about support for destructors and exceptions in C.

Updates

This article was picked up on reddit, with a few comments. One pertinent comment from erikd:

His local-fd-close() function has a bug, it needs to check the return value of close(), because close can return an error of EINTR which means the process received an interrupt and should retry the close operation.
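A minimal sketch of a cleanup function that does check the return value of close(). It deliberately reports the failure rather than retrying. POSIX leaves the state of the descriptor unspecified after EINTR, and the Linux close(2) man page advises against retrying because the descriptor has already been released, so whether a retry is appropriate depends on the platform. The names local_fd_close_checked, LOCAL_SCOPE_FD_CHECKED, probe_dev_null and the close_failures counter are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int close_failures;  /* demonstration only: counts failed close() calls */

void local_fd_close_checked(int *fd)
{
    assert(fd != NULL);

    if (*fd >= 0 && close(*fd) != 0) {
        perror("close");  /* report, but do not retry (see note above) */
        close_failures++;
    }
    *fd = -1;
}

#define LOCAL_SCOPE_FD_CHECKED \
    __attribute__((cleanup(local_fd_close_checked)))

/* Opens and immediately releases /dev/null; returns 0 if open worked. */
int probe_dev_null(void)
{
    int fd LOCAL_SCOPE_FD_CHECKED = open("/dev/null", O_RDONLY);

    return (fd >= 0) ? 0 : -1;
}
```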