Saturday, May 31, 2008

Random Musings on Embedded Filesystems

Getting the kernel to successfully boot on a new board is a wonderful feeling, which lasts about 5 seconds before the application developers begin clamoring for time on the box. For the application work to make progress, the application binaries have to be accessible in a filesystem somewhere. During platform bringup it is common to use NFS, because it will "just work" almost immediately after you get the platform to boot.
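For reference, an NFS root is typically selected with kernel command line arguments along these lines; the server address and export path here are placeholders, and the kernel needs NFS root and IP autoconfiguration support built in:

# Example kernel boot arguments for an NFS root filesystem
root=/dev/nfs nfsroot=192.168.1.1:/export/target-rootfs ip=dhcp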

I'd advise not relying on NFS for very long: using a server with gigabytes of disk means you're completely insulated from size constraints. You may find that the application footprint grows astronomically simply because nobody notices the pain. Adding library dependencies which pull in 20 Megs of DLLs is invisible on an NFS server, but a crisis when you discover very late in the process that it won't fit in flash.

Let's start by listing a few assumptions about the target:

  1. it does not have a hard disk (i.e. a spinning, magnetic disk)
  2. it has at least as much DRAM as flash, probably more.
  3. it uses a 32 or 64 bit processor with an MMU (PowerPC, MIPS, ARM, x86, etc)

If your platform is significantly different from these assumptions, the recommendations in this article may not be appropriate. The biggest one is the hard disk: if your system has one, don't use these recommendations - your filesystem choices should revolve around what the disk can do for you. If your system does not have an MMU some of these recommendations may apply, but you should look at using uClinux and its associated best practices.


 
A Note about the Buffer Cache

The Linux I/O system is designed for a general purpose computer, with a disk connected to an I/O interface. Recently accessed files are stored in memory in the buffer cache; if the files are accessed again they can be provided from cache instead of hitting the disk a second time. You may also come across references to a page cache; in current kernel versions the page cache and buffer cache have been merged into a single pool of memory.

Embedded systems generally don't have a disk; their filesystems reside in RAM. Nonetheless Linux still relies on the buffer cache. So even though a file resides in a ramdisk, it will be read into the buffer cache and from there copied to the process which requested it. The buffer cache doesn't add as much value in this setup (re-reading the filesystem would be quick), but it is essential to support Linux APIs like mmap(). The effect of the buffer cache will be described in a bit more detail later in the article.


 
Option #1: Squashfs and tmpfs

SquashFS is a compressed, read-only filesystem. A squashfs filesystem is constructed offline, compressing a directory structure into a binary blob which can be loaded into memory and mounted. As blocks are read from the squashfs they are decompressed into the buffer cache; if memory becomes tight the cache can toss pages out and re-read them from the squashfs as needed. Thus the memory footprint is kept fairly small: only files in active use are decompressed.
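Building the image on the host is a one-line affair. A minimal sketch, assuming the target root tree has been staged in ./rootfs and squashfs-tools is installed on the build machine:

# Compress the staged root tree into a single read-only image
mksquashfs rootfs/ rootfs.sqsh -noappend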

Squashfs is inherently read-only: it is not merely mounted read-only in a way that could be remounted read-write. The squashfs filesystem is assembled at compile time; once created it is immutable. This is generally acceptable for filesystems which shouldn't change anyway, like /bin or /lib, but is clearly not suitable for everything. There are configuration files like /etc/resolv.conf and /etc/timezone which probably need to be written occasionally, and numerous scratch files reside in /tmp. Therefore a second filesystem is needed: a writeable ramdisk.

tmpfs is a ramdisk, and is a good way to provide the writeable portions of the filesystem. Daniel Robbins wrote an excellent article about tmpfs, published at IBM DeveloperWorks a few years ago. Squashfs would be used for the root filesystem mounted on /, while the tmpfs would be mounted on a subdirectory like /rw. /etc, /var and /tmp are most easily provided as soft links to the tmpfs. For example:

# ls  -l /
dr-xr-xr-x    1 root     472 May 16 10:10 bin
dr-xr-xr-x    1 root     836 May 16 10:10 dev
lrwxrwxrwx    1 root       7 May 16 10:10 etc -> /rw/etc
dr-xr-xr-x    1 root     555 May 16 10:10 lib
dr-xr-xr-x   97 root       0 Dec 31  1969 proc
lrwxrwxrwx    1 root       8 May 16 10:10 root -> /rw/root
drwxrwxrwt    9 root     180 May 16 10:12 rw
dr-xr-xr-x    1 root     151 May 16 10:10 sbin
lrwxrwxrwx    1 root       7 May 16 10:10 tmp -> /rw/tmp
dr-xr-xr-x    1 root      23 May 16 10:09 usr
lrwxrwxrwx    1 root       7 May 16 10:10 var -> /rw/var
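The softlinks themselves are created in the staging tree at build time. A sketch, again assuming the tree is staged in ./rootfs:

# Point the writable directories at the tmpfs which will be mounted on /rw
cd rootfs
ln -s /rw/etc  etc
ln -s /rw/root root
ln -s /rw/tmp  tmp
ln -s /rw/var  var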

At boot, the init script would mount the tmpfs filesystem and create the directories which the softlinks point to. You can specify a maximum size when mounting the tmpfs, 8 megabytes in this example:

if mount -t tmpfs -o size=8m tmpfs /rw >/dev/null 2>&1; then
    mkdir /rw/tmp /rw/var /rw/etc /rw/root
else
    echo "tmpfs mount failed" >&2    # Halt and catch fire
    exit 1
fi

Squashfs has not been accepted into the mainline kernel, so you will need to download patches from the project website. There is also a LZMA-enhanced version of Squashfs, though I have not personally used this version. LZMA appears to obtain better compression ratios than the zlib which Squashfs normally uses.

For the root directory ("/"), the kernel has to be able to recognize and mount the filesystem without relying on any external mount program. This is a chicken and egg problem: there is no way to run a mount program until there is a filesystem mounted. To use squashfs for the root filesystem, linux/init/do_mounts.c must be modified to recognize its magic number. A patch is available, if your kernel source does not already have this handling. The patch also handles a cramfs root filesystem.


 
Option #2: cramfs instead of squashfs

cramfs is an older compressed, read-only filesystem for Linux, developed by Linus Torvalds. Cramfs works quite well and is included in the kernel sources. However squashfs typically achieves better compression, because it compresses using larger block sizes with zlib. In one project I worked on, a 20 Meg cramfs filesystem shrank to about 18 Megs with squashfs.
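The build step uses mkcramfs, in much the same way as the mksquashfs example above (again assuming a staged ./rootfs):

# Compress the staged root tree into a cramfs image
mkcramfs rootfs/ rootfs.cramfs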

Like squashfs, cramfs must be paired with a second, writable ramdisk to be useful.


 
Option #3: ext2 instead of tmpfs

ext2 can be used to provide a writeable ramdisk, though there are relatively few reasons to prefer it over tmpfs for this purpose. Older tmpfs releases did not support certain filesystem features like sendfile(), but this is not an issue for most applications.

Nonetheless, if there is a reason to do so, an ext2 filesystem can be created in a /dev/ram# device and mounted as a ramdisk. ext2 has to be formatted before it can be used, which would normally mean bundling mke2fs into your target image. However there is another way to provide an ext2 filesystem, one with a far smaller footprint than mke2fs: empty, zero-filled ext2 filesystems compress extremely well with bzip2. You can create a filesystem of the appropriate size while compiling the target image, by running commands on your build system:

# Create an 8 MB file of zeros and build an ext2 filesystem inside it
dd if=/dev/zero of=/tmp/e2file bs=1024 count=8192
sudo losetup /dev/loop7 /tmp/e2file
sudo mke2fs -b 4096 -m 1 -i 16384 /dev/loop7
# Disable the periodic "check after N mounts" fsck
sudo tune2fs -c -1 /dev/loop7
sudo losetup -d /dev/loop7
# Compress the (mostly empty) image; the zeros compress extremely well
bzip2 -9 /tmp/e2file

The bs and count arguments to dd specify the size of the file to create, filled with zeros. We use a /dev/loop device to create a new ext2 filesystem in this file, and then compress it. The result should be a couple hundred bytes in size, far smaller than bundling mke2fs would be. The resulting e2file.bz2 is copied into the target filesystem and mounted by the boot scripts like so:

# Unpack the empty ext2 image into a ramdisk device and mount it
bunzip2 -c /e2file.bz2 >/dev/ram2 2>/dev/null
if mount -t ext2 /dev/ram2 /rw >/dev/null 2>&1; then
    mkdir /rw/tmp /rw/var /rw/etc /rw/root
else
    echo "ext2 ramdisk mount failed" >&2    # Halt and catch fire
    exit 1
fi

 
Option #4: JFFS2 instead of tmpfs

In the discussion of the previous alternatives the read-only portion of the filesystem would be compressed in memory using squashfs or cramfs, but the writable portion would be stored uncompressed. If your system requires a large amount of ramdisk space but your memory is constrained, the natural solution is to look for a way to compress it.

If the applications generating this data can be modified to use zlib and compress their own output, that is probably the best way to proceed. If you cannot modify the apps, there are a couple ways to get compression at the filesystem layer. The only natively compressed, writable filesystem for Linux I know of is JFFS2, which is not designed to be a ramdisk but can be pressed into service if necessary using the mtdram.o module (which exists to ease debugging of MTD applications). The vast majority of the complexity in JFFS2, for handling erase blocks and wear leveling and all of the other nuances of flash chips, is wasted when used as a ramdisk, but it does provide compression. The boot scripts would work like so:

# Create an 8 MB RAM-backed MTD device (sizes are in KiB) and mount it as JFFS2
/sbin/insmod -q /lib/modules/mtdram.o total_size=8192 erase_size=16
if mount -n -t jffs2 -o umask=0666 /dev/mtdblocka /rw >/dev/null 2>&1; then
    mkdir /rw/tmp /rw/var /rw/etc /rw/root
else
    echo "jffs2 ramdisk mount failed" >&2    # Halt and catch fire
    exit 1
fi

JFFS2 consumes a great deal more CPU time than other filesystems, due to the compression, so throughput to a compressed ramdisk will be relatively low. On one platform I worked on, a tmpfs ramdisk could handle writes at about 40 MBytes/sec while the JFFS2 ramdisk managed only 1 MByte/sec.

Note that this technique does not save as much memory as you might think. Whenever blocks are accessed they are decompressed by JFFS2 into the buffer cache, then copied up to the application. Likewise written blocks are held uncompressed in the buffer cache. If your system touches a large amount of data in the ramdisk, the footprint of the buffer cache will become more significant than the backing filesystem in JFFS2. This is another advantage of modifying applications to use zlib: the data remains compressed in the buffer cache, and is only decompressed within the application itself.
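As a trivial illustration of application-level compression, bulky output can be piped through gzip (which uses the same deflate algorithm as zlib) before it ever lands in the ramdisk; the daemon name here is purely hypothetical:

# Hypothetical example: keep a verbose log compressed in the tmpfs
some_chatty_daemon 2>&1 | gzip -c > /rw/var/log/messages.gz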

There are other ways to implement a compressed, writable filesystem, but I have not used them myself and can't add anything pithy about them. Some links:

  • cloop is a compressed block loopback device, allowing any filesystem to be mounted atop it
  • e2compr patches compression support into ext2/ext3. Development was moribund for a long time, and it is not clear to me how stable these patches are.

 
Option #5: ext2 for everything

In large measure ext2 and the related ext3 are the "default" filesystems for Linux. When building a Linux desktop or server ext[23] is a reasonable choice, and the one generally chosen barring a good reason to do otherwise. This notion of ext2 as the default often carries over into embedded systems as well, though there is far less advantage in doing so. It is certainly possible to create a large ext2 ramdisk (using the /dev/loop technique shown above) and use it both to store the application binaries and to provide scratch space for /tmp et al. This does have the appeal of requiring only one filesystem, rather than the combinations recommended earlier.

The memory footprint of an ext2 ramdisk is always going to be larger than with the options described above. It is common practice to create the ext2 filesystem in a file, gzip it, and compile the gzipped image into the target image. This reduces the image size, but at boot the kernel will decompress the entire filesystem into memory. So if you have a 40 Meg ext2 which gzips down to 10 Megs, it will only add 10 Megs to the image size but will expand to 40 Megs of RAM on the target. Compare this to squashfs, where a 10 Meg filesystem adds 10 Megs to the image size and also consumes 10 Megs of RAM on the target. The buffer cache does not perturb this result: with either ext2 or squashfs, any files in active use will be present in the buffer cache, and that portion of the footprint is the same in both cases.
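A sketch of that build step, reusing the /dev/loop technique from earlier; the 40 Meg size, loop device number and mount point are placeholders, and the populated tree is again assumed to be in ./rootfs:

# Create and populate a 40 MB ext2 image, then gzip it for the target image
dd if=/dev/zero of=rootfs.ext2 bs=1M count=40
sudo losetup /dev/loop7 rootfs.ext2
sudo mke2fs -m 0 /dev/loop7
sudo mount /dev/loop7 /mnt
sudo cp -a rootfs/. /mnt/
sudo umount /mnt
sudo losetup -d /dev/loop7
gzip -9 rootfs.ext2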


 
Postscript

This turned into an exceedingly long post. Early drafts were even longer, and much material got left on the cutting room floor. I'll collect more random musings on embedded filesystems into a future post.


 
Updates

A comment from Lance pointed to a discussion on LKML about extending cramfs to be writable. Changes made to files would be stored in the page cache, and as far as I can tell would be stored in uncompressed form. The same technique could presumably be applied to squashfs. This would make things a bit simpler for those who need a small amount of writable ramdisk.

Tom Ziomek notes that SquashFS was accepted into the mainline kernel in 2.6.29. Scroll down to see Tom's comment; since this article appeared the blog has switched to Disqus, so the older comments appear first, followed by the more recent comments in a separate Disqus section.

Thursday, May 8, 2008

The High-Level CPU Response

This time, Gentle Readers, we will explore something completely different. It is still related to embedded systems, for a suitably generous definition of "embedded." Several weeks ago I read a posting by Yossi Kreinin regarding CPU architectures optimized for very high level languages, plus a followup on the same topic.

Yossi said:
Do you think high-level languages would run fast if the stock hardware weren't "brain-damaged"/"built to run C"/"a von Neumann machine (instead of some other wonderful thing)"? You do think so? I have a challenge for you. I bet you'll be interested.
...
My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it?


Really, the challenge would not normally appeal to me:

  • I generally don't spend time on high-level languages. I make my living in the guts of the system writing assembly and C code, where Lisp/Haskell/Erlang/<ShinyNewLanguage>/etc do not play.
  • I like C. Really, I do. I like assembly coding too. Really, I do.
  • I think von Neumann architectures are fine.

However in the past few years I've worked on three different ASICs involving custom embedded CPUs for the datapath (more accurately: two CPUs and one glorified state machine), and Yossi's challenge is interesting in a contemporary problem space. So I'll give it a whirl. Be warned, Gentle Readers, that it will probably suck. You have been warned.


The instruction set

In 2008, I believe the instruction set is the wrong place to look for these performance gains. The overwhelming effort put into optimizing the performance of current CPU architectures represents an enormous barrier to entry for a new instruction set. Intel itself has not made much headway with Itanium, unable to successfully compete with its own x86 products. A decade ago RISC was hailed as the superior alternative, but x86 now stands over the dead bodies of several of its erstwhile RISC competitors: Alpha and PA-RISC, for a start. Modern CPU instruction sets are pragmatically extended with new opcodes as needed, as witnessed by the sheer number of times Intel and AMD have added more vector instructions to their x86 implementations.

Also (and crucially) CPU architectures so specialized as to only run code from a single programming language tend not to survive. There is a reason that Lisp Machines and the P-system CPUs no longer exist: they weren't terribly adaptable.

So I won't be suggesting a radically new instruction set.


Exploring the problem space

As I see it, there are three standout features of various high level languages which one might attempt to optimize for in the CPU design:

  1. dynamic typing: type checks are performed at run time, not at compile time as in C or Java
  2. stackless languages: like Erlang, to support astonishing numbers of threads
  3. functional languages: functions have no side effects and there is no global state. The complete state a function operates on will be passed to it, by value not by reference

For #1, Dynamic Typing: I don't think pushing type information down to be directly visible to the CPU is really productive. I'm quite likely wrong in this (there have certainly been a large number of processors over the years with tagged arithmetic operations), but I'm also not really qualified to comment further. So I won't.

For #2, Stacklessness: Many CPU designs have precious little hardware dedicated to "the stack." For example, in MIPS32 by convention two registers are used to store sp and fp, but they are really just normal registers and the compiler uses regular ADD/MOV/etc instructions to adjust them. Other CPU families may have more hardware supporting the concept of a stack, but I just don't think there is much to be gained by removing it. There is also a whole lot to be lost, if it is no longer possible to run stack-based languages like C.

For #3, Functional Languages: I believe this is the topic which has the most traction, and it is where we'll focus for the rest of this writeup. Passing a large footprint of arguments by value implies a great deal of copying. Of course the runtimes for these languages make efforts to avoid unnecessary copies, but perhaps modifications to the CPU can help the runtime perform this function more efficiently.


Avoiding unnecessary copies

Though the programming model for a functional language may guarantee each function call receives its own independent copy of all arguments, in implementation the runtime will avoid copying large amounts of data unnecessarily. If a function only reads and never writes an argument, it can be allowed to reference the parent's copy of the argument. The crucial bit, of course, is ensuring that there is no code path which results in writing to the parent data.

If the functional language compiles down to machine instructions, then any possible code path which results in modification of an argument makes it unsafe to reference the parent data. Even if the language is implemented by a VM, there is overhead in checking each store bytecode at runtime to ensure its target is allowed.

In other problem spaces there is a well-understood technique for optimistically allowing a child a read-only reference to the parent's data: Copy On Write. The relevant pages of memory will be mapped read-only. If a process attempts to write to this page it will trap into the operating system, which will make a private copy of the page for the process to modify. If Copy on Write were available, the functional language runtime would have another option available.

The trouble with applying existing Copy on Write mechanisms to functional languages is the sheer frequency of mappings. The current uses of Copy on Write are infrequent: process creation, periodic (but rare) snapshots of data, etc. The overhead of calling into the Operating System to change page tables is acceptable for these uses. The overhead of calling into the OS to change page tables for every function call would likely be prohibitive.


Overview of the proposal

"There is no problem in computer programming which cannot be solved by an added level of indirection." -- Dr Maurice Wilkes
64 bit CPUs are now the norm in the general purpose computing market. This is a truly staggering amount of virtual address space. If the application code could manage a section of its own address space with very low overhead, it could create and destroy mappings as needed. For cases where it makes sense, the language runtime could set up a read-only alias mapping of the parent data, and pass the alias address to the function. If the function attempts to modify its arguments there would be a trap, and a copy could be made.

Before delving further, let's examine when it makes sense to do this:

  • if the function cannot possibly write to its arguments, and the runtime can be certain of this, there is no reason to create an aliased mapping. Just pass in the parent's data.
  • if one or more of the arguments to the function are relatively small, just copy them. The brute force solution is cheaper than trying to finesse it.
  • if the function might modify its arguments and they are large, we have a candidate for an aliased argument.

Aliasing TLB

What would the hardware need to provide, the Gentle Reader might ask? Nothing much really, just a second MMU TLB. Simple. OK, I kid, that's actually an enormous burden to place on the hardware, but I think it might be worthwhile. This is how I think it would work:

  1. I'll refer to the new TLB as the Aliasing TLB, which is looked up first. I'll refer to the existing MMU as the System TLB.
  2. The CPU has a pair of limit registers, giving an upper and lower bound of addresses to be looked up in the Aliasing TLB. Most processes would probably not use the Aliasing TLB at all, and the limit registers would be zero.
  3. Applications using this mechanism would be given a section of their address space to manage - an exabyte should do. Whenever there is a context switch, system code (i.e. the kernel) updates the limit registers for the new running process.
  4. Memory operations whose target is within the range of the limit registers would be looked up in the Aliasing TLB. We'll come back to the miss case later, for now assume an entry is found.
  5. The result of the Aliasing TLB lookup is a mapping to another virtual address, plus permission bits.

If the permissions allow the operation, the CPU now has a new virtual address to use as the target of the memory operation. This new virtual address (labelled Va in the diagram) proceeds down the "traditional" CPU MMU path, searching the System TLB to find the physical address. If the original target address was not within the range of the limit registers, then the unmodified virtual address goes to the System TLB to be translated to a physical address.


Handling a miss

Everything is fine and dandy if the CPU finds a matching entry in the Aliasing TLB, but what if it doesn't? I'm very fond of software-managed TLBs like MIPS and Sparc v9, and had initially written up a section on how the CPU could directly vector over to a TLB miss handler in the application's address space. The handler would need to be locked into memory so it could never be paged out, and immediately switch to a pre-allocated stack specifically for TLB handling, and ... yeah, it was really complicated.

On further reflection, I don't think it needs to be that complicated. There are two characteristics of the Aliasing TLB which are different from the System TLB: entries are added to the Aliasing TLB immediately before they are used, and the entries only live for the span of one function call - generally speaking, a very short lifespan.

When the OS adds a new entry to the page tables, the new entry is generally not immediately added to the System TLB; instead it will be added the first time the TLB faults on the new address. With the Aliasing TLB, when a new entry is added it should immediately be pushed into the hardware before resuming processing. We're expecting to immediately hit that address, and the number of Aliasing TLB misses would be dramatically reduced by doing so. Perhaps the misses which remain can take a more normal path of trapping into the OS, which would perform an upcall to the application to service the Aliasing TLB miss.


Page sizes

Function arguments can be of any size and need not be aligned to any particular boundary. An earlier draft of this posting proposed segmented memory support, to handle arbitrary argument sizes. On further reflection, I don't think it needs to be that complicated. The Aliasing TLB mechanism only makes sense for relatively large arguments, and I think it is a reasonable tradeoff to require the runtime to pad such arguments to an even multiple of the page size.

I do think it is important to support multiple page sizes, with the normal page size of the System TLB as a lower bound and truly enormous pages at the upper end. CPU MMUs have supported multiple page sizes for a long time, though Operating Systems generally only use one size for all memory pages. This is to avoid the complexity of fragmentation if one tried to actively page out memory with multiple page sizes in use. Some OSes do use large pages to map in things which can never be paged out, like the framebuffer.

Page replacement is not a concern for the Aliasing TLB: it maps one virtual address to another, so there is no concept of removing a page from memory. If we have a 128 Megabyte argument to map, we might as well use a 128 Megabyte page to map it. Assembling a bunch of small mappings to handle a huge region is only beneficial in very specific cases, so the software should take advantage of multiple page sizes in the Aliasing TLB.


Miscellaneous notes and drawbacks

  • The most serious downside of the Aliasing TLB is the latency it adds to instruction processing. To hide that latency the CPU will need additional pipeline stages, possibly many of them. Nobody wants a repeat of the superpipelining of the Pentium IV; the additional pipelining may be a fatal drawback of this idea.
  • We'll need Address Space Identifiers (ASID) for the Aliasing TLB. The OS must populate the ASID of the currently running process in a CPU register, and any mappings created by the user code must inherit this ASID. If we allow application code to supply the ASID we'll destroy the security of the system: it could set up nearly arbitrary mappings in another process.
  • The mapped addresses produced by the Aliasing TLB are always looked up in the System TLB, never recursively looked up in the Aliasing TLB, even if they fall within the specified range of addresses. I do not believe multiple levels of indirection add any capability to the mechanism, and the hardware complexity in the pipelining would be prohibitive.
  • I don't think it is reasonable to expect application code to deal with page coloring, so a CPU implementing an Aliasing TLB should use physically indexed caches (or some other solution which does not burden the software with coloring). This is probably not an issue; I believe page coloring has not been required by any common processor for about ten years. A long time ago I worked on software for a Sparc processor which required it, and I still bear the scars.

The disclaimer

Would an Aliasing TLB really boost the performance of a functional language runtime, and by how much? I've no idea. That is the sort of question best answered by an analysis of the behavior of existing functional language runtimes. Analyzing instruction and memory traces is the seventh most boring task in all of computer science, which means it is very, very boring. Since I'm writing this for a blog post, I choose to skip the boring and tedious stuff in favor of emphatic handwaving.

I'm not sure I've actually addressed the challenge Yossi issued. He was pretty clearly seeking CPU instruction set ideas, and I haven't provided any. Also the challenge was issued in late January and subsequent comments indicate it is pretty much wrapped up, so I'm late to the party. Nonetheless it was interesting to write this up, so to me the time was well spent.