I once worked on a system where one module was developed outside on Linux/x86 systems, brought in-house, and compiled for Linux/PowerPC. We thought we had been careful in the specifications: avoid endianness assumptions, limit the memory footprint, and assume a hefty derating for the slower PowerPC used in the real system. Things looked good in initial testing, but when we started internal dogfooding, PowerPC performance dropped off the proverbial cliff. An operation that took 100 msec on the x86 development system and 300 msec during initial PowerPC testing regressed to an astonishing 45 seconds in the dogfood deployment.
The cause of this disparity was the data cache. For reasons that were never clear, this code iterated through its configuration many, many times. On x86 the various levels of D$ comprise several megabytes, but the PowerPC had only 16 KB. As the dogfooding progressed and the configuration grew, the working set overflowed the cache, and the resulting thrashing cost us two and a half orders of magnitude of performance.
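The effect is easy to reproduce on any machine where the data outgrows the cache. The sketch below is a hypothetical microbenchmark, not code from that system: it walks a "configuration" array over and over and reports the per-element cost, once at a size that fits a 16 KB D-cache and once at a size that does not. All of the names, sizes, and pass counts are illustrative.

```c
/*
 * Hypothetical microbenchmark of the cache-thrashing effect described above.
 * Sizes and names are illustrative, not taken from the original system.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk the array `passes` times and return the cost in ns per element visited. */
static double walk_config(const int *config, size_t n, int passes)
{
    volatile long sum = 0;   /* volatile keeps the compiler from deleting the loop */
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += config[i];
    clock_gettime(CLOCK_MONOTONIC, &end);

    double sec = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    return sec * 1e9 / ((double)n * passes);
}

int main(void)
{
    const size_t small_n = 2 * 1024;        /*  8 KB of ints: fits a 16 KB D-cache */
    const size_t large_n = 4 * 1024 * 1024; /* 16 MB of ints: thrashes it badly    */

    int *small_cfg = malloc(small_n * sizeof *small_cfg);
    int *large_cfg = malloc(large_n * sizeof *large_cfg);
    if (!small_cfg || !large_cfg)
        return 1;
    for (size_t i = 0; i < small_n; i++) small_cfg[i] = (int)i;
    for (size_t i = 0; i < large_n; i++) large_cfg[i] = (int)i;

    /* Pass counts chosen so both runs touch roughly the same number of elements;
     * only the working-set size differs. */
    printf("small config: %.2f ns/element\n", walk_config(small_cfg, small_n, 100000));
    printf("large config: %.2f ns/element\n", walk_config(large_cfg, large_n, 50));

    free(small_cfg);
    free(large_cfg);
    return 0;
}
```

On a machine with a tiny D-cache, the second number climbs steeply as the array outgrows each level of cache, which is exactly the behavior we were seeing as the configuration grew.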
Several years ago Ulrich Drepper wrote an excellent paper, What Every Programmer Should Know About Memory, covering all things related to memory in modern system architectures: especially x86, but relevant everywhere. It is a long read, but very worthwhile. The complete paper is available as a PDF from his site, and it was also serialized as a set of articles on LWN:
- Introduction
- CPU caches
- Virtual memory
- NUMA systems - local versus remote references
- What programmers can do - cache optimization
- What programmers can do - multi-threaded optimizations
- Memory performance tools
- Future technologies
- Appendices and bibliography
I downloaded the PDF and read it over the course of a few weeks. I strongly recommend this paper; the information content is very high.