Benchmarks are a crock

With modern superscalar architectures, 5-level memory hierarchies, and wide data paths, changing the alignment of instructions and data can easily alter the performance of a program by 20% or more, and Hans Boehm has witnessed a spectacular 100% variation in user CPU time while holding the executable file constant. Since much of this alignment is determined by the linker, loader, and garbage collector, most individual compiler optimizations are in the noise. To evaluate a compiler properly, one must often look at the code that it generates, not the timings.
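
One practical defense is to report the spread of repeated runs rather than a single number. The following sketch assumes an R7RS Scheme that provides the (scheme time) library; the procedure names and the toy loop are ours, purely for illustration:

    ;; Minimal timing sketch (assumes an R7RS system with (scheme time)).
    ;; Runs the same thunk several times and reports min/max, since
    ;; run-to-run variation from alignment, GC, and loader effects can
    ;; swamp any single measurement.
    (import (scheme base) (scheme time) (scheme write))

    (define (time-thunk thunk)
      (let ((start (current-jiffy)))
        (thunk)
        (/ (- (current-jiffy) start)
           (inexact (jiffies-per-second)))))

    (define (report-spread name thunk n)
      (let loop ((i 0) (times '()))
        (if (< i n)
            (loop (+ i 1) (cons (time-thunk thunk) times))
            (begin
              (display name) (display ": min ")
              (display (apply min times))
              (display "s, max ")
              (display (apply max times))
              (display "s over ")
              (display n)
              (display " runs")
              (newline)))))

    ;; Example: time a toy loop ten times and look at the spread.
    (report-spread "loop"
                   (lambda () (do ((i 0 (+ i 1))) ((= i 1000000))))
                   10)

If the minimum and maximum differ by tens of percent, a single-run comparison between two compilers means very little.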

Our benchmarks may not be representative

Many of our benchmarks test only a few aspects of performance. Such benchmarks are good if your goal is to learn what an implementation does well or not so well, which is our main concern. Such benchmarks are not so good if your goal is to predict how well an implementation will perform on "typical" Scheme programs.

Some of our benchmarks are derived from the computational kernels of real programs, or contain modules that are derived from or known to perform like the computational kernels of real programs: fft, nucleic, ray, simplex, compiler, conform, dynamic, earley, maze, parsing, peval, scheme, slatex, nboyer, sboyer. These benchmarks are not so good for determining what an implementation does well or less well, because it may be hard to determine the reasons for an unusually fast or slow timing. If one of these benchmarks is similar to the programs that matter to you, however, then it may be a good predictor of performance. On the other hand, it may not be.

Real programs may not be representative either

The execution time of a program is often dominated by the time spent in very small pieces of code. If an optimizing compiler happens to do a particularly good job of optimizing these hot spots, then the program will run quickly. If a compiler happens to do an unusually poor job of optimizing one or more of these hot spots, then the program will run slowly. For example, compare takl with ntakl, or nboyer with sboyer.
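
To make the point concrete, here is a sketch in the spirit of takl, simplified from the Gabriel benchmark and not the exact code of either takl or ntakl. Essentially all of the work lands in the shorterp predicate, so how a compiler handles that one small procedure largely determines the timing; the takl/ntakl comparison turns on just such a small difference in a hot spot.

    ;; A sketch in the spirit of takl (simplified; not the exact
    ;; benchmark code). The mas function does almost nothing but call
    ;; shorterp, so shorterp is the hot spot: a small change there can
    ;; move the whole timing.
    (define (listn n)
      (if (= n 0)
          '()
          (cons n (listn (- n 1)))))

    (define (shorterp x y)     ; hot spot: is list x shorter than list y?
      (and (not (null? y))
           (or (null? x)
               (shorterp (cdr x) (cdr y)))))

    (define (mas x y z)
      (if (not (shorterp y x))
          z
          (mas (mas (cdr x) y z)
               (mas (cdr y) z x)
               (mas (cdr z) x y))))

    ;; Example call, small enough to run quickly:
    ;; (mas (listn 18) (listn 12) (listn 6))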

If the hot spots occur within library routines, then a compiler may not affect the performance of the program very much; the program's performance may be determined largely by those library routines. For example, consider the performance of gcc on the diviter or perm9 benchmarks.
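
The diviter kernel shows why. The sketch below is simplified from the Gabriel-derived benchmark: nearly every cycle goes into cons, that is, into the allocator and garbage collector, so when such code is translated into C the timing mostly reflects the malloc or GC library rather than the code gcc generates.

    ;; Simplified from the diviter benchmark: an allocation-dominated
    ;; kernel. Nearly all of the time goes into cons, i.e. into the
    ;; allocator and collector, so the quality of the compiled loop
    ;; code barely matters.
    (define (create-n n)
      (let loop ((n n) (a '()))
        (if (= n 0)
            a
            (loop (- n 1) (cons '() a)))))

    (define (iterative-div2 l)  ; halve the length of l, consing a new list
      (let loop ((l l) (a '()))
        (if (null? l)
            a
            (loop (cddr l) (cons (car l) a)))))

    ;; Example: halve a 200-element list.
    ;; (iterative-div2 (create-n 200))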

The performance of a benchmark, even if it is derived from a real program, may not help to predict the performance of similar programs that have different hot spots.

A note on C and C++

It is well known that C and C++ are faster than any higher-order or garbage-collected language. If some benchmark suggests otherwise, then this merely shows that the author of that benchmark does not know how to write efficient C code.

As an example of C code that is much faster than anything that could be written in Scheme, I recommend

Andrew W. Appel. Intensional equality ;-) for continuations. ACM SIGPLAN Notices 31(2), February 1996, pages 55-57.

Last updated 26 December 2007.