
I know I said compilers aren't magic. I lied. They do contain some magic, and here it is.

A compiler is just a data-processing program, written in some computer language. However, the data is special: the input is a program in some programming language, and the output is something consumable by the hardware (or a VM). So, if you write the compiler in the language it compiles, it can compile itself -- woah! Recursion!

Writing the compiler in its own language is very powerful. Let's consider the bit of a C++ compiler that processes string literals. String literal processing involves scanning the source text and converting the occasional escape sequence to the character it represents. That code might look a bit like:

char c = *src++;
if (c == '\\')
  switch (*src++)
  {
    case 'n': c = '\n'; break; // Magic!
    // ... other cases
  }

That looks fine. If I see a backslash, I look at the next character, and if it's one of several special cases (such as 'n') I replace it with the character it escapes (here, newline). But ask yourself: where is the knowledge of the numeric value of that escape? Where is the fact that a newline character is 0xa? Nothing in that source says that -- the target character is written as '\n'!

The knowledge is in the compiler binary. In there, there's a move reg,0xa instruction that the above assignment compiled to. That knowledge is propagated from one instance of the compiler binary to the next via the above bit of source code.
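
You can even ask the compiler to show its hand (a minimal check, using C++11's static_assert):

static_assert('\n' == 0x0a, "the newline value rides in the binary");

That assertion compiles cleanly precisely because the compiler binary is carrying the value around for us.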

I find this magic.

Writing the first compiler

When string escaping was first added, the compiler source couldn't say '\n' as the output value, because the older instance of itself wouldn't understand that. The source of the compiler would have to explicitly say c = 0xa; at that point. We can then compile the new compiler with the old compiler. After that point we can change the source to say c = '\n';, and so long as we always use a post-escape-built compiler to compile ourselves with, we're golden.
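
Concretely, the two steps might look like this, inside the literal-scanning switch from the snippet above (a sketch, not any particular compiler's source):

// Step 1: implement the escape without using it, so the old
// compiler can still build this source.
case 'n': c = 0xa; break;

// Step 2: once we're always built by a compiler that knows '\n',
// the source can use the very feature it implements.
case 'n': c = '\n'; break;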

This is how compilers are extended. We write the new feature using the existing features, then we can modify the compiler to use the new feature in itself. This happened in GCC with the conversion from C to C++ as the implementation language. First we had to convert the source to be written in the common subset of C & C++, or hide the differences behind macros. Then the C++ compiler could compile itself. Then we got rid of the common subset, writing new stuff in C++. The scar tissue is still there: the compiler as a whole is a strange mix of C-ish C++, with the amount of C-ishness reflecting the age of that bit of the compiler.
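
The macro trick looked something like this well-worn pattern (illustrative only, not GCC's actual headers; fancy_new_pass is a made-up name):

/* Let both a C compiler and a C++ compiler include this header.  */
#ifdef __cplusplus
extern "C" {
#endif

void fancy_new_pass (void);

#ifdef __cplusplus
}
#endif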

Bootstrapping

Bootstrapping the compiler is the process of delivering a new improved compiler, given an existing compiler executable and the source to the new compiler. We don't just compile the new source with the old compiler and deliver that. There are two issues:

  1. we want the new compiler binary to be optimized using the new optimizations in its source
  2. we want to make sure the new compiler is not miscompiling itself (that would be very bad, considering we're going to use that to build the next compiler)

So we compile the compiler 3 times:

  1. compile it unoptimized with the old compiler. This means we're not at the mercy of the old compiler's optimizer bugs (other bugs, maybe). This generates a stage-1 compiler, which can optimize but runs slowly.
  2. compile it optimized with the stage-1 compiler. This produces the stage-2 compiler, an optimizing compiler that runs fast.
  3. compile it again optimized with the stage-2 compiler. This stage-3 compiler should be the exact same binary as the previous stage.

If the stage-3 and stage-2 compilers are different, there's a bug. Fix it! The whole dance looks something like the sketch below. Things get even more exciting with profile-driven self-optimization, or cross-compiling.
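
Here's the shape of that dance as a toy driver (the compiler and file names are hypothetical; real builds, like GCC's make bootstrap and make compare, do this in the build system):

#include <cstdlib>
#include <fstream>
#include <iterator>
#include <iostream>
#include <string>

// Byte-for-byte comparison of two files.
static bool identical(const std::string& a, const std::string& b)
{
  std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(fa), {})
      == std::string(std::istreambuf_iterator<char>(fb), {});
}

int main()
{
  // Stage 1: old compiler, no optimization -- dodge its optimizer bugs.
  std::system("oldcc -O0 -o stage1 compiler.cc");
  // Stage 2: the new compiler optimizes itself.
  std::system("./stage1 -O2 -o stage2 compiler.cc");
  // Stage 3: the optimized compiler rebuilds itself.
  std::system("./stage2 -O2 -o stage3 compiler.cc");
  if (!identical("stage2", "stage3"))
  {
    std::cerr << "stage2 and stage3 differ: the new compiler miscompiles itself\n";
    return 1;
  }
  return 0;
}

(Real bootstraps also have to scrub timestamps and other nondeterminism out of the binaries before comparing.)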

An IBM compiler was once modified to improve the performance of the floating-point routines.[1] Unfortunately that change had a bug, such that the literal 1.0 generated a floating-point value that wasn't quite 1.0 (or some similarly important constant). The bug was such that whenever the compiler compiled itself, the resulting compiler's value of 1.0 was slightly worse than the previous compiler's. It wasn't until some time later that this was noticed. Oops. Apparently it was an exciting task to fix that bug! (You can't simply fix the bad routine, because the compiler you have to compile with is itself broken and will propagate its brokenness. You have to find a known-good binary to use.)

See also Ken Thompson's Reflections on Trusting Trust talk.

Gödel's Incompleteness Theorem

Gödel's incompleteness theorem proves that in any sufficiently powerful formal system (one that can express basic arithmetic) there will be statements that the system can neither prove nor disprove. To get there, Gödel developed Gödel numbering, which is simply a way of encoding formulae as integers. But wait: a computer program is a sequence of bits, which can be interpreted as a (very large) integer. A compiler binary is merely a special number that, when fed its source, produces itself!
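
To make "a program is a number" concrete, here's a toy that folds a file's bytes into one integer. It overflows on anything bigger than eight bytes -- a real binary is a number far too large for any machine word, which is rather the point:

#include <fstream>
#include <iostream>

int main(int argc, char** argv)
{
  if (argc != 2)
  {
    std::cerr << "usage: " << argv[0] << " FILE\n";
    return 1;
  }
  std::ifstream f(argv[1], std::ios::binary);
  unsigned long long n = 0;  // a real binary needs a bignum, not 64 bits
  char c;
  while (f.get(c))
    n = n * 256 + static_cast<unsigned char>(c);  // append the next "digit"
  std::cout << n << '\n';
  return 0;
}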


  1. This may be apocryphal, reference welcome.