From Quasi Paragon

Compilers are complex beasts, but fundamentally they're a sequence of transformations. That's somewhat clear from the way I segmented the Compilers page into front-end (FE), middle-end (ME) and back-end (BE) components. Each of those performs quite different operations. Those stages themselves are part of a larger pipeline.

Your source code gets fed into the toolchain and it squirts out an executable program. The compiler proper usually finishes by generating either assembly code or binary code, depending on the toolchain. When you invoke the compiler:

g++ hello.cc -o hello

what you're actually invoking is a compiler driver program. That's essentially a shell that sets up a process pipeline to do the work. Those processes may communicate via actual pipes, but usually it's via files. The pieces the driver can invoke are:

  • The compiler proper, cc1plus. This is the guts of the compilation, and the main focus of what I'm talking about. There are different compilers for different languages: cc1 for C, f951 for Fortran, etc. (the 1 suffix is from the original K&R days). When I say 'compiler', this is what I'm talking about.
  • The assembler, as. If the compiler spits out textual assembler, it'll need assembling to machine code.
  • The linker, ld. This takes a bunch of object files and libraries and produces an executable (or dynamic library). It does this by resolving relocations, which are references from one object file to something defined in another object file.

You can inspect what the compiler driver is doing by adding the -v flag:

g++ -v hello.cc -o hello
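
To make the driver's job concrete, here is a minimal sketch of the plan it builds for the hello.cc example: one command per stage, each handing a file to the next. The function name plan_pipeline and the intermediate file names are assumptions made for illustration. The real driver uses temporary files, passes many more options, and adds startup files and libraries at link time, so don't expect -v output to look this tidy.

```python
from pathlib import Path


def plan_pipeline(source, output):
    """Return a simplified list of commands a driver might run, in order.

    This only builds the command lines; it does not execute anything.
    """
    stem = Path(source).stem
    asm = f"{stem}.s"  # textual assembly written by the compiler proper
    obj = f"{stem}.o"  # object file written by the assembler
    return [
        ["cc1plus", source, "-o", asm],  # compile: source -> assembly
        ["as", asm, "-o", obj],          # assemble: assembly -> object
        ["ld", obj, "-o", output],       # link: object(s) -> executable
    ]


if __name__ == "__main__":
    for argv in plan_pipeline("hello.cc", "hello"):
        print(" ".join(argv))
```

Each stage reads the file the previous one wrote, which is the "communicate via files" arrangement described above.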

Components

Here are the components in the compiler:

  • Preprocessor. For C-like languages, there are actually two phases of language analysis. The first is the preprocessor, which is essentially a textual transformation. We used to have separate preprocessors, but these days it's all integrated into the compiler proper. If you use -E to get preprocessor output and then compile that output, cc1plus ends up being invoked twice, once to generate the preprocessed output and again to read that back in and continue; -save-temps makes the driver do both steps in a single command.
  • Lexer. This piece is responsible for tokenizing the input. That is, figuring out what's a number, an identifier, a keyword or a piece of punctuation.
  • Parser. This is the bit that understands the source language. It figures out which bits are declarations of variables, functions and the like, what an expression looks like, and which function a particular call resolves to. In C++-land it also deals with template instantiation and whatnot.
  • Optimizers. There will be a whole bunch of optimizers, transforming the code to make it go faster or be smaller. There could be hundreds of different optimizations, all running in a particular order.
  • Target code generation. Towards the end of the compilation, we need to generate assembly code for the CPU we're targeting. One important piece here is register allocation. That means figuring out which registers in the CPU can be used for each instruction.
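
As an illustration of the lexer stage from the list above, here is a minimal tokenizer sketch. The token kinds, the keyword set and the punctuation it accepts are all assumptions chosen to keep the example short. A real C++ lexer handles far more (comments, string literals, floating-point numbers, preprocessor interaction).

```python
import re

# Hypothetical, deliberately tiny keyword set.
KEYWORDS = {"int", "return", "if", "else", "while"}

# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("PUNCT", r"==|!=|<=|>=|[-+*/=<>(){};,]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(
                f"unexpected character {lexeme!r} at offset {match.start()}"
            )
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"  # keywords look like identifiers until checked
        tokens.append((kind, lexeme))
    return tokens


if __name__ == "__main__":
    print(tokenize("int x = 42;"))
```

Note that the lexer has no idea whether `int x = 42;` is a valid declaration. It only classifies the pieces; making sense of them is the parser's job.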

As your program moves through these stages, it may be represented differently. The different stages are doing different things, so it's not unreasonable to think they may prefer different representations. Sadly, there is no Grand Unifying Representation. What's good for one stage can be really bad for another.
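
Here is a small sketch of the same expression in two representations. The nested-tuple tree is the sort of shape a parser naturally produces; the flat list of three-address instructions is closer to what optimizers and code generators like to work on. Both forms, and the function name lower, are made up for illustration and are not any real compiler's internal representation.

```python
import itertools


def lower(ast):
    """Flatten a nested-tuple expression tree into three-address code.

    Returns (instructions, result_name), where each instruction is a
    (dest, op, left, right) tuple.
    """
    code = []
    temps = (f"t{n}" for n in itertools.count(1))  # fresh temporaries t1, t2, ...

    def walk(node):
        if isinstance(node, str):  # leaf: a variable name or literal
            return node
        op, lhs, rhs = node
        left = walk(lhs)
        right = walk(rhs)
        dest = next(temps)
        code.append((dest, op, left, right))
        return dest

    result = walk(ast)
    return code, result


if __name__ == "__main__":
    tree = ("+", "a", ("*", "b", "c"))  # a + b * c as a tree
    instructions, result = lower(tree)
    for dest, op, left, right in instructions:
        print(f"{dest} = {left} {op} {right}")
    print("result in", result)
```

The tree makes the nesting of the expression obvious, which suits language analysis. The flat form makes evaluation order and intermediate values explicit, which suits things like register allocation, but the original nesting is no longer directly visible.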