From Quasi Paragon
Jump to: navigation, search

The optimizers use a different representation to the front-end's AST. They're solving a different problem, and AST is not an appropriate data structure. Generally, when discussing IR and compilers, it is the form used by the optimizers that is meant. There are two major components of the IR:

An important representation of local variables is Static Single Assignment.


The instructions that make up the bodies of the BBs are (usually) abstract — they don't correspond directly to instructions executed by the target processor. However, they look very similar in that they have what is known as a 3-address form. This corresponds to:

result = arg1 OP arg2;

which satisfies the vast majority of cases. OP is an operator such as +, -, *, / or whatever. Of course there are exceptions which I'll get to some other time.

Unlike assembly language, this is explicitly typed. result, arg1 and arg2 all have types (which must be correct), as does the operator (it might still be implicit). All conversions will be explicit. For instance:

u_thing = (unsigned int)thing;

Also unlike assembly code, there is an unbounded number of temporary variables — pseudos. We can generate new pseudos as we desire — a late stage optimization pass will figure out where to put them all. In the below I use tmp1 etc as temporaries. In the compiler dumps you'll see them as D.nnnn and _n.

One thing this does have in common with assembler is that there are no complex operands -- the operands must be constants or variable, no sub expressions. The earlier example of sum += ix * ix would be represented as:

tmp = ix * ix;
sum = sum  + tmp;

I'll talk about the exceptional cases of instructions that don't fit this form, or the need to represent some special operation, some other time.

Compiler Dumps

For the really curious, G++ has -fdump-tree-all where it'll spew its guts. -fdump-tree-lower outputs one of the early pre-ssa internal forms. Here's that summation routine, with my annotations in comments:

int sum(int) (int a)
  int ix;
  int sum;
  int D.1541;  // a temorary we created when simplifying
  int D.1540;

  sum = 0;
  ix = 0;
  <D.1538>:  // a label marking the beginning of a BB
  if (ix >= a) goto <D.1535>; else goto <D.1539>;
  D.1540 = ix * ix;  // we broke sum += ix *ix into two instructions
  sum = D.1540 + sum;
  ix = ix + 1;
  goto <D.1538>;
  D.1541 = sum;
  goto <D.1542>;
  return D.1541;

That's not SSA-form (for instance, ix gets more than one assignment). Use -fdump-tree-ssa to get the first SSA representation:

int sum(int) (int a)
  int ix;
  int sum;
  int D.1541;
  int D.1540;
  int _6;  // more temporaries
  int _9;

  <bb 2>:
  sum_3 = 0;
  // a clone of the original ix variable. Happens to have UID 4
  ix_4 = 0;

  <bb 3>: // now labelling blocks by their block number
  // a PHI node value is sum_3 if we arrived here from bb 2,
  // and sum_7 if we came from bb 4
  # sum_1 = PHI <sum_3(2), sum_7(4)>
  # ix_2 = PHI <ix_4(2), ix_8(4)>
  // a_5 is the 'a' argument, I'd have to go look to understand why (D)
  if (ix_2 >= a_5(D))
    goto <bb 5>;
    goto <bb 4>;

  <bb 4>:
  _6 = ix_2 * ix_2;
  sum_7 = _6 + sum_1;
  ix_8 = ix_2 + 1;
  goto <bb 3>;

  <bb 5>:
  _9 = sum_1;

  return _9;

GCC spews out variables of the form <name>_<uid> where <name> is the user-given name (empty if a compiler-generated temporary) and <uid> is its internal numbering of variables — so every <uid> will be unique.