From Quasi Paragon
Jump to: navigation, search

An AST is a representation of the program that is fairly close to what you might think of the source code yourself. it is essentially a set of typed objects representing things like 'expressions', 'types', 'functions'. In spite of it being called a 'tree', it is pedantically a directed graph. I guess simpler languages than C++ and simple code generators could consider it as a tree (well, Directed Acyclic Graph), but these days one has to anticipate loops in it.

Because ASTs are graphs, I'm going to use LISP pretty-printer-style back references. Y'all are familiar with lisp, right? No? Ok, nevermind. Here's the salient 2 features:

  1. #N=<whatever> Some numbered node in the graph that we want to refer to later
  2. #N# A back reference to labelled node

this notation saves me from figuring out how to draw pictures. I'm also going to use '->' to mean 'points-at' (rather than includes by value).

Let's consider the AST for a simple function:

int frob (int a, int b) {
  return a + b * 2;
}

The AST will look something like: FUNCTION_DECL

  Name->'frob'
  Type->FUNCTION_TYPE
     ReturnType->#1=INTEGER_TYPE
     ArgTypes->{#1#, #1#}  // Two ints
  Params->{#2=PARM_DECL:'a':#1#, // I've squished the representation a bit here
           #3=PARM_DECL:'b':#1#}
  CMPD_STMT
    RETURN_STMT
      PLUS_EXPR
        Type->#1# // an int
        Lhs->#2#  // ref to PARM_DECL for 'a'
        Rhs->MULT_EXPR
          Type->#1# // an int
          Lhs->#3# // ref to PARM_DECL for 'b'
          Rhs->CONSTANT
                Type->#1# // an int
                Value->2  // some bit-representation

Some salient points:

  • Everything is explicitly typed. The intermediate nodes of the expression have explicit types — we don't have to go inferring the type by looking at the operands.
  • I'm not explaining (here) how the compiler tied up the two instances of the name 'a' to a single parameter decl
  • Nodes are themselves categorized
  • An AST object's fields are generally statically typed as to what category of node they can point to

The categories of nodes for C++ are generally:

  • Declarations — things found by name in the source, like a variable or a function
  • Types — types, like 'int', 'pointer to int'
  • Expressions — things that get evaluated and provide a result, like a multiplication, assignment or function call.
  • Statements — imperative statements like if or for

The categories here reflect the structure of the source language. For instance, even though if (cond) a; else b; and cond ? a : b are somewhat similar, we represent them differently in the AST (as statement and expression nodes respectively)

Types

Let's dive a little more into type representations. There are essentially 3 kinds of types in C++:

  1. Fundamental types — int, float, etc
  2. Elaborated types — class Foo, enum Baz, etc
  3. Compound types — pointer to type, array of type, etc

Elaborated types have a name, by which they are found. So are they really types? Or are they decls? Let's put that aside for a bit, and consider compound types. Here the type is constructed from some other type. For instance a pointer to int, int *, would be:

POINTER_TYPE:
   Pointee->INTEGER_TYPE

By construction we always want the Pointee field to point at a type thing, not a decl, expression or statement. As we can have pointers to structures, that suggests elaborated types should smell type-like. But of course, they behave decl-like in some circumstances. Oo! I hear you say, multiple inheritance! Good plan, but not generally used. Although the AST has some static typing (by kind-of-node), it is very dynamic and tree walkers need to specialize on the actual node being inspected. Aha! dynamic_cast? Sadly not. dynamic_cast relies on polymorphic objects and is quite slow. In GCC's case, which was originally written in C, it's not even possible. However, Clang, a thoroughly modern C++ source base doesn't use dynamic_cast either. Both compilers use their own special mechanisms to achieve polymorphism, essentially storing a 'kind' value in each object.

Before returning to elaborated types, what about typedefs? There we have a type thing, and we're giving it a name. Typedefs are declarations. Here's one for typedef int my_int;:

TYPE_DECL
   Name->'my_int'
   Type->INTEGER_TYPE

A pointer to my_int, my_int *, smells the same as a plain int *, so we could represent it as that. But that'd lose the my_int name when looking at the pointer. Here compilers differ. Clang represents my_int as a SUGARED_TYPE[1] that references a plain int. GCC represents it as a TYPE_DECL to a cloned INTEGER_TYPE that itself back-references the TYPE_DECL. These have tradeoffs. In GCC the type hierarchy reflects the conceptual types directly. In Clang one has to keep desugaring types to get that representation, but the user-specified name is clearer.

Now you might see how we could extend TYPE_DECLs to deal with elaborated types:

#1=TYPE_DECL
    Name->'mystruct'
    Artificial  // Marker that this isn't a real typedef
    Type->RECORD_TYPE
         Name->#1# // back reference to the TYPE_DECL
         Members->{....}

In this way a compound type can point at the RECORD_TYPE node, but find the name of that structure by following the Name field to the TYPE_DECL.


  1. Typedefs are just syntactic sugar for other types, hence the name