From Quasi Paragon

Parsing is the process by which your source code gets turned into an Abstract Syntax Tree (AST). There's a whole theory of parsers and grammars, so let's look at some bits of it.

Language specifications consist of two major components.

  • Grammar — what types of words make a sentence.
  • Semantics — what those sentences mean.

I'm going to skip the preprocessor. Consider everything here as being applied to the preprocessed source code.

Lexing

Before diving into grammar, let's start by splitting the source code into a sequence of tokens. This process is called lexing, from lexical analysis. For C-like languages this is quite simple. We break the input into:

  • identifiers - a sequence of alphanumerics starting with an alpha (let's just say flatworm (_) is a letter, ok?)
  • keywords - some look like identifiers (if), others are punctuation (+ or ++)
  • literals - string literals like "Hello world!", integer literals like 5

We use the max munch rule (always take the longest token that matches) to disambiguate:

  • +++ is ++ followed by +
  • abcd is the identifier abcd not two identifiers ab and cd

Each of these tokens has a kind and a value. For example:

  • kind:identifier - value:the sequence of chars
  • kind:keyword - value:some enumeration
  • kind:literal - value:whatever constant it is

We've stripped out whitespace and comments. Unlike natural languages we don't have ambiguities here — a token cannot have multiple kinds.
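A lexer along these lines can be sketched in a few lines of Python. The keyword and punctuator sets, the regexes, and all the names here are illustrative, not from any real compiler:

```python
import re

# A lexer sketch for the token kinds above; the keyword and punctuation
# sets (and all names) are illustrative, not from any real compiler.
KEYWORDS = {"if", "while", "return"}                      # identifier-like keywords
PUNCTUATORS = ["++", "+", "-", "*", "=", ";", "(", ")"]   # longest first

TOKEN_RE = re.compile(
    r"\s+|//[^\n]*"                               # whitespace and comments: stripped
    r"|(?P<identifier>[A-Za-z_][A-Za-z0-9_]*)"    # flatworm counts as a letter
    r'|(?P<literal>[0-9]+|"[^"]*")'
    r"|(?P<keyword>" + "|".join(map(re.escape, PUNCTUATORS)) + r")"
)

def lex(source):
    """Return a list of (kind, value) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"stray character at offset {pos}")
        pos = m.end()
        if m.lastgroup is None:
            continue                              # whitespace or comment
        kind, value = m.lastgroup, m.group()
        if kind == "identifier" and value in KEYWORDS:
            kind = "keyword"                      # reclassify identifier-like keywords
        tokens.append((kind, value))
    return tokens
```

Note how max munch falls out: the identifier pattern is greedy, and ++ is listed before + in the punctuator list, so the longer token wins.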

Grammar

Simple declarative English sentences are <noun-phrase> <verb> <noun-phrase>, and a noun-phrase is either a proper noun (Nathan) or <article> <regular-noun> (the boy). Programming languages are just the same. But of course, we have a whole language to describe grammars themselves. This is important because, unlike English, we need to be very precise. The language is called Backus-Naur Form[1] (BNF).

BNF is constructed from two kinds of entities:

  • Terminals — the tokens that can come from the lexer. We're only interested in the kind (identifier, keyword, literal)
  • Non-Terminals — a rule in the grammar

All a BNF is, is a set of rules recursively describing how a non-terminal is constructed from a sequence of other terminals and non-terminals. A non-terminal might have several alternatives. Here's a BNF for a very simple language:

program : statement-seq

statement-seq : statement
              | statement-seq statement

statement : expression ;
          | identifier = expression ;

expression : value
           | expression + value
           | expression - value
           | expression * value

value : literal
      | identifier
      | ( expression )

Usually we use different fonts to distinguish the non-terminals from keywords, but that's too hard here. The alternative sequences a non-terminal can match are shown with | prefixes. These alternatives are called reductions — we're reducing a sequence to a single non-terminal. In this language I can write:

bob = 5;
bob + 6 * 5;
bill + bob;

but I can't write:

= 5;
+ 5;
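To make this concrete, here is a sketch in Python of a hand-written recognizer for the grammar above, assuming the (kind, value) token pairs described earlier. All it does is answer yes or no. The left-recursive rules become loops, and the two statement alternatives are settled by peeking one token further. All names are illustrative:

```python
# A recognizer sketch for the toy grammar above; tokens are (kind, value)
# pairs, and every name here is illustrative.
def recognize(tokens):
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else ("eof", "")

    def accept(kind, value=None):
        nonlocal pos
        k, v = peek()
        if k == kind and (value is None or v == value):
            pos += 1
            return True
        return False

    def value():                      # value : literal | identifier | ( expression )
        if accept("literal") or accept("identifier"):
            return True
        return accept("keyword", "(") and expression() and accept("keyword", ")")

    def expression():                 # expression : value | expression + value | ...
        if not value():
            return False
        while accept("keyword", "+") or accept("keyword", "-") or accept("keyword", "*"):
            if not value():
                return False
        return True

    def statement():                  # statement : expression ; | identifier = expression ;
        nonlocal pos
        if peek()[0] == "identifier" and peek(1) == ("keyword", "="):
            pos += 2                  # consume 'identifier ='
        return expression() and accept("keyword", ";")

    if not statement():               # program : statement-seq
        return False
    while peek() != ("eof", ""):
        if not statement():
            return False
    return True
```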

There are automated programs like yacc (Yet Another Compiler Compiler) that can take a BNF and produce a parser for that language. These tools generally rely on certain restrictions on the grammar, which programming languages try to stick to. The above grammar is an LALR(1) grammar. That means:

  • Look Ahead. I need to look ahead some number of tokens to figure out which (set of) paths I'm going down.
  • LR. Left-to-right scan, Rightmost derivation. LR parsers are happy with left recursion, which is why the statement-seq rule is statement-seq statement, not statement statement-seq.
  • (1). 1 token of lookahead. The simplest case.

C++ is not an LALR(1) grammar: it requires unbounded backtracking because of the declaration/expression ambiguity (see Vexing parse). However, the language specification still uses a BNF, and has words to describe the ambiguities and how they are resolved.

Back in the 70s and 80s everybody used generated parsers from things like yacc and bison (GNU's yacc variant). Then everybody switched to hand-written recursive descent parsers. Such are the fashion trends in compilers!

Actions

All the grammar is capable of doing is saying 'yes, this is a sentence of $language', or 'nope'. That's not very useful. To bring it to life we add an action to each reduction. When we determine that we've found an unambiguous reduction, we invoke the action. It is that action that builds up the AST!

Let's add actions to the expression reductions above:

expression : value     { $$ = $1; }
           | expression + value { $$ = build_node (ADD, $1, $3); }
           | expression - value { $$ = build_node (SUB, $1, $3); }
           | expression * value { $$ = build_node (MULT, $1, $3); }

Here I'm using yacc grammar, BTW[2]. Regular C code goes inside the {...}, with some special substitutions: $$ means 'the result of this action', and $N means the value of the Nth item in the matched sequence, counting from 1. Even though + etc. are keywords in the grammar I'm parsing, they still occupy a position (and have a value) inside the yacc rule, so I have to use $1 and $3 to get at the LHS and RHS operands. You can probably see how we can use this to build up an entire AST.
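The same reductions-with-actions idea can be sketched outside yacc too. Here is a Python version of the expression rules where each reduction runs an action, with build_node just making a plain tuple; all names are illustrative:

```python
# An actions sketch: each time a reduction is recognized, run its action.
# Nodes are plain tuples; every name here is illustrative.
def build_node(op, lhs, rhs):
    return (op, lhs, rhs)

OPS = {"+": "ADD", "-": "SUB", "*": "MULT"}

def parse_expression(tokens, pos=0):
    """Parse 'expression' from (kind, value) tokens.
    Returns (ast, next_pos)."""
    def parse_value(pos):
        kind, value = tokens[pos]
        if kind in ("literal", "identifier"):
            return value, pos + 1                 # value : literal | identifier
        if (kind, value) == ("keyword", "("):
            ast, pos = parse_expression(tokens, pos + 1)
            assert tokens[pos] == ("keyword", ")")
            return ast, pos + 1                   # value : ( expression )
        raise SyntaxError(f"unexpected {kind} {value!r}")

    ast, pos = parse_value(pos)                   # expression : value  { $$ = $1; }
    while pos < len(tokens) and tokens[pos][0] == "keyword" and tokens[pos][1] in OPS:
        op = tokens[pos][1]
        rhs, pos = parse_value(pos + 1)
        ast = build_node(OPS[op], ast, rhs)       # { $$ = build_node (OP, $1, $3); }
    return ast, pos
```

Note the grammar gives all three operators the same (left-associative) precedence, and so does this sketch: bob + 6 * 5 comes out as (MULT, (ADD, bob, 6), 5).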

Error Handling

As you're no doubt aware, there are two kinds of errors: syntactic and semantic.

Syntactic errors occur when there is no reduction that can accept the next token. Here the trick is error recovery. Generally one marks the BNF with special error markers that tell the parser how to recover — keep popping reductions and skipping tokens until something matches after an ERROR marker.
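A common hand-written version of this recovery is panic mode: on a syntax error, skip tokens until a synchronizing token such as ; and carry on. A sketch, assuming a statement parser that returns (ast, next_pos) or raises SyntaxError; all names are illustrative:

```python
# Panic-mode recovery sketch: the hand-written analogue of matching
# 'error ;' in a yacc grammar. All names are illustrative.
def parse_statement_seq(tokens, parse_statement):
    """parse_statement(tokens, pos) -> (ast, next_pos), or raises SyntaxError."""
    statements, errors, pos = [], [], 0
    while pos < len(tokens):
        try:
            ast, pos = parse_statement(tokens, pos)
            statements.append(ast)
        except SyntaxError as err:
            errors.append(str(err))
            # skip tokens until just past the next ';', then resume
            while pos < len(tokens) and tokens[pos] != ("keyword", ";"):
                pos += 1
            pos += 1
    return statements, errors
```

The payoff is that one stray token produces one diagnostic, not a cascade, and parsing continues so later errors are still found.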

Semantic errors are when the kinds of things being given to an action don't make sense — subtracting a string, for instance. Generally the action returns something like 'NULL' after emitting a suitable diagnostic. The bill + bob statement above is a semantic error, because although bill is an identifier, we never declared it.
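As a sketch, here is what such an action might look like for value : identifier, with a plain set of declared names standing in for a real symbol table; the function names are illustrative:

```python
# A semantic-check sketch for the action of 'value : identifier'.
# 'declared' stands in for a symbol table; names are illustrative.
def check_identifier(name, declared, diagnose=print):
    """Return a node for a declared identifier, or None (the 'poisoned'
    result) after emitting a diagnostic, as with 'bill + bob' above."""
    if name not in declared:
        diagnose(f"error: '{name}' was never declared")
        return None
    return ("VAR", name)
```

Later actions then treat a None operand as already-diagnosed and quietly return None themselves, so one semantic error doesn't snowball.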

  1. It was originally Backus Normal Form, but Peter Naur's contribution (as editor of the ALGOL 60 report, where the notation took shape) was so significant that Donald Knuth proposed retconning it to Backus-Naur Form.
  2. Compiler writers love designing Domain Specific Languages.