From Quasi Paragon
Jump to: navigation, search

Parsing C++ has some non-trivial corners. See Parsing for the overview of parsing. Here I'll talk about the ambiguities in C++.

The canonical ambiguity is between something being a declaration of a new thing, or an expression evaluating some existing things. It occurs because C++ has functional casts. C doesn't have that, a cast is always (T)expr. But in C++ I can write T (expr), when T is spellable as an id-expression, for example std::string (var). In the middle of some larger expression, that's quite clear. But at the very beginning of a line[1], it could be a declaration of a new variable called var.

Why? Because you can put extra parentheses around parts of a declaration. You hardly ever do that, unless you're trying to declare a pointer to a function:

void (*ptr) (); // ptr is pointer to function returning void;
void *ptr ();   // ptr is a function returning pointer to void;

But the grammar doesn't care (see below). A declaration is (I'm simplifying the grammar):

declaration = decl-specifier-seq declarator ;

decl-specifier-seq = int/long/static etc

declarator = identifier                    // simple name
           | ptr-or-ref declarator         // pointer/reference to thing
           | declarator [ expr-opt ]       // array of thing
           | declarator ( paren-list-opt ) // function returning thing
           | ( declarator )                // parenthesized declarator

See that the forms of declarator are recursive.

The disambiguating rule in C++ is 'anything that looks like a declaration, is'. So std::string (var); is a declaration of var, when it appears as a complete statement-or-declaration. But here's the difficulty, something after the declarator-like thing might make it not-a-declaration. For example std::string (var) + "". It's only when we meet the + that we discover it can't be a declaration.

What to do? Enter tentative parsing. Tentative parsing is the ability to try one particular path in the grammar, and then if it doesn't work out, backtrack and try a different path. Remember, simple grammars (like C) only require 1 token lookahead to determine the set of possible reductions to proceed down. Here we require arbitrary backtracking.

So the parser has a tentative capability, which can be nested(!), and the ability to see if a tentative path worked out. For the above, it'd look a bit like:

parser->push_tentative ();
  parser->parse_declaration ();
  if (parser->ok () && parser->looking_at (';')) {
     // it's a declaration
     parser->commit_tentative ();
     parser->eat_token();
} else {
   parser->abort_tentative ();
   parser->parse_statement ();
}

We must be careful in tentative parsing. We:

  • must not change global state (or have a way of undoing it)
  • must not emit errors (we might back out of this entirely)
  • must not emit warnings (likewise)

This is quite hard. As is generating a good diagnostic for when neither alternative worked — we really want to remember the point each alternative barfed, and emit both errors.

Notice in the above I changed the grammar for simplification. In real C++ I can declare a bunch of things like int a, b, c;. What if I wrote that as int (a), b + 1, is that a declaration of a followed by a syntax error, or is it a compound-expression consisting of int (a) and b + 1? It's the former — we don't have to be utterly heroic in back tracking.

C++ constructor notation complicates things further. T v (a); is a declaration of v constructed from variable a. But if I apply a functional cast on that argument to make T v (U (a)), it's now a declaration of a function called v taking an argument of type U with an extra parenthesis around the argument name. This gets harder with more arguments to the constructor:

T v (U (a), V (b), W (3)); // variable
T v (U (a), V (b), W (c)); // function

It's only when we parse the final W (3) that we discover the first one cannot be a function declaration, because 3 is not a valid declarator.

Making the Grammar Care

What if we changed the grammar so that you could only parenthesize things like pointers-to-functions. That's possible. We might split the grammar above:

declarator = identifier                    // simple name
           | ptr-or-ref declarator         // pointer/reference to thing
           | declarator [ expr-opt ]       // array of thing
           | declarator ( paren-list-opt ) // function returning thing
           | paren-declarator              // parenthesized declarator

paren-declarator = ( ptr-or-ref declarator )

This forces the parenthesized thing to have a leading * or &, which is what we care about. I can write T (thing) and be confident that must be an expression. And I can write T (*fptr) () and have that be a pointer-to-function. But I can also write T (*ptr) and be ambiguous. Let's try a bit harder:

paren-declarator = ( ptr-or-ref declarator ) [ expr-opt ]
                 | ( ptr-or-ref declarator ) ( paren-list-opt )

This forces a parenthesized declarator to be part of a function or array declaration.

So this is doable, at the expense of being incompatible with C, and being non-orthogonal. It's not the direction C++ went in.

Other Ambiguities

There are some other ambiguities, not all of which need such heroics to parse. Here is the list in the standard (as of Dec '17)

  • [stmt.ambig]/1 'There is an ambiguity in the grammar involving expression-statements and declarations: An expression-statement with a function-style explicit type conversion (8.2.3) as its leftmost subexpression can be indistinguishable from a declaration where the first declarator starts with a (. In those cases the statement is a declaration.' (this is the above case).
  • [dcl.ambig.res]/1 ' The ambiguity arising from the similarity between a function-style cast and a declaration mentioned in 9.8 can also occur in the context of a declaration. In that context, the choice is between a function declaration with a redundant set of parentheses around a parameter name and an object declaration with a function-style cast as the initializer. Just as for the ambiguities mentioned in 9.8, the resolution is to consider any construct that could possibly be a declaration a declaration.' (another case of the above)
  • [dcl.ambig.res]/2 'An ambiguity can arise from the similarity between a function-style cast and a type-id. The resolution is that any construct that could possibly be a type-id in its syntactic context shall be considered a type-id.' (as above)
  • [dcl.ambig]/3 'Another ambiguity arises in a parameter-declaration-clause when a type-name is nested in parentheses. In this case, the choice is between the declaration of a parameter of type pointer to function and the declaration of a parameter with redundant parentheses around the declarator-id. The resolution is to consider the type-name as a simple-type-specifier rather than a declarator-id.' (as above)
  • [expr.unary.op]/10 'There is an ambiguity in the grammar when ~ is followed by a class-name or decltype-specifier. The ambiguity is resolved by treating ~ as the unary complement operator rather than as the start of an unqualified-id naming a destructor.' (this doesn't require tentative parsing)
  • [dcl.enum]/1 'A : following “enum nested-name-specifier opt identifier” within the decl-specifier-seq of a member-declaration is parsed as part of an enum-base. [ Note: This resolves a potential ambiguity between the declaration of an enumeration with an enum-base and the declaration of an unnamed bit-field of enumeration type.' (struct S { enum E : int {}; }; that's a member enum definition, not a zero-width unnamed bitfield of enum type.)
  • [dcl.fct]/18 'There is a syntactic ambiguity when an ellipsis occurs at the end of a parameter-declaration-clause without a preceding comma. In this case, the ellipsis is parsed as part of the abstract-declarator if the type of the parameter either names a template parameter pack that has not been expanded or contains auto; otherwise, it is parsed as part of the parameter-declaration-clause.'
  • [temp.arg]/2 'In a template-argument, an ambiguity between a type-id and an expression is resolved to a type-id, regardless of the form of the corresponding template-parameter.' (in f<int ()> the argument is a function type, not a default-constructed int.)

Blue Sky

IMHO it all went wrong wih K&R style inside-out type declarations. Blame them. Philosophically, the important thing about a declaration is the name of the thing being declared. Then the category of the thing (var, type, function, template) and finally the details of that category. We read English left-to-right, so mirroring that in the declarations would probably help. Instead, we've invented a strange recursive mini-language that the compiler itself has to invert to get the type structures it wants anyway. Imagine if C looked like:

declaration = identifier storage-opt type;

type = fundamental-type            // int/short/whatever
        | ( param-list-opt ) type  // function returning type
        | ptr-or-ref type          //  pointer or reference to type
        | cv-qual type             // qualified type
        | [ expr-opt ] type        // array of type

It'd look like

bob int; // an int variable
frob (x int, y int) int; // fn taking 2 ints returning an int
table [10] int;  // array of 10 ints. 
bill * const int; // pointer to constant int
// fn taking int returning pointer to function returning nothing
frob (int) * (int) void; 

  1. The colloquial meaning of 'line', just after a semicolon or '{'.