Dabbling Seriously

Code, Coffee, and other Stuff.

Thoughts on type-safe parser generation

Several years ago (2005 (!!)) I wrote a tool for generating LALR(1) parser tables, and an associated parser. I wanted to do this because there were a couple of things that annoyed me about parsers written in yacc/bison. The main one of these was that I had to choose just one type to represent the values stored on the parser stack. If I had to choose just one type it might be a base class for every syntax tree type. Then you would store a pointer to the base class, and implement each type that you wanted to represent in the grammar as a subclass.

class IAmASyntaxTree
   virtual ~IAmASyntaxTree()

But my first preference would have been for each token to have it’s own type. As I recall, when using bison/yacc you could specify a union that would hold any one of all of your types. This was great as long as you didn’t want to take advantage of the C++ RAII (Resource Acquisition is Initialization) idiom, because when you use unions in C++, your constructors and destructors aren’t called. This meant that you had to write a seperate garbage collector for any memory that you might have wanted to allocate during parsing. This might look like a linked list of IAmASyntaxTree pointers that you would iterate through and delete when you’d finished parsing. This was all a great big pain in my butt, to get not what I really wanted in the first place.

What I really wanted

My preference would have been to be able to specify a different type for each token, and have the parser generator sort the pain of that out behind the scenes. If I had a raw token like an operator, I didn’t want to bother with the type at all, so that would be type ‘void’ and if I had something like a literal or an identifier, it would be type ‘std::string’, and an expression might be type IExpression or std::auto_ptr<IExpression> even, in case I wanted to take advantage of C++’s standard library.

token<std::string> tLiteral, tIdentifier;
token<void> tOpenPar, tClosePar, tPlus, tMinus, tMul,
tDiv, tSemicolon, tBegin, tEnd, tWhile, tBreak, tContinue,
tReturn, tFunction, tAssign, tIf, tLoop, tElse, tElseIf,
tComma, tLocal, tGlobal, tEq, tNe, tLess, tGreater;
token<float> tDecimal;

token<void> ntProgram, ntDefinitions, ntDefinition;
token<CFunctionHeader*> ntFunctionHeader;
token<IStatementPtr> ntBlockStatement, ntSimpleStatement, ntStatements, ntOptionalElse;
token<IValExprPtr> ntLExpr;
token<IValExprPtr> ntRExpr, ntReturnExpr;
token<CStringListPtr> ntArgs;
token<CExprListPtr> ntExpressions;

The ‘Ptr’ types are typedefs of std::auto_ptr<T> so that I didn’t have to write a C++ template parser in the parser for my parser generator. Something that is notoriously difficult to do.

Once I have my tokens and types, I want to specify the productions:

ntProgram -> ntDefinitions {}
ntDefinitions -> ntDefinition ntDefinitions {}
ntDefinitions -> {}
ntDefinition -> ntFunctionHeader tSemicolon {}
ntDefinition -> ntFunctionHeader tBegin ntStatements tEnd {this->implementFunction($1,$3);}
ntFunctionHeader -> tFunction tIdentifier tOpenPar ntArgs tClosePar {$$ = this->registerFunction($2, $4);}
ntStatements -> ntSimpleStatement tSemicolon ntStatements {$$ = this->linearStatementList($1, $3);}
ntStatements -> ntBlockStatement ntStatements {$$ = this->linearStatementList($1, $2);}
ntStatements -> {$$ = this->emptyStatementList();}
ntLExpr -> tIdentifier {$$ = this->refExpr($1);}
ntLExpr -> ntRExpr {$$ = $1;}
ntSimpleStatement -> tReturn ntReturnExpr {$$ = this->returnStatement($2);}
ntSimpleStatement -> {$$ = this->nullStatement();}
ntSimpleStatement -> ntRExpr {$$ = this->expressionStatement($1);}
ntSimpleStatement -> tBreak {$$ = this->breakStatement();}
ntSimpleStatement -> tContinue {$$ = this->continueStatement();}
ntSimpleStatement -> tLocal tIdentifier {$$ = this->localVarStatement($2);}
ntSimpleStatement -> tGlobal tIdentifier {$$ = this->globalVarStatement($2);}
ntBlockStatement -> tIf tOpenPar ntRExpr tClosePar ntStatements ntOptionalElse tEnd {$$ = this->ifStatement($3, $5, $6);}
ntBlockStatement -> tLoop ntStatements tEnd {$$ = this->loopStatement($2);}
ntReturnExpr -> ntRExpr {$$ = $1;}
ntReturnExpr -> {}
ntRExpr -> ntRExpr tLess ntRExpr {$$ = this->lessExpr($1, $3);}
ntRExpr -> ntRExpr tGreater ntRExpr {$$ = this->greaterExpr($1, $3);}
ntRExpr -> ntRExpr tPlus ntRExpr {$$ = this->plusExpr($1, $3);}
ntRExpr -> ntRExpr tMinus ntRExpr {$$ = this->minusExpr($1, $3);}
ntRExpr -> ntRExpr tMul ntRExpr {$$ = this->mulExpr($1, $3);}
ntRExpr -> ntRExpr tDiv ntRExpr {$$ = this->divExpr($1, $3);}
ntRExpr -> ntRExpr tEq ntRExpr {$$ = this->eqExpr($1, $3);}
ntRExpr -> ntRExpr tNe ntRExpr {$$ = this->neExpr($1, $3);}
ntRExpr -> tMinus ntRExpr {$$ = this->negExpr($2);}
ntRExpr -> tOpenPar ntRExpr tClosePar {$$ = $2;}
ntRExpr -> ntLExpr {$$ = this->getValueExpr($1);}
ntRExpr -> tDecimal {$$ = this->constantExpr($1);}
ntRExpr -> tLiteral {$$ = this->literalExpr($1);}
ntRExpr -> ntRExpr tOpenPar ntExpressions tClosePar {$$ = this->callFuncExpr($1, $3);}
ntSimpleStatement -> ntLExpr tAssign ntRExpr {$$ = this->assignStatement($1, $3);}
ntExpressions -> {$$ = this->CExprListPtr(new CExprList());}
ntExpressions -> ntRExpr {$$ = CExprListPtr(new CExprList()); (*$$).push_back($1);}
ntExpressions -> ntExpressions tComma ntRExpr {$$ = $1; (*$$).push_back($3);}
ntOptionalElse -> tElse ntStatements {$$ = $2;}
ntOptionalElse -> tElseIf tOpenPar ntRExpr tClosePar ntStatements ntOptionalElse {$$ = this->ifStatement($3, $5, $6);}
ntOptionalElse -> {}
ntArgs -> {$$ = CStringListPtr(new CStringList());}
ntArgs -> tIdentifier {$$ = CStringListPtr(new CStringList()); (*$$).push_back($1);}
ntArgs -> ntArgs tComma tIdentifier {$$ = $1; (*$$).push_back($3);}

The parser generator will do the rest, generating a parse table, and a parser. Once you have that, all you need to do is implement your lexical analyzer, and language runtime, and you have a working programming language and interpreter.

Back in 2005 I had this working in my visual studio 2002 compiler, but up until recently it had never built in gcc. I’ve just about got this working, so might release the code soon, if I get around to it.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s


This entry was posted on August 24, 2011 by in Uncategorized.
%d bloggers like this: