Reconstructing Ruby, Part 4: Our First Parser

Read Part 3 in case you missed it.

Now that our lexer is in a pretty good situation, we can start working on our parser. We'll be using Bison. Bison works very similar to Flex, but instead of generating a lexer, it will generate a parser.

Bison is an improvement on the program Yacc. It's an LALR parser generator. LALR stands for Look-Ahead Left-to-right, Rightmost derivation parser, which is a mouthful. I'm not going to even try to explain this.

In programming, a grammar refers to what tokens are allowed to be next to each other, as well as what tokens, when next to each other, represent a higher level concept.

First let's describe a program. In Ruby, a program is just a list of expressions. In programming terms, an expression is a piece of code that returns a value. In Ruby pretty much everything will return a value so we can consider everything an expression.

Other languages have an alternative to expressions called statements. Statements are like expresions but do not have a return value. Ruby technically has no statements as everything returns a value.

To start our grammar, we'll say a program is just a list of expressions:

program: expressions  

program and expressions are both known as non-terminals. You can think of a non-terminal as "not a token". The tokens on the other hand we'll refer to as terminals.

The main idea here is that every non-terminal must eventually resolve to a terminal. To some degree that makes sense because our program is just built up of tokens so we eventually want to get to a point to where we're talking about those tokens. One way to generalize a parser is to say it takes an array of tokens and from them builds a tree like structure known as an AST or Abstract Syntax Tree. You can find out more about ASTs on this Wikipedia page.

An interesting part of definining a grammar is dealing with recursion. For instance we've described a program as a list of expressions, but what is a list of expressions? The way we'd describe this in Bison would look something like this:

expressions: expressions expression  
           | expression

The | is kind of like an or. So in this case expressions are either an expression or what was previously determined to be expressions followed by another expression. The way this get's simplified is like this. Imagine we have 3 expressions following each other (I've added numbers to the end for clarity):

expression#1 expression#2 expression#3

When we parse this we're always trying to work our way to our starting non-terminal program, so we simplify the first expression to expressions

expressions expression#2 expression#3

Since an acceptable simplification of expressions would be expressions expression we can collapse them together:

expressions expression#3

Then we have a similar situation where we can simplify again to just:

expressions

Now since we can't simplify any futher expressions becomes program:

program

Bison files end with a .y extension due to Yacc. The format is very similar to our Flex file. The Bison manual defines it as:

%{
  Prologue
%}

Bison declarations

%%
Grammar rules  
%%

Epilogue  

To start our grammar, let's work downwards from the program definition. Make a file called parse.y and put the following code into it:

%{
    #include <stdio.h>
%}

%token tNUMBER
%start program

%%

program: expressions

expressions: expressions expression  
           | expression

expression: tNUMBER { printf("PARSED(%d)\n", $1); }  

Here we've defined a token tNUMBER which is a terminal. A common idiom is to prefix your tokens with the letter t. If a tNUMBER token is matched it will print $1 which is a reference back to tNUMBER because it's in the first position. If we had a larger simplification like expressions expression then expressions would be stored in $1 and expression in $2 due to their positions. We'll see more $ variables when our patterns get larger.

We've also defined the non-terminals program, expressions and expression. All of these non-terminals will eventually be expressible in tokens. A program contains expressions which is a set of expression which must be a tNUMBER.

Let's modify our Makefile to compile this now:

SRC=main.c parse.tab.c lex.yyc  
all: ruby                                     

ruby: ${SRC}  
> cc -o ruby ${SRC}

lex.yy.c: ruby.l  
> flex ruby.l

parse.tab.c: parse.y  
> bison -d parse.y

clean:  
> rm -rf ruby lex.yy.c parse.tab.c parse.tab.h

We've introduced SRC to make adding new files easier. Now as the list of files we need to compile grows we can specify them in one locations.

The -d flag for Bison produces the header file parse.tab.h. This header file will include definitions for tokens, like tNUMBER, which is useful for our lexer since all of the tokens will be defined in our parser.

If you run make you'll probably get this noise:

$ make
bison -d parse.y  
flex ruby.l  
cc -o ruby main.c lex.yy.c parse.tab.c  
parse.tab.c:1257:16: warning: implicit declaration of function  
  'yylex' is invalid in C99 [-Wimplicit-function-declaration]
      yychar = YYLEX;
               ^
parse.tab.c:601:16: note: expanded from macro 'YYLEX'  
# define YYLEX yylex ()
               ^
parse.tab.c:1385:7: warning: implicit declaration of function  
  'yyerror' is invalid in C99 [-Wimplicit-function-declaration]
      yyerror (YY_("syntax error"));
      ^
2 warnings generated.  
Undefined symbols for architecture x86_64:  
  "_yyerror", referenced from:
      _yyparse in parse-d9f59c.o
ld: symbol(s) not found for architecture x86_64  
clang: error: linker command failed with exit code 1 (use -v to  
  see invocation)
make: *** [ruby] Error 1  

The reason this is happening is because in order for Bison to work, it expects two functions to be defined - yylex and yyerror. We'll make use of extern again to reference yylex and we'll define a yyerror. At the top of our parse.y file add the following:

%{
  #include <stdio.h>
  extern int yylex(void);
  void yyerror(char const *s) { fprintf(stderr, "%s\n", s); }
%}

Now we'll need to have our main.c file call yyparse instead of yylex. At the top of our main.c add:

#include "parse.tab.h"

Replace:

extern int yylex(void)  

With:

extern int yyparse(void)  

And replace the call to yylex with yyparse. If we change program.rb to contain just the number 1, compile and run our program, then we'll see:

$./ruby program.rb 
NUMBER(1)  
syntax error  

Oops, we'll need to change our lexer to return the correct token. Open ruby.l and replace:

{NUMBER} { VTYPE("NUMBER", yytext); }

with:

{NUMBER} { yylval = atoi(yytext); return tNUMBER; }

yylval holds the value of our token, which is currently the integer value of yytext thanks to atoi. Next we return the token type tNUMBER.

The last change we need to make to ruby.l is to add:

#include "parse.tab.h"

Right under:

#include <stdio.h>

This lets our lexer knows about the tokens we've defined in our parser. When we compile and run our program now we should see:

./ruby program.rb
PARSED(1)  

From this point we're ready to expand our grammar, define more tokens, and refine our lexer to utilize them. Let's start simple. Let's change program.rb to contain:

1 + 3  

And when we run the program we want the output to be 4. Running the program currently outputs:

PARSED(1)  
PLUS  
PARSED(3)  

Notice that while PLUS is printed it wasn't parsed. It was instead printed from our lexer. We'll fix that by adding a token for PLUS that we can return from our lexer. Add:

%token tPLUS

Below the tNUMBER token. Then change:

"+" { TYPE("PLUS"); }  

to:

"+" { return tPLUS; }  

If you run make and run the program now we'll end up with a syntax error. The reason for this is that we haven't described in our grammar a valid "sentence" with plus in it. So let's extend out expression grammer a little. We're going to change:

expression: tNUMBER { printf("PARSED(%d)\n", $1); }  

to:

expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }

Running the program now will produce the output of 4. The reason for this is because when the parser encounters a tNUMBER it will just return that value (which is currently an integer). So when it sees an expression of the form expression tPLUS expression that's the same as 1 tPLUS 3 which in our case $1 = 1 and $3 = 3. If our program.rb had 2 + 4 in it then $1 = 2 and $3 = 4 and the output would be 6.

At this point we're ready to extend out grammar to cover more of the ruby language. In the next post We'll define some more complex expressions in our grammar so that:

# The Greeter class
class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

# Create a new object
g = Greeter.new("world")

# Output "Hello World!"
g.salute  

will be considered a valid program by our language.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 5 of this series.