Reconstructing Ruby, Part 7: Expanding our grammar

Read Part 6 in case you missed it.

When we left off, we were getting a syntax error from running ./ruby program.rb. If you're up to date, then you should see the following:

ID(class)  
syntax error. Unexpected "class" on line 2  

We can see that the lexer is identifying class as an ID and that our grammar doesn't expect to see a token of typeID. Ideally, our lexer would treat this as a keyword since class is something special in Ruby.

The typical definition for a class in ruby looks something like:

class Foo  
  # expressions
end  

Since defining a class requires two keywords class and endlet's add them bost as keywords. In order to do this, we'll add the following rules to the top of rules in ruby.l:

class { KEYWORD(CLASS); }  
end { KEYWORD(END); }  

And then add the following macro below our TOKEN macro:

#define KEYWORD(id) return k##id

In our parse.y file we'll want to add tokens for our keywords:

%token kCLASS kEND

Now we'll add a rule to our grammar for defining classes. Modify expressions to contain the following:

expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND

Now when we run make && ./ruby program.rb we get the following:

CONSTANT(Greeter)  
ID(def)  
syntax error. Unexpected "def" on line 3  

For def, we'll follow the same process we did for class. The grammar rules will look a little different:

| kDEF tID expressions kEND
| kDEF tID tLPAREN tID tRPAREN expressions kEND

While this doesn't handle all of the cases yet, it handles both of our method definitions in our example app. Now when we run make && ./ruby program.rb we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
syntax error. Unexpected "@" on line 4  

Let's define rules in our grammar for variable:

variable: tID  
        | tAT tID

Then expand expression to contain:

| variable

Now our error message is:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
syntax error. Unexpected "=" on line 4  

So let's add assignment to our grammar.

assignment: variable tEQUAL expression  
expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND
          | kDEF tID expressions kEND
          | kDEF tID tLPAREN tID tRPAREN expressions kEND
          | variable
          | assignment

If we run make && ./ruby program.rb we see the dreaded shift/reduce...

parse.y: conflicts: 1 shift/reduce  

This one is fairly easy to resolve. In the case of equality we want everything to the right of the equal sign to be resolved first so we're going to add the following to parse.y:

%right tEQUAL

Now when we run make && ./ruby program.rb we don't see a shift reduce, but instead get the following error:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
syntax error. Unexpected "." on line 4  

Now we'll write grammar rules for all of the method calls in our example program:

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

And add method_call to our expression rule:

| method_call

Again these are very specific to our program, but we'll extend it later on. Now when we run our program we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
ID(capitalize)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(g)  
CONSTANT(Greeter)  
ID(new)  
STRING("world")  
ID(g)  
ID(salute)  

So, we no longer have any syntax errors, but we shouldn't be printing the token type and value. It would be great if we could both identify the token and pass the value like we did with numbers:

{NUMBER} { yylval = atoi(yytext); TOKEN(NUMBER); }

Let's change CONSTANT to do exactly this. Change:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); TOKEN(CONSTANT); }

to:

[A-Z][a-zA-Z0-9_]* { yylval = strdup(yytext); TOKEN(CONSTANT); }

At the top of ruby.l with the other #include directives, you'll also want to add:

#include <string.h>

Note: We want to duplicate yytext because it's not a value we control and it could be freed while we are using it. If you're running this on mingw32 you might not have strdup and will need to define your own.

If you run make you'll see the following warning:

ruby.l:24:10: warning: incompatible pointer to integer conversion  
assigning to 'YYSTYPE' (aka 'int') from 'char *' [-Wint-conversion]  

This is because we want to store a char * in yylval, but it's type is int. We ideally want to be able to store both types, depending on the token. Additionally, we have a float we'd like to store as well.

In order to do this we need to change yylval into a union. Let's add the following to parse.y above the %left precedence declaration:

%union {
  int ival;
  float fval;
  char *sval;
}

Then, we'll need to change all the references of yylval to use the union values. Let's remove all the VTYPE calls as well as the VTYPE directive. You should have the following:

\"([^"]|\\.)*\" { yylval.sval = strdup(yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { yylval.sval = strdup(yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) {
  yylval.fval = atof(yytext); TOKEN(FLOAT); }
{NUMBER} { yylval.ival = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* {yylval.sval = strdup(yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* {
  yylval.sval = strdup(yytext); TOKEN(CONSTANT); }

Just to make things easier in the future, let's refactor our parser grammar to be easier to read:

program: expressions

expressions: expressions expression  
           | expression

expression: class_definition  
          | binary_expression
          | method_definition
          | variable
          | assignment
          | method_call
          | value

binary_expression: expression tPLUS expression

assignment: variable tEQUAL expression

class_definition: kCLASS tCONSTANT expressions kEND

method_definition: kDEF tID expressions kEND  
                 | kDEF tID tLPAREN tID tRPAREN expressions kEND

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

variable: tID  
        | tAT tID

value: tNUMBER  

If you were to run make now, you'd see the following error:

parse.y:37.65-66: $1 of ‘binary_expression’ has no declared type  
parse.y:37.70-71: $3 of ‘binary_expression’ has no declared type  

This is happening because we have now defined yylval as a union and each rule needs to reduce to an acceptable values for each rule the terminal is used in. We haven't defined what $1 and $3 are supposed to be in:

binary_expression: expression tPLUS expression {  
  printf("%d\n", $1 + $3); }

For now, just remove the { printf("%d\n", $1 + $3); } part. Running make && ./ruby program.rb should produce no compile or runtime errors.

In the next post, we'll make our parser reentrant!

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 8 of this series.