Reconstructing Ruby, Part 8: Make our parser reentrant

Read Part 7 in case you missed it.

Reentrantcy basically means our parser doesn't use any global state. This will eventually allow us to write threaded programs. Since Ruby supports threading (kind of) we'll want to make our parser reentrant. We'll talk more about threading once we start working on the actual implementation, but for now, we'll just make the necessary changes to our parser.

Luckily bison provides a convenient directive for generating a reentrant parser. All we need to do is add the following to our parser:

%pure-parser

This makes it so yylval, yylloc, and other internal variables are no longer global variables but local. Since they are all now local variables we're we could potentially run yyparse in multiple threads without consequence.

At this point, running make will generate a bunch of errors. We'll tackle this set first:

parse.tab.c:1368:16: error: too many arguments to function  
  call, expected 0, have 1
      yychar = YYLEX;
               ^~~<del>
parse.tab.c:713:23: note: expanded from macro 'YYLEX'  
# define YYLEX yylex (&yylval)
               </del>~<del>  ^</del>~~~~
parse.y:29:3: note: 'yylex' declared here  
  extern int yylex(void);
  ^
1 error generated.  

The first error is telling us that yylex now takes an argument. In this case, yylval needs to be passed in. We'll need to change our extern to reflect this. Move this line:

  extern int yylex(void);

below our %union directive and change void to YYSTYPE *yylval like so:

%union {
  int ival;
  float fval;
  char *sval;
}

%{
  extern int yylex(YYSTYPE *yylval);
%}

We need to have this declaration below %union because the union directive will define the type YYSTYPE.

To fix the next error:

ruby.l:21:3: error: use of undeclared identifier 'yylval'  
{ yylval.sval = strdup(yytext); TOKEN(STRING); }

We'll need to change the definition of yylex in our lexer. It should now accept yylval as an argument. We'll do this by definining a YY_DECL macro in our ruby.l file like so:

%{
  #include <stdio.h>
  #include <string.h>
  #include "node.h"
  #include "parse.tab.h"
  #define YY_DECL int yylex(YYSTYPE *yylval)
  #define TOKEN(id) return t##id
  #define KEYWORD(id) return k##id
%}

If you run make now, you'll see it complains about yylval being a pointer and that we should use -> instead of .. Let's make those changes to our lexer rules:

\"([^"]|\\.)*\" { yylval->sval = strdup(yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { yylval->sval = strdup(yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { yylval->fval = atof(yytext); TOKEN(FLOAT); }
{NUMBER} { yylval->ival = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* { yylval->sval = strdup(yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* { yylval->sval = strdup(yytext); TOKEN(CONSTANT); }

If we run make one last time everything compiles correctly. We now have a reentrant parser!

One other thing that will be useful to setup now is to change our parser to take a parameter. This parameter will be a struct that holds information about the current parser status. We'll start simple by only storing the filename and line number for now.

First let's create the file node.h where we'll store our struct definition. In the future, we'll use this file to define some nodes for our Abstract Syntax Tree or AST. Inside of the node.h file, add the following:

typedef struct parser_state {  
    char *source_file;
    int source_line;
} parser_state;

In the actual ruby implementation, this struct is known as parser_params. source_file and source_line are both parser_ruby_sourcefile and parser_ruby_sourceline respectively.

Now that we've defined our struct we'll need to include node.h before we include parse.tab.h anywhere. You'll need to make this change in both main.c and ruby.l. You'll also want to include node.h at the top of parse.y.

Now right below %pure-parser add the following lines:

%parse-param { parser_state *state }
%lex-param { state }

The first line changes yyparseto take in an argument to manage it's state. The second line also passes this state to the lexer. We'll need to make some changes to yylex as well. First we'll need to change our forward declaration of yylex in parse.y to the following:

%{                                                       
  extern int yylex(YYSTYPE *yylval, parser_state *state);
%}                   

Then you'll change this line in ruby.l:

#define YY_DECL int yylex(YYSTYPE *yylval)

to:

#define YY_DECL int yylex(YYSTYPE *yylval, parser_state *state)

Then in order to pull this all together we'll need to change main so that it passes in a parser_state, and additionally, we'll set the filename:

int main(int argc, char *argv[]) {  
  parser_state state = { NULL, 0 };

  if(argc > 1) {
    state.source_file = strdup(argv[1]);
    yyin = fopen(argv[1], "r");
  }

  yyparse(&state);
  return EXIT_SUCCESS;
}

You'll also need to add #include <string.h> to the top of your main.c in order to support strdup. Lastly, in order to see our parser state working correctly change the program rule in parse.y to the following:

program: expressions { printf("%s\n", state->source_file); }  

Now let's try it out:

$ make
$ ./ruby program.rb
program.rb  

Nice it works!

One problem we currently have in our implementation is that we're leaking memory. In the next post I'll discuss setting up Valgrind, a memory profiler, using Docker (since OSX doesn't support Valgrind very well) in order to catch and fix our memory leaks.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 7: Expanding our grammar

Read Part 6 in case you missed it.

When we left off, we were getting a syntax error from running ./ruby program.rb. If you're up to date, then you should see the following:

ID(class)  
syntax error. Unexpected "class" on line 2  

We can see that the lexer is identifying class as an ID and that our grammar doesn't expect to see a token of typeID. Ideally, our lexer would treat this as a keyword since class is something special in Ruby.

The typical definition for a class in ruby looks something like:

class Foo  
  # expressions
end  

Since defining a class requires two keywords class and endlet's add them bost as keywords. In order to do this, we'll add the following rules to the top of rules in ruby.l:

class { KEYWORD(CLASS); }  
end { KEYWORD(END); }  

And then add the following macro below our TOKEN macro:

#define KEYWORD(id) return k##id

In our parse.y file we'll want to add tokens for our keywords:

%token kCLASS kEND

Now we'll add a rule to our grammar for defining classes. Modify expressions to contain the following:

expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND

Now when we run make && ./ruby program.rb we get the following:

CONSTANT(Greeter)  
ID(def)  
syntax error. Unexpected "def" on line 3  

For def, we'll follow the same process we did for class. The grammar rules will look a little different:

| kDEF tID expressions kEND
| kDEF tID tLPAREN tID tRPAREN expressions kEND

While this doesn't handle all of the cases yet, it handles both of our method definitions in our example app. Now when we run make && ./ruby program.rb we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
syntax error. Unexpected "@" on line 4  

Let's define rules in our grammar for variable:

variable: tID  
        | tAT tID

Then expand expression to contain:

| variable

Now our error message is:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
syntax error. Unexpected "=" on line 4  

So let's add assignment to our grammar.

assignment: variable tEQUAL expression  
expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND
          | kDEF tID expressions kEND
          | kDEF tID tLPAREN tID tRPAREN expressions kEND
          | variable
          | assignment

If we run make && ./ruby program.rb we see the dreaded shift/reduce...

parse.y: conflicts: 1 shift/reduce  

This one is fairly easy to resolve. In the case of equality we want everything to the right of the equal sign to be resolved first so we're going to add the following to parse.y:

%right tEQUAL

Now when we run make && ./ruby program.rb we don't see a shift reduce, but instead get the following error:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
syntax error. Unexpected "." on line 4  

Now we'll write grammar rules for all of the method calls in our example program:

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

And add method_call to our expression rule:

| method_call

Again these are very specific to our program, but we'll extend it later on. Now when we run our program we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
ID(capitalize)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(g)  
CONSTANT(Greeter)  
ID(new)  
STRING("world")  
ID(g)  
ID(salute)  

So, we no longer have any syntax errors, but we shouldn't be printing the token type and value. It would be great if we could both identify the token and pass the value like we did with numbers:

{NUMBER} { yylval = atoi(yytext); TOKEN(NUMBER); }

Let's change CONSTANT to do exactly this. Change:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); TOKEN(CONSTANT); }

to:

[A-Z][a-zA-Z0-9_]* { yylval = strdup(yytext); TOKEN(CONSTANT); }

At the top of ruby.l with the other #include directives, you'll also want to add:

#include <string.h>

Note: We want to duplicate yytext because it's not a value we control and it could be freed while we are using it. If you're running this on mingw32 you might not have strdup and will need to define your own.

If you run make you'll see the following warning:

ruby.l:24:10: warning: incompatible pointer to integer conversion  
assigning to 'YYSTYPE' (aka 'int') from 'char *' [-Wint-conversion]  

This is because we want to store a char * in yylval, but it's type is int. We ideally want to be able to store both types, depending on the token. Additionally, we have a float we'd like to store as well.

In order to do this we need to change yylval into a union. Let's add the following to parse.y above the %left precedence declaration:

%union {
  int ival;
  float fval;
  char *sval;
}

Then, we'll need to change all the references of yylval to use the union values. Let's remove all the VTYPE calls as well as the VTYPE directive. You should have the following:

\"([^"]|\\.)*\" { yylval.sval = strdup(yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { yylval.sval = strdup(yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) {
  yylval.fval = atof(yytext); TOKEN(FLOAT); }
{NUMBER} { yylval.ival = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* {yylval.sval = strdup(yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* {
  yylval.sval = strdup(yytext); TOKEN(CONSTANT); }

Just to make things easier in the future, let's refactor our parser grammar to be easier to read:

program: expressions

expressions: expressions expression  
           | expression

expression: class_definition  
          | binary_expression
          | method_definition
          | variable
          | assignment
          | method_call
          | value

binary_expression: expression tPLUS expression

assignment: variable tEQUAL expression

class_definition: kCLASS tCONSTANT expressions kEND

method_definition: kDEF tID expressions kEND  
                 | kDEF tID tLPAREN tID tRPAREN expressions kEND

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

variable: tID  
        | tAT tID

value: tNUMBER  

If you were to run make now, you'd see the following error:

parse.y:37.65-66: $1 of ‘binary_expression’ has no declared type  
parse.y:37.70-71: $3 of ‘binary_expression’ has no declared type  

This is happening because we have now defined yylval as a union and each rule needs to reduce to an acceptable values for each rule the terminal is used in. We haven't defined what $1 and $3 are supposed to be in:

binary_expression: expression tPLUS expression {  
  printf("%d\n", $1 + $3); }

For now, just remove the { printf("%d\n", $1 + $3); } part. Running make && ./ruby program.rb should produce no compile or runtime errors.

In the next post, we'll make our parser reentrant!

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 8 of this series.

Reconstructing Ruby, Part 6: Getting useful syntax errors

Read Part 5 in case you missed it.

We're going to start expanding our parser to better understand ruby programs. The example program we're going to target first is this one from the Ruby homepage:

# The Greeter class
class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

# Create a new object
g = Greeter.new("world")

# Output "Hello World!"
g.salute  

Let's changeyour program.rb file to match the code above and then run ./ruby program.rb. You should see the following failure:

ID(class)  
CONSTANT(Greeter)  
ID(def)  
ID(initialize)  
LPAREN  
ID(name)  
RPAREN  
AT  
ID(name)  
EQUAL  
ID(name)  
DOT  
ID(capitalize)  
ID(end)  
ID(def)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(end)  
ID(end)  
ID(g)  
EQUAL  
CONSTANT(Greeter)  
DOT  
ID(new)  
LPAREN  
STRING("world")  
RPAREN  
ID(g)  
DOT  
ID(salute)  
syntax error  

That's not a very descriptive error message. To get this to the point where we can actually get useful information about our errors let's move our yyerror function from parse.y to the bottom of ruby.l and make the following changes:

%%

void yyerror(char const *s) {  
  fprintf(stderr,
          "%s. Unexpected \"%s\" on line %d\n",
          s,
          yytext,
          yylineno);
}

yytext will contain the text of the current token and yylineno will contain the line number of that token. yylineno isn't provided by default so we'll have to specify the option in our ruby.l file. Right after the option for noyywrapadd:

%option yylineno

Now, much like we did for yylex in parse.y, we'll need to add the following:

extern void yyerror(const char *s);  

If you run make && ./ruby program.rb again you'll get something rather unexpected:

syntax error. Unexpected "" on line 17  

The reason the token text is empty is because we're not actually generating any tokens in our lexer (except for tNUMBER and tPLUS).

We're going to change our lexer to stop printing output and produce tokens instead. Everywhere in ruby.l where we have TYPE("SOMETHING") we're going to change to TOKEN(SOMETHING). Note the lack of double quotes there, this will be important when we define our macro. If you're using VIM you can type the following :%s/TYPE("\(.*\)")/TOKEN(\1)/<ENTER> to make the change, but you'll need to manually change return tPLUS to TOKEN(PLUS). Then for each of our VTYPE rules we'll also add a TOKEN call. Your rules will look like this:

%%
#.*$ {}
\"([^"]|\\.)*\" { VTYPE("STRING", yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { VTYPE("STRING", yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) {
  VTYPE("FLOAT", yytext); TOKEN(FLOAT); }
{NUMBER} { yylval = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); TOKEN(CONSTANT); }
"=" { TOKEN(EQUAL); }  
">" { TOKEN(GT); }  
"<" { TOKEN(LT); }  
">=" { TOKEN(GTE); }  
"<=" { TOKEN(LTE); }  
"!=" { TOKEN(NEQUAL); }  
"+" { TOKEN(PLUS); }  
"-" { TOKEN(MINUS); }  
"*" { TOKEN(MULT); }  
"/" { TOKEN(DIV); }  
"%" { TOKEN(MOD); }  
"!" { TOKEN(EMARK); }  
"?" { TOKEN(QMARK); }  
"&" { TOKEN(AND); }  
"|" { TOKEN(OR); }  
"[" { TOKEN(LSBRACE); }  
"]" { TOKEN(RSBRACE); }  
"(" { TOKEN(LPAREN); }  
")" { TOKEN(RPAREN); }  
"{" { TOKEN(LBRACE); }  
"}" { TOKEN(RBRACE); }  
"@" { TOKEN(AT); }  
"." { TOKEN(DOT); }  
"," { TOKEN(COMMA); }  
":" { TOKEN(COLON); }  
[\t ] {}
\n {}
. { fprintf(stderr, "Unknown token '%s'\n", yytext); }
%%

At the top of our file let's replace our TYPE macro with this TOKEN macro:

%define TOKEN(id) return t##id

If you haven't seen the ## before this is part of the c-preprocessor language which will prepend t in front of the value of id. Essentually TOKEN(EQUAL); will expand to return tEQUAL; which is the format of our tokens.

Now we need to define a tokens in our parser for all the tokens we're using in our lexer. Inside of parse.y change the lines:

%token tNUMBER
%token tPLUS

to:

%token tSTRING tFLOAT tNUMBER tID tCONSTANT tEQUAL tGT tLT tGTE tLTE
%token tNEQUAL tPLUS tMINUS tMULT tDIV tMOD tEMARK tQMARK tAND tOR
%token tLSBRACE tRSBRACE tLPAREN tRPAREN tLBRACE tRBRACE tAT tDOT 
%token tCOMMA tCOLON 

Now if you run make && ./ruby program.rb you should see the following syntax error:

ID(class)  
syntax error. Unexpected "class" on line 2  

This is a far more reasonable error message that we can start doing something about. In the next post we'll fix all our syntax errors to start filling out our grammar.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 7 of this series.