Reconstructing Ruby, Part 6: Getting useful syntax errors

Read Part 5 in case you missed it.

We're going to start expanding our parser to better understand Ruby programs. The example program we're going to target first is this one from the Ruby homepage:

# The Greeter class
class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

# Create a new object
g = Greeter.new("world")

# Output "Hello World!"
g.salute  

Let's change your program.rb file to match the code above and then run ./ruby program.rb. You should see the following failure:

ID(class)  
CONSTANT(Greeter)  
ID(def)  
ID(initialize)  
LPAREN  
ID(name)  
RPAREN  
AT  
ID(name)  
EQUAL  
ID(name)  
DOT  
ID(capitalize)  
ID(end)  
ID(def)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(end)  
ID(end)  
ID(g)  
EQUAL  
CONSTANT(Greeter)  
DOT  
ID(new)  
LPAREN  
STRING("world")  
RPAREN  
ID(g)  
DOT  
ID(salute)  
syntax error  

That's not a very descriptive error message. To get useful information about our errors, let's move our yyerror function from parse.y to the bottom of ruby.l and make the following changes:

%%

void yyerror(char const *s) {  
  fprintf(stderr,
          "%s. Unexpected \"%s\" on line %d\n",
          s,
          yytext,
          yylineno);
}

yytext will contain the text of the current token and yylineno will contain the line number of that token. yylineno isn't provided by default, so we'll have to enable it with an option in our ruby.l file. Right after the option for noyywrap, add:

%option yylineno

Now, much like we did for yylex, we'll need to declare yyerror in parse.y, since the function itself now lives in ruby.l:

extern void yyerror(const char *s);  

If you run make && ./ruby program.rb again you'll get something rather unexpected:

syntax error. Unexpected "" on line 17  

The token text is empty because we're not actually generating any tokens in our lexer (except for tNUMBER and tPLUS). The lexer prints each token but never returns it, so the first thing the parser sees is the end of the file.
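The "line 17" in that message is probably explained the same way. With %option yylineno, flex starts the counter at 1 and increments it for every newline it consumes, so after swallowing the whole file the counter sits one past the last line (assuming program.rb is the 16-line listing above and ends with a newline). The sketch below imitates that counting in plain C; line_at_offset is a hypothetical helper for illustration, not part of flex.

```c
#include <stddef.h>

/* Hypothetical helper (not part of flex): returns the 1-based line number
 * a scanner would report after consuming the first `consumed` bytes of
 * `src`, counting the way yylineno does: start at 1, add 1 per newline. */
int line_at_offset(const char *src, size_t consumed) {
    int line = 1;
    for (size_t i = 0; i < consumed && src[i] != '\0'; i++) {
        if (src[i] == '\n') {
            line++;
        }
    }
    return line;
}
```

For a two-line input such as "a\nb\n", consuming all four bytes gives line 3, the same one-past-the-end effect as the error above.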

We're going to change our lexer to stop printing output and produce tokens instead. Everywhere in ruby.l where we have TYPE("SOMETHING"), we're going to change it to TOKEN(SOMETHING). Note the lack of double quotes there; this will be important when we define our macro. If you're using Vim you can type :%s/TYPE("\(.*\)")/TOKEN(\1)/<ENTER> to make the change, but you'll need to manually change return tPLUS to TOKEN(PLUS). Then for each of our VTYPE rules we'll also add a TOKEN call. Your rules will look like this:

%%
#.*$ {}
\"([^"]|\\.)*\" { VTYPE("STRING", yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { VTYPE("STRING", yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) {
  VTYPE("FLOAT", yytext); TOKEN(FLOAT); }
{NUMBER} { yylval = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); TOKEN(CONSTANT); }
"=" { TOKEN(EQUAL); }  
">" { TOKEN(GT); }  
"<" { TOKEN(LT); }  
">=" { TOKEN(GTE); }  
"<=" { TOKEN(LTE); }  
"!=" { TOKEN(NEQUAL); }  
"+" { TOKEN(PLUS); }  
"-" { TOKEN(MINUS); }  
"*" { TOKEN(MULT); }  
"/" { TOKEN(DIV); }  
"%" { TOKEN(MOD); }  
"!" { TOKEN(EMARK); }  
"?" { TOKEN(QMARK); }  
"&" { TOKEN(AND); }  
"|" { TOKEN(OR); }  
"[" { TOKEN(LSBRACE); }  
"]" { TOKEN(RSBRACE); }  
"(" { TOKEN(LPAREN); }  
")" { TOKEN(RPAREN); }  
"{" { TOKEN(LBRACE); }  
"}" { TOKEN(RBRACE); }  
"@" { TOKEN(AT); }  
"." { TOKEN(DOT); }  
"," { TOKEN(COMMA); }  
":" { TOKEN(COLON); }  
[\t ] {}
\n {}
. { fprintf(stderr, "Unknown token '%s'\n", yytext); }
%%

At the top of our file let's replace our TYPE macro with this TOKEN macro:

#define TOKEN(id) return t##id

If you haven't seen ## before, it's the C preprocessor's token-pasting operator, which here glues t onto the front of whatever is passed as id. Essentially, TOKEN(EQUAL); will expand to return tEQUAL;, which is the format of our tokens.
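To see the expansion in isolation, here is a small standalone sketch. The enum stands in for the token constants Bison generates, and lex_equal and lex_plus are made-up functions that play the role of lexer actions; none of these names come from the real ruby.l.

```c
/* Stand-in for the token constants Bison would generate
 * (the numeric values are illustrative assumptions). */
enum { tEQUAL = 258, tPLUS = 259 };

/* ## pastes the literal t onto the macro argument before compilation,
 * so TOKEN(EQUAL) becomes: return tEQUAL */
#define TOKEN(id) return t##id

/* Hypothetical stand-ins for the bodies of two lexer rules. */
int lex_equal(void) { TOKEN(EQUAL); }
int lex_plus(void)  { TOKEN(PLUS); }
```

This is also why the argument must not be quoted: TOKEN("EQUAL") would try to paste t onto a string literal, which does not form a valid token, and compilers such as GCC reject it.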

Now we need to define tokens in our parser for all the tokens we're using in our lexer. Inside of parse.y change the lines:

%token tNUMBER
%token tPLUS

to:

%token tSTRING tFLOAT tNUMBER tID tCONSTANT tEQUAL tGT tLT tGTE tLTE
%token tNEQUAL tPLUS tMINUS tMULT tDIV tMOD tEMARK tQMARK tAND tOR
%token tLSBRACE tRSBRACE tLPAREN tRPAREN tLBRACE tRBRACE tAT tDOT 
%token tCOMMA tCOLON 
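For orientation, each %token name becomes a plain integer constant that the lexer and parser share through the header Bison generates. The sketch below is hand-written, not actual Bison output: the enum mirrors only the first few declarations (assuming Bison's usual numbering, where user tokens start at 258), and token_name is a hypothetical debugging helper that is not part of this series' code.

```c
/* Hand-written approximation of what Bison emits for the %token lines.
 * Assumption: user-defined tokens are numbered from 258 upward. */
enum token_sketch {
    tSTRING = 258,
    tFLOAT,
    tNUMBER,
    tID,
    tCONSTANT
};

/* Hypothetical helper: map a token code back to its name for debugging. */
const char *token_name(int token) {
    switch (token) {
    case tSTRING:   return "tSTRING";
    case tFLOAT:    return "tFLOAT";
    case tNUMBER:   return "tNUMBER";
    case tID:       return "tID";
    case tCONSTANT: return "tCONSTANT";
    default:        return "unknown";
    }
}
```

Because the constants are ordinary integers, return tID in the lexer and a tID reference in a grammar rule refer to the same value.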

Now if you run make && ./ruby program.rb you should see the following syntax error:

ID(class)  
syntax error. Unexpected "class" on line 2  

This is a far more reasonable error message that we can start doing something about. In the next post we'll fix all our syntax errors to start filling out our grammar.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback, I'd greatly appreciate it if you left a comment!

Update: read Part 7 of this series.