Reconstructing Ruby, Part 5: The dreaded shift/reduce

Read Part 4 in case you missed it.

Sorry for the months of silence. While this post is going to be short, I'm getting back to working on new content for the series.

So one problem that you'll often see when writing a parser is a shift/reduce conflict. In fact, at this point, if you run make you'll probably see the following:

bison -d parse.y  
parse.y: conflicts: 1 shift/reduce  
flex ruby.l  
cc -o ruby main.c lex.yy.c parse.tab.c  

A shift/reduce conflict happens when we introduce an ambiguity in our grammar. An ambiguity means the parser doesn't know if it should shift the next terminal onto the stack or if it should reduce according to a specified rule. You might also encounter a reduce/reduce conflict, where the parser doesn't know which rule it should use to reduce; those usually point to a deeper problem in the grammar.

Given we have a shift/reduce conflict, it would be useful to see what's causing it. We can do this by changing our Makefile rule for parse.y to the following:
    bison -v -d parse.y

The -v flag is used to generate an output file which breaks down the rules and states of our parser. A state refers to how the parser has interpreted the different terminals so far. As it reduces terminals to non-terminals or shifts new tokens onto its stack, it changes which state it's in.

Now, when we run make it will generate the file parse.output. At the top of that file you should see something like:

State 8 conflicts: 1 shift/reduce  

If we look at state 8, we see the following:

state 8                                         

    5 expression: expression . tPLUS expression
    5           | expression tPLUS expression .

    tPLUS  shift, and go to state 7

    tPLUS     [reduce using rule 5 (expression)]
    $default  reduce using rule 5 (expression)

Here we can see that our parser is unsure whether it should shift a tPLUS token or reduce using rule 5 which reduces expression tPLUS expression to expression. A situation where this ambiguity would show up is in the program:

1 + 2 + 3  

Essentially our parser doesn't know if it should perform the operations as (1 + 2) + 3 or 1 + (2 + 3). While this isn't a problem mathematically for addition, we should do our best to eliminate ambiguities.
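For + the two groupings give the same answer, but for an operator like - (which we haven't added yet) they wouldn't, which is why the parser has to commit to one. A quick check in plain Ruby:

```ruby
# Addition gives the same answer either way.
left_first  = (1 + 2) + 3
right_first = 1 + (2 + 3)
puts left_first == right_first  # true

# Subtraction does not, so the grouping the parser picks matters.
a = (1 - 2) - 3
b = 1 - (2 - 3)
puts a  # -4
puts b  # 2
```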

Luckily for us, we can easily solve this by telling Bison how the "+" token should associate. In this case we'll make tPLUS left-associative, so everything to its left is reduced first (essentially choosing the (1 + 2) + 3 path). To solve this add:

%left tPLUS

right above the %token declarations. Now when we run make we'll no longer see the shift/reduce conflict. We'll continue to consult the parse.output file whenever a shift/reduce conflict arises.
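As the grammar grows, the same mechanism scales to several operators. Here's a sketch of what the declarations might eventually look like (tMINUS, tMULT, and tDIV are hypothetical tokens we haven't defined yet):

```
%left tPLUS tMINUS
%left tMULT tDIV
```

In Bison, tokens declared on a later %left line bind more tightly, so multiplication and division would take precedence over addition and subtraction.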

In the next post we'll introduce a simple Abstract Syntax Tree, or AST, and expand the grammar a bit to cover more of the Ruby language.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 4: Our First Parser

Read Part 3 in case you missed it.

Now that our lexer is in pretty good shape, we can start working on our parser. We'll be using Bison. Bison works very similarly to Flex, but instead of generating a lexer, it generates a parser.

Bison is an improvement on the program Yacc. It's an LALR parser generator. LALR stands for Look-Ahead LR, where LR means Left-to-right scan, Rightmost derivation, which is a mouthful. I'm not going to even try to explain this.

In programming, a grammar refers to what tokens are allowed to be next to each other, as well as what tokens, when next to each other, represent a higher level concept.

First let's describe a program. In Ruby, a program is just a list of expressions. In programming terms, an expression is a piece of code that returns a value. In Ruby pretty much everything will return a value so we can consider everything an expression.

Other languages have an alternative to expressions called statements. Statements are like expressions but do not have a return value. Ruby technically has no statements as everything returns a value.

To start our grammar, we'll say a program is just a list of expressions:

program: expressions  

program and expressions are both known as non-terminals. You can think of a non-terminal as "not a token". The tokens on the other hand we'll refer to as terminals.

The main idea here is that every non-terminal must eventually resolve to a terminal. To some degree that makes sense because our program is just built up of tokens, so we eventually want to get to a point where we're talking about those tokens. One way to generalize a parser is to say it takes an array of tokens and from them builds a tree-like structure known as an AST, or Abstract Syntax Tree. You can find out more about ASTs on this Wikipedia page.

An interesting part of defining a grammar is dealing with recursion. For instance, we've described a program as a list of expressions, but what is a list of expressions? The way we'd describe this in Bison would look something like this:

expressions: expressions expression  
           | expression

The | is kind of like an or. So in this case expressions is either an expression, or whatever was previously determined to be expressions followed by another expression. Here's how this gets simplified. Imagine we have 3 expressions following each other (I've added numbers to the end for clarity):

expression#1 expression#2 expression#3

When we parse this we're always trying to work our way to our starting non-terminal program, so we simplify the first expression to expressions

expressions expression#2 expression#3

Since an acceptable simplification of expressions would be expressions expression we can collapse them together:

expressions expression#3

Then we have a similar situation where we can simplify again to just:

expressions

Now since we can't simplify any further, expressions becomes program:

program
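Left recursion like this nests to the left, much like a left fold over the list of expressions. A small Ruby sketch (purely illustrative, nothing our parser will actually run) shows the shape it produces:

```ruby
exprs = ["expression#1", "expression#2", "expression#3"]

# expressions: expressions expression  -- each step wraps what we have
# so far with the next expression, nesting to the left.
tree = exprs.reduce { |expressions, expression| [expressions, expression] }

p tree  # [["expression#1", "expression#2"], "expression#3"]
```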
Bison files end with a .y extension due to Yacc. The format is very similar to our Flex file. The Bison manual defines it as:

%{
Prologue
%}

Bison declarations

%%
Grammar rules
%%
Epilogue

To start our grammar, let's work downwards from the program definition. Make a file called parse.y and put the following code into it:

%{
#include <stdio.h>
%}

%token tNUMBER
%start program

%%

program: expressions

expressions: expressions expression
           | expression

expression: tNUMBER { printf("PARSED(%d)\n", $1); }

Here we've defined a token tNUMBER which is a terminal. A common idiom is to prefix your tokens with the letter t. If a tNUMBER token is matched, it will print $1, which refers to the value of the tNUMBER because it's in the first position. If we had a larger rule like expressions expression, then expressions would be stored in $1 and expression in $2 due to their positions. We'll see more $ variables when our patterns get larger.

We've also defined the non-terminals program, expressions and expression. All of these non-terminals will eventually be expressible in tokens. A program contains expressions which is a set of expression which must be a tNUMBER.

Let's modify our Makefile to compile this now:

SRC=main.c lex.yy.c parse.tab.c

all: ruby

ruby: ${SRC}
> cc -o ruby ${SRC}

lex.yy.c: ruby.l parse.y
> bison -d parse.y
> flex ruby.l

clean:
> rm -rf ruby lex.yy.c parse.tab.c parse.tab.h

We've introduced SRC to make adding new files easier. Now as the list of files we need to compile grows we can specify them in one location.

The -d flag for Bison produces the header file parse.tab.h. This header file will include definitions for tokens, like tNUMBER, which is useful for our lexer since all of the tokens will be defined in our parser.

If you run make you'll probably get this noise:

$ make
bison -d parse.y  
flex ruby.l  
cc -o ruby main.c lex.yy.c parse.tab.c  
warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
      yychar = YYLEX;
               ^
note: expanded from macro 'YYLEX'  
# define YYLEX yylex ()
               ^
warning: implicit declaration of function 'yyerror' is invalid in C99 [-Wimplicit-function-declaration]  
      yyerror (YY_("syntax error"));
2 warnings generated.  
Undefined symbols for architecture x86_64:  
  "_yyerror", referenced from:
      _yyparse in parse-d9f59c.o
ld: symbol(s) not found for architecture x86_64  
clang: error: linker command failed with exit code 1 (use -v to see invocation)  
make: *** [ruby] Error 1  

The reason this is happening is that in order for Bison to work, it expects two functions to be defined: yylex and yyerror. We'll make use of extern again to reference yylex and we'll define a yyerror ourselves. Change the prologue at the top of our parse.y file to the following:

%{
#include <stdio.h>

extern int yylex(void);
void yyerror(char const *s) { fprintf(stderr, "%s\n", s); }
%}

Now we'll need to have our main.c file call yyparse instead of yylex. At the top of our main.c add:

extern int yyparse(void);

And replace the call to yylex with yyparse. If we change program.rb to contain just the number 1, then compile and run our program, we'll see:

$ ./ruby program.rb
syntax error  

Oops, we'll need to change our lexer to return the correct token. Open ruby.l and replace:

{NUMBER} { VTYPE("NUMBER", yytext); }

with:

{NUMBER} { yylval = atoi(yytext); return tNUMBER; }

yylval holds the value of our token, which is currently the integer value of yytext thanks to atoi. Next we return the token type tNUMBER.

The last change we need to make to ruby.l is to add:

#include "parse.tab.h"

Right under:

#include <stdio.h>

This lets our lexer know about the tokens we've defined in our parser. When we compile and run our program now we should see:

$ ./ruby program.rb
PARSED(1)
From this point we're ready to expand our grammar, define more tokens, and refine our lexer to utilize them. Let's start simple. Let's change program.rb to contain:

1 + 3  

And when we run the program we want the output to be 4. Running the program currently outputs:

$ ./ruby program.rb
PARSED(1)
PLUS
PARSED(3)
Notice that while PLUS is printed it wasn't parsed. It was instead printed from our lexer. We'll fix that by adding a token for PLUS that we can return from our lexer. Add:

%token tPLUS

Below the tNUMBER token. Then change:

"+" { TYPE("PLUS"); }

to:

"+" { return tPLUS; }

If you run make and run the program now, we'll end up with a syntax error. The reason for this is that we haven't described in our grammar a valid "sentence" with plus in it. So let's extend our expression grammar a little. We're going to change:

expression: tNUMBER { printf("PARSED(%d)\n", $1); }

to:

expression: tNUMBER
          | expression tPLUS expression { printf("%d\n", $1 + $3); }

Running the program now will produce the output of 4. The reason is that when the parser encounters a tNUMBER it will just return that value (which is currently an integer). So when it sees an expression of the form expression tPLUS expression, that's the same as 1 tPLUS 3, which in our case means $1 = 1 and $3 = 3. If our program.rb had 2 + 4 in it, then $1 = 2 and $3 = 4 and the output would be 6.
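To make the reduce step concrete, here's a toy Ruby simulation of a stack reducing expression tPLUS expression as it goes (an illustration only, not how Bison's generated tables actually work):

```ruby
def evaluate(tokens)
  stack = []
  tokens.each do |token|
    stack << token                        # shift
    # expression tPLUS expression -> expression { $$ = $1 + $3 }
    if stack.length >= 3 && stack[-2] == :tPLUS
      dollar1, _plus, dollar3 = stack.pop(3)
      stack << dollar1 + dollar3          # reduce
    end
  end
  stack.first
end

puts evaluate([1, :tPLUS, 3])             # 4
puts evaluate([1, :tPLUS, 2, :tPLUS, 3])  # 6, reduced left-to-right
```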

At this point we're ready to extend our grammar to cover more of the Ruby language. In the next post we'll define some more complex expressions in our grammar so that:

# The Greeter class
class Greeter
  def initialize(name)
    @name = name.capitalize

  def salute
    puts "Hello #{@name}!"

# Create a new object
g ="world")

# Output "Hello World!"

will be considered a valid program by our language.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 5 of this series.

Reconstructing Ruby, Part 3: More Tokens

Please check out Part 2 if you missed it!

In this post we're going to be adding a lot of tokens to our lexer. Most of these tokens will be straightforward, but a few will be rather interesting. Let's start by making it easier for us to print tokens. Add the following C macros in the directives section of our Flex file right below #include <stdio.h>:

#define VTYPE(type, value) printf("%s(%s)\n", type, value)
#define TYPE(type) printf("%s\n", type)

We're using VTYPE to mean a token type that has a value (e.g. NUMBER(4), STRING("Hello Wombat"), ID(wombat)). A token using TYPE will not have an associated value (e.g. =). Let's update our number rule to make use of VTYPE:

[0-9]+ { VTYPE("NUMBER", yytext); }

Compiling and running our program should produce:

$ ./ruby program.rb 

Let's change our program.rb file to contain:

x = 5  

Running our program produces:

$ ./ruby program.rb 
Unknown token x  
Unknown token  
Unknown token =  
Unknown token  

Since spaces are a bit difficult to see, let's make one small modification to our Unknown token string:

. { fprintf(stderr, "Unknown token '%s'\n", yytext); }

Now when we run our program we get:

$ ./ruby program.rb 
Unknown token 'x'  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  

Let's fix each of these unknown tokens in order. First, in Ruby, x is what we call an identifier. An identifier is something Ruby needs to look up in order to determine whether it's a local variable, a method, or undefined. An identifier in Ruby starts with a lowercase letter or an underscore, and the rest of the name can contain lowercase and uppercase letters as well as underscores and numbers. We'll write a rule to capture this:

[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); }
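If you want to convince yourself the pattern does what we claim, the same rule translates directly to an anchored Ruby regex:

```ruby
ID = /\A[a-z_][a-zA-Z0-9_]*\z/

puts ID.match?("x")         # true
puts ID.match?("_wombat9")  # true
puts ID.match?("Greeter")   # false, starts with a capital letter
```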

Running the program again we get:

$ ./ruby program.rb 
Unknown token ' '  
Unknown token '='  
Unknown token ' '  

We can handle whitespace with:

[\t ] {}

Lastly we'll add a rule for the =:

"=" { TYPE("EQUAL"); }  

Now we get:

$ ./ruby program.rb
ID(x)
EQUAL
NUMBER(5)
Fantastic! Our lexer recognizes our entire program! Let's add some additional TYPE token rules:

">" { TYPE("GT"); }  
"<" { TYPE("LT"); }  
">=" { TYPE("GTE"); }  
"<=" { TYPE("LTE"); }  
"!=" { TYPE("NEQUAL"); }  
"+" { TYPE("PLUS"); }  
"-" { TYPE("MINUS"); }  
"*" { TYPE("MULT"); }  
"/" { TYPE("DIV"); }  
"%" { TYPE("MOD"); }  
"!" { TYPE("EMARK"); }  
"?" { TYPE("QMARK"); }  
"&" { TYPE("AND"); }  
"|" { TYPE("OR"); }  
"[" { TYPE("LSBRACE"); }  
"]" { TYPE("RSBRACE"); }  
"(" { TYPE("LPAREN"); }  
")" { TYPE("RPAREN"); }  
"{" { TYPE("LBRACE"); }  
"}" { TYPE("RBRACE"); }  
"@" { TYPE("AT"); }  
"." { TYPE("DOT"); }  
"," { TYPE("COMMA"); }  
":" { TYPE("COLON"); }  

Now that we have a bunch of TYPE tokens, let's write some VTYPE token rules. We'll start with floats. Floats are interesting because they have multiple forms. You can see here how Ruby determines if a number is a float (pardon the C-explosion):

for (;;) {
  switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      nondigit = 0;
      break;

    case '.':
      if (nondigit) goto trailing_uc;
      if (seen_point || seen_e) {
        goto decode_num;
      } else {
        int c0 = nextc();
        if (c0 == -1 || !ISDIGIT(c0)) {
          goto decode_num;
        }
        c = c0;
      }
      seen_point = toklen();
      nondigit = 0;
      break;

    case 'e':
    case 'E':
      if (nondigit) {
        c = nondigit;
        goto decode_num;
      }
      if (seen_e) {
        goto decode_num;
      }
      nondigit = c;
      c = nextc();
      if (c != '-' && c != '+' && !ISDIGIT(c)) {
        nondigit = 0;
        goto decode_num;
      }
      nondigit = (c == '-' || c == '+') ? c : 0;
      break;

    case '_':   /* `_' in number just ignored */
      if (nondigit) goto decode_num;
      nondigit = c;
      break;

    default:
      goto decode_num;
  }
}

There are a couple of interesting things happening in that function. First off, if we see a . or an e or E we know we're talking about a float. Ruby designates this by is_float++ (since 0 is false in C). If Ruby sees an e or E it reads the next character to see if it's a +, a -, or a digit. If it's none of those we decode the number (which will generate an error at that point, since the e-form needs a number afterwards).

There's a case we're not handling for regular numbers. In Ruby, a number can have _ in it. Ruby ignores this underscore when parsing numbers (but only a single _; while 1_2 is a valid number 1__2 is not). Before we get to floats, let's modify our definition of number to handle the case with _'s. If we change program.rb to say:

x = 5_1  

Then our output is:

$ ./ruby program.rb
ID(x)
EQUAL
NUMBER(5)
ID(_1)
We can fix this by modifying our rule for NUMBER:

[0-9](_[0-9]|[0-9])* { VTYPE("NUMBER", yytext); }

Now we'll get the output:

$ ./ruby program.rb
ID(x)
EQUAL
NUMBER(5_1)

Great! It would also be nice if we could strip the underscores out of the number's value, but we'll handle that in the future.
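We can sanity-check the new rule with an anchored Ruby translation of the same pattern:

```ruby
NUMBER = /\A[0-9](?:_[0-9]|[0-9])*\z/

puts NUMBER.match?("5_1")   # true
puts NUMBER.match?("1__2")  # false, only a single _ between digits
puts NUMBER.match?("_1")    # false, must start with a digit
```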

Moving forward, having a way to reference our NUMBER pattern will be useful. We'll want to make use of it in our FLOAT rule. So we're going to pull out the pattern into a definition. Definitions are reusable patterns that we can reference in Flex by wrapping them like '{DEFINITION}'. Definitions go at the top of our Flex file along with any directives. So right below:

%option noyywrap 

we'll add:

NUMBER [0-9](_[0-9]|[0-9])*  

So here we've created a definition with the name NUMBER that we can reference in our rules. Let's change our current number rule to use the definition we wrote:

{NUMBER} { VTYPE("NUMBER", yytext); }

Now that we've pulled out a definition for numbers, we can use it to write a rule for handling floats. First, let's add all the forms of floats we can come up with to our program.rb:


This should indicate fairly well if our float rule covers all of our cases. A couple of regular numbers were added to make sure we're still capturing regular numbers correctly. Let's add the following rule right above our number rule:

{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { VTYPE("FLOAT", yytext); }

Let's run that against our program:

$ ./ruby program.rb
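The rule also translates directly to an anchored Ruby regex if you want to poke at it by hand:

```ruby
number = /[0-9](?:_[0-9]|[0-9])*/
float_re = /\A#{number}(?:\.#{number}|(?:\.#{number})?[eE][+-]?#{number})\z/

puts float_re.match?("1.5")     # true
puts float_re.match?("1e10")    # true
puts float_re.match?("1.5e-3")  # true
puts float_re.match?("15")      # false, plain integers still go to NUMBER
```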

Great! It looks like everything is in order for floats. Now that we have floats working, we can add the rules for strings. Let's start with the rule:

\"[^"]*\" { VTYPE("STRING", yytext); }

This rule says we're looking for a double quote, followed by zero or more characters that are not a double quote, followed by a double quote. That will work for strings like "Hello World", but what about a string like "Hello \"World\""? The general rule about using \ as an escape is that any character can follow it. So let's modify our rule just a little to reflect this:

\"([^"]|\\.)*\" { VTYPE("STRING", yytext); }

Now we're saying that a string contains any character that is not a double quote, or any character preceded by a \. The last thing we'll need to do is repeat the same thing for single quotes.

\'([^']|\\.)*\' { VTYPE("STRING", yytext); }

In the future, we'll need to take into account the fact that single quoted strings are not interpolated and double quoted are. Additionally there's a bunch of string madness around %, but we're going to ignore that for now.
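Both string rules translate to anchored Ruby regexes too; note I've tightened the character class to also exclude the backslash, so escapes are only ever consumed by the \\. branch:

```ruby
dquote = /\A"(?:[^"\\]|\\.)*"\z/
squote = /\A'(?:[^'\\]|\\.)*'\z/

puts dquote.match?('"Hello World"')      # true
puts dquote.match?('"Hello \"World\""')  # true, escaped quotes are allowed
puts squote.match?("'single quoted'")    # true
```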

Let's move on to constants. Constants are just like our ids except they'll start with a capital letter:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); }

Alright, now that we have our rules, let's try them out against a larger Ruby program. Let's change our program.rb to the following:

class Greeter
  def initialize(name)
    @name = name.capitalize

  def salute
    puts "Hello #{@name}!"

g ="world")


If we run our lexer against it we get the following:

./ruby program.rb 
STRING("Hello #{@name}!")  

Let's compare that to ripper's output:

> require 'pp'
=> true
> require 'ripper'
=> true
> pp Ripper.lex("program.rb"))
[[[1, 0], :on_kw, "class"],
 [[1, 5], :on_sp, " "],
 [[1, 6], :on_const, "Greeter"],
 [[1, 13], :on_nl, "\n"],
 [[2, 0], :on_sp, "  "],
 [[2, 2], :on_kw, "def"],
 [[2, 5], :on_sp, " "],
 [[2, 6], :on_ident, "initialize"],
 [[2, 16], :on_lparen, "("],
 [[2, 17], :on_ident, "name"],
 [[2, 21], :on_rparen, ")"],
 [[2, 22], :on_ignored_nl, "\n"],
 [[3, 0], :on_sp, "    "],
 [[3, 4], :on_ivar, "@name"],
 [[3, 9], :on_sp, " "],
 [[3, 10], :on_op, "="],
 [[3, 11], :on_sp, " "],
 [[3, 12], :on_ident, "name"],
 [[3, 16], :on_period, "."],
 [[3, 17], :on_ident, "capitalize"],
 [[3, 27], :on_nl, "\n"],
 [[4, 0], :on_sp, "  "],
 [[4, 2], :on_kw, "end"],
 [[4, 5], :on_nl, "\n"],
 [[5, 0], :on_ignored_nl, "\n"],
 [[6, 0], :on_sp, "  "],
 [[6, 2], :on_kw, "def"],
 [[6, 5], :on_sp, " "],
 [[6, 6], :on_ident, "salute"],
 [[6, 12], :on_nl, "\n"],
 [[7, 0], :on_sp, "    "],
 [[7, 4], :on_ident, "puts"],
 [[7, 8], :on_sp, " "],
 [[7, 9], :on_tstring_beg, "\""],
 [[7, 10], :on_tstring_content, "Hello "],
 [[7, 16], :on_embexpr_beg, "\#{"],
 [[7, 18], :on_ivar, "@name"],
 [[7, 23], :on_embexpr_end, "}"],
 [[7, 24], :on_tstring_content, "!"],
 [[7, 25], :on_tstring_end, "\""],
 [[7, 26], :on_nl, "\n"],
 [[8, 0], :on_sp, "  "],
 [[8, 2], :on_kw, "end"],
 [[8, 5], :on_nl, "\n"],
 [[9, 0], :on_kw, "end"],
 [[9, 3], :on_nl, "\n"],
 [[10, 0], :on_ignored_nl, "\n"],
 [[11, 0], :on_ident, "g"],
 [[11, 1], :on_sp, " "],
 [[11, 2], :on_op, "="],
 [[11, 3], :on_sp, " "],
 [[11, 4], :on_const, "Greeter"],
 [[11, 11], :on_period, "."],
 [[11, 12], :on_ident, "new"],
 [[11, 15], :on_lparen, "("],
 [[11, 16], :on_tstring_beg, "\""],
 [[11, 17], :on_tstring_content, "world"],
 [[11, 22], :on_tstring_end, "\""],
 [[11, 23], :on_rparen, ")"],
 [[11, 24], :on_nl, "\n"],
 [[12, 0], :on_ignored_nl, "\n"],
 [[13, 0], :on_ident, "g"],
 [[13, 1], :on_period, "."],
 [[13, 2], :on_ident, "salute"],
 [[13, 8], :on_nl, "\n"]]

Ripper is part of Ruby's stdlib, or standard library, and it provides a lot of insight into how Ruby processes our code. While there are some large differences in how strings, keywords, and ivars are handled, we'll cross those bridges when we get to them.

One last rule I'd like to add is for handling comments. Let's add this to the top of our rules:

#.*$ {}

Now comments will be ignored.
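The comment pattern behaves the same way in Ruby, if you want a quick check:

```ruby
line = 'x = 5 # assign five'

# Everything from the # to the end of the line is dropped.
puts line.sub(/#.*$/, "").inspect  # "x = 5 "
```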

From here, our lexer is in a great state for us to start building our parser. In the next post we'll start working on the grammar for our parser.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 4 of this series.