Reconstructing Ruby, Part 3: More Tokens

Please check out Part 2 if you missed it!

In this post we're going to be adding a lot of tokens to our lexer. Most of these tokens will be rather straight-forward, but a few will be rather interesting. Let's start by making it easier for us to print tokens. Add the following C macros in the directives section of our Flex file right below #include <stdio.h>:

#define VTYPE(type, value) printf("%s(%s)\n", type, value)
#define TYPE(type) printf("%s\n", type)

We're using VTYPE to mean a token type that has a value (e.g. NUMBER(4), STRING("Hello Wombat"), ID(wombat)). A token using TYPE will not have an associated value (e.g. =). Let's update our number rule to make use of VTYPE:

[0-9]+ { VTYPE("NUMBER", yytext); }

Compiling and running our program should produce:

$ ./ruby program.rb 
NUMBER(4)  
NUMBER(8)  
NUMBER(15)  
NUMBER(16)  
NUMBER(23)  
NUMBER(42)  

Let's change our program.rb file to contain:

x = 5  

Running our program produces:

$ ./ruby program.rb 
Unknown token x  
Unknown token  
Unknown token =  
Unknown token  
NUMBER(5)  

Since spaces are a bit difficult to see, let's make one small modification to our Unknown token string:

. { fprintf(stderr, "Unknown token '%s'\n", yytext); }

Now when we run our program we get:

$ ./ruby program.rb 
Unknown token 'x'  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  

Let's fix each of these unknown tokens in order. First, in Ruby, x is what we call an identifier. An identifier is something Ruby needs to lookup in order to determine whether it's a local variable, a method, or if it's undefined. An identifier in ruby specifically starts with a lowercase letter or an underscore and contains lowercase and uppercase letters as well as underscores and numbers in the rest of the name. We'll write a rule to capture this:

[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); }

Running the program again we get:

$ ./ruby program.rb 
ID(x)  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  

We can handle whitespace with:

[\t ] {}

Lastly we'll add a rule for the =:

"=" { TYPE("EQUAL"); }  

Now we get:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5)  

Fantastic! Our lexer recognizes our entire program! Let's add some additional TYPE token rules:

">" { TYPE("GT"); }  
"<" { TYPE("LT"); }  
">=" { TYPE("GTE"); }  
"<=" { TYPE("LTE"); }  
"!=" { TYPE("NEQUAL"); }  
"+" { TYPE("PLUS"); }  
"-" { TYPE("MINUS"); }  
"*" { TYPE("MULT"); }  
"/" { TYPE("DIV"); }  
"%" { TYPE("MOD"); }  
"!" { TYPE("EMARK"); }  
"?" { TYPE("QMARK"); }  
"&" { TYPE("AND"); }  
"|" { TYPE("OR"); }  
"[" { TYPE("LSBRACE"); }  
"]" { TYPE("RSBRACE"); }  
"(" { TYPE("LPAREN"); }  
")" { TYPE("RPAREN"); }  
"{" { TYPE("LBRACE"); }  
"}" { TYPE("RBRACE"); }  
"@" { TYPE("AT"); }  
"." { TYPE("DOT"); }  
"," { TYPE("COMMA"); }  
":" { TYPE("COLON"); }  

Now that we have a bunch ot TYPE tokens, let's write some VTYPE token rules. We'll start with floats. Floats are interesting because they have multiple forms. You can see here how Ruby determines if a number is a float (pardon the C-explosion):

for (;;) {  
  switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      nondigit = 0;
      tokadd(c);
      break;

    case '.':
      if (nondigit) goto trailing_uc;
      if (seen_point || seen_e) {
        goto decode_num;
      } else {
        int c0 = nextc();
        if (c0 == -1 || !ISDIGIT(c0)) {
          pushback(c0);
          goto decode_num;
        }
        c = c0;
      }
      seen_point = toklen();
      tokadd('.');
      tokadd(c);
      is_float++;
      nondigit = 0;
      break;

    case 'e':
    case 'E':
      if (nondigit) {
        pushback(c);
        c = nondigit;
        goto decode_num;
      }
      if (seen_e) {
        goto decode_num;
      }
      nondigit = c;
      c = nextc();
      if (c != '-' && c != '+' && !ISDIGIT(c)) {
        pushback(c);
        nondigit = 0;
        goto decode_num;
      }
      tokadd(nondigit);
      seen_e++;
      is_float++;
      tokadd(c);
      nondigit = (c == '-' || c == '+') ? c : 0;
      break;

    case '_':   /* `_' in number just ignored */
      if (nondigit) goto decode_num;
      nondigit = c;
      break;

    default:
      goto decode_num;

There are a couple of interesting things happening in that function. First off if we see a . or an e or E we know we're talking about a float. Ruby designates this by is_float++ (since 0 is false in C). If Ruby sees an e or E it reads the next character to see if it's a + or - or a digit. If it's none of those we decode the number (which will generate an error at that point since the e-form needs a number afterwards.

There's a case we're not handling for regular numbers. In Ruby, a number can have _ in it. Ruby ignores this underscore when parsing numbers (but only a single _; while 1_2 is a valid number 1__2 is not). Before we get to floats, let's modify our definition of number to handle the case with _'s. If we change program.rb to say:

x = 5_1  

Then our output is:

$ ./ruby program.rb 
ID(x)  
EQUAL  
NUMBER(5)  
ID(_1)  

We can fix this by modifying our rule for NUMBER:

[0-9](_[0-9]|[0-9])* { VTYPE("NUMBER", yytext); }

Now we'll get the output:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5_1)  

Great! It would also be great if we could get rid of the underscore in our number, but we'll handle that in the future.

Moving forward, having a way to reference our NUMBER pattern will be useful. We'll want to make use of it in our FLOAT rule. So we're going to pull out the pattern into a definition. Definitions are reusable patterns that we can reference in Flex by wrapping them like '{DEFINITION}'. Definitions go at the top of our Flex file along with any directives. So right below:

%option noyywrap 

we'll add:

NUMBER [0-9](_[0-9]|[0-9])*  

So here we've created a definition with the name NUMBER that we can reference in our rules. Let's change our current number rule to use the definition we wrote:

{NUMBER} { VTYPE("NUMBER", yytext); }

Now that we've pulled out a definition for numbers, we can use it to write a rule for handling floats. First, let's add all the forms of floats we can come up with to our program.rb:

1.0  
1.01  
1_0.0  
1_0.0_1  
1e1  
1.0e1  
1.0e+1  
1.0e-1  
1.0E1  
1.0E+1  
1.0E-1  
1_1_1.0_1_4E+1_4_0  
10  
1_0  

This should indicate fairly well if our float rule covers all of our cases. A couple of regular numbers were added to make sure we're still capturing regular numbers correctly. Let's add the following rule right above our number rule:

{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { VTYPE("FLOAT", yytext); }

Let's run that against our program:

$ ./ruby program.rb
FLOAT(1.0)  
FLOAT(1.01)  
FLOAT(1_0.0)  
FLOAT(1_0.0_1)  
FLOAT(1e1)  
FLOAT(1.0e1)  
FLOAT(1.0e+1)  
FLOAT(1.0e-1)  
FLOAT(1.0E1)  
FLOAT(1.0E+1)  
FLOAT(1.0E-1)  
FLOAT(1_1_1.0_1_4E+1_4_0)  
NUMBER(10)  
NUMBER(1_0)  

Great! It looks like everything is in order for floats. Now that we have floats working, we can add the rules for strings. Let's start with the rule:

\"[^"]*\" { VTYPE("STRING", yytext); }

This rule says we're looking for a double quote followed by zero or more characters that are not a double quote followed by a double quote. That will work for strings like "Hello World", but what about a string like "Hello \"World\"". The general rule about using \ as an escape is that any character can follow it. So let's modify our rule just a little to reflect this:

\"([^"]|\\.)*\" { VTYPE("STRING", yytext); }

Now we're saying that a string contains any character that is not a double quote, or any character preceeded by a \. The last thing we'll need to do is repeat the same thing for single quotes.

\'([^']|\\.)*\' { VTYPE("STRING", yytext); }

In the future, we'll need to take into account the fact that single quoted strings are not interpolated and double quoted are. Additionally there's a bunch of string madness around %, but we're going to ignore that for now.

Let's move on to constants. Constants are just like our ids except they'll start with a capital letter:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); }

Alright, now that we have our rules, let's try them out against a large ruby program. Let's change our program.rb to the following:

class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

g = Greeter.new("world")

g.salute  

It we run our lexer against it we get the following:

./ruby program.rb 
ID(class)  
CONSTANT(Greeter)  
ID(def)  
ID(initialize)  
LPAREN  
ID(name)  
RPAREN  
AT  
ID(name)  
EQUAL  
ID(name)  
DOT  
ID(capitalize)  
ID(end)  
ID(def)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(end)  
ID(end)  
ID(g)  
EQUAL  
CONSTANT(Greeter)  
DOT  
ID(new)  
LPAREN  
STRING("world")  
RPAREN  
ID(g)  
DOT  
ID(salute)  

Let's compare that to ripper's output:

irb  
> require 'pp'
=> true
> require 'ripper'
=> true
> pp Ripper.lex(File.read("program.rb"))
[[[1, 0], :on_kw, "class"],
 [[1, 5], :on_sp, " "],
 [[1, 6], :on_const, "Greeter"],
 [[1, 13], :on_nl, "\n"],
 [[2, 0], :on_sp, "  "],
 [[2, 2], :on_kw, "def"],
 [[2, 5], :on_sp, " "],
 [[2, 6], :on_ident, "initialize"],
 [[2, 16], :on_lparen, "("],
 [[2, 17], :on_ident, "name"],
 [[2, 21], :on_rparen, ")"],
 [[2, 22], :on_ignored_nl, "\n"],
 [[3, 0], :on_sp, "    "],
 [[3, 4], :on_ivar, "@name"],
 [[3, 9], :on_sp, " "],
 [[3, 10], :on_op, "="],
 [[3, 11], :on_sp, " "],
 [[3, 12], :on_ident, "name"],
 [[3, 16], :on_period, "."],
 [[3, 17], :on_ident, "capitalize"],
 [[3, 27], :on_nl, "\n"],
 [[4, 0], :on_sp, "  "],
 [[4, 2], :on_kw, "end"],
 [[4, 5], :on_nl, "\n"],
 [[5, 0], :on_ignored_nl, "\n"],
 [[6, 0], :on_sp, "  "],
 [[6, 2], :on_kw, "def"],
 [[6, 5], :on_sp, " "],
 [[6, 6], :on_ident, "salute"],
 [[6, 12], :on_nl, "\n"],
 [[7, 0], :on_sp, "    "],
 [[7, 4], :on_ident, "puts"],
 [[7, 8], :on_sp, " "],
 [[7, 9], :on_tstring_beg, "\""],
 [[7, 10], :on_tstring_content, "Hello "],
 [[7, 16], :on_embexpr_beg, "\#{"],
 [[7, 18], :on_ivar, "@name"],
 [[7, 23], :on_embexpr_end, "}"],
 [[7, 24], :on_tstring_content, "!"],
 [[7, 25], :on_tstring_end, "\""],
 [[7, 26], :on_nl, "\n"],
 [[8, 0], :on_sp, "  "],
 [[8, 2], :on_kw, "end"],
 [[8, 5], :on_nl, "\n"],
 [[9, 0], :on_kw, "end"],
 [[9, 3], :on_nl, "\n"],
 [[10, 0], :on_ignored_nl, "\n"],
 [[11, 0], :on_ident, "g"],
 [[11, 1], :on_sp, " "],
 [[11, 2], :on_op, "="],
 [[11, 3], :on_sp, " "],
 [[11, 4], :on_const, "Greeter"],
 [[11, 11], :on_period, "."],
 [[11, 12], :on_ident, "new"],
 [[11, 15], :on_lparen, "("],
 [[11, 16], :on_tstring_beg, "\""],
 [[11, 17], :on_tstring_content, "world"],
 [[11, 22], :on_tstring_end, "\""],
 [[11, 23], :on_rparen, ")"],
 [[11, 24], :on_nl, "\n"],
 [[12, 0], :on_ignored_nl, "\n"],
 [[13, 0], :on_ident, "g"],
 [[13, 1], :on_period, "."],
 [[13, 2], :on_ident, "salute"],
 [[13, 8], :on_nl, "\n"]]

Ripper is part of Ruby's stdlib, or standard library, and it provides a lot of insight into how ruby processes our code. While there are some large differences in how strings are being handled, keywords, and ivars, we'll cross those bridges when we get to them.

One last rule I'd like to add is for handling comments. Let's add this to the top of our rules:

#.*$ {}

Now comments will be ignored.

For here, our lexer is in a great state to start making our parser. In the next post we'll start generating our grammar for our parser.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 4 of this series.