Reconstructing Ruby, Part 3: More Tokens

Please check out Part 2 if you missed it!

In this post we're going to be adding a lot of tokens to our lexer. Most of these tokens will be straightforward, but a few will be more interesting. Let's start by making it easier for us to print tokens. Add the following C macros in the directives section of our Flex file, right below #include <stdio.h>:

#define VTYPE(type, value) printf("%s(%s)\n", type, value)
#define TYPE(type) printf("%s\n", type)

We're using VTYPE to mean a token type that has a value (e.g. NUMBER(4), STRING("Hello Wombat"), ID(wombat)). A token using TYPE will not have an associated value (e.g. =). Let's update our number rule to make use of VTYPE:

[0-9]+ { VTYPE("NUMBER", yytext); }
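For the curious, the preprocessor simply expands that action into a plain printf call, roughly:

printf("%s(%s)\n", "NUMBER", yytext);

which is what produces the NUMBER(4)-style lines below.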

Compiling and running our program should produce:

$ ./ruby program.rb 
NUMBER(4)  
NUMBER(8)  
NUMBER(15)  
NUMBER(16)  
NUMBER(23)  
NUMBER(42)  

Let's change our program.rb file to contain:

x = 5  

Running our program produces:

$ ./ruby program.rb 
Unknown token x  
Unknown token  
Unknown token =  
Unknown token  
NUMBER(5)  

Since spaces are a bit difficult to see, let's make one small modification to our Unknown token string:

. { fprintf(stderr, "Unknown token '%s'\n", yytext); }

Now when we run our program we get:

$ ./ruby program.rb 
Unknown token 'x'  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  

Let's fix each of these unknown tokens in order. First, in Ruby, x is what we call an identifier. An identifier is something Ruby needs to look up in order to determine whether it's a local variable, a method, or undefined. An identifier in Ruby starts with a lowercase letter or an underscore, and the rest of the name can contain lowercase and uppercase letters, underscores, and digits. We'll write a rule to capture this:

[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); }

Running the program again we get:

$ ./ruby program.rb 
ID(x)  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  

We can handle whitespace with:

[\t ] {}

Lastly we'll add a rule for the =:

"=" { TYPE("EQUAL"); }  

Now we get:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5)  

Fantastic! Our lexer recognizes our entire program! Let's add some additional TYPE token rules:

">" { TYPE("GT"); }  
"<" { TYPE("LT"); }  
">=" { TYPE("GTE"); }  
"<=" { TYPE("LTE"); }  
"!=" { TYPE("NEQUAL"); }  
"+" { TYPE("PLUS"); }  
"-" { TYPE("MINUS"); }  
"*" { TYPE("MULT"); }  
"/" { TYPE("DIV"); }  
"%" { TYPE("MOD"); }  
"!" { TYPE("EMARK"); }  
"?" { TYPE("QMARK"); }  
"&" { TYPE("AND"); }  
"|" { TYPE("OR"); }  
"[" { TYPE("LSBRACE"); }  
"]" { TYPE("RSBRACE"); }  
"(" { TYPE("LPAREN"); }  
")" { TYPE("RPAREN"); }  
"{" { TYPE("LBRACE"); }  
"}" { TYPE("RBRACE"); }  
"@" { TYPE("AT"); }  
"." { TYPE("DOT"); }  
"," { TYPE("COMMA"); }  
":" { TYPE("COLON"); }  

Now that we have a bunch of TYPE tokens, let's write some VTYPE token rules. We'll start with floats. Floats are interesting because they have multiple forms. You can see here how Ruby determines if a number is a float (pardon the C-explosion):

for (;;) {  
  switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      nondigit = 0;
      tokadd(c);
      break;

    case '.':
      if (nondigit) goto trailing_uc;
      if (seen_point || seen_e) {
        goto decode_num;
      } else {
        int c0 = nextc();
        if (c0 == -1 || !ISDIGIT(c0)) {
          pushback(c0);
          goto decode_num;
        }
        c = c0;
      }
      seen_point = toklen();
      tokadd('.');
      tokadd(c);
      is_float++;
      nondigit = 0;
      break;

    case 'e':
    case 'E':
      if (nondigit) {
        pushback(c);
        c = nondigit;
        goto decode_num;
      }
      if (seen_e) {
        goto decode_num;
      }
      nondigit = c;
      c = nextc();
      if (c != '-' && c != '+' && !ISDIGIT(c)) {
        pushback(c);
        nondigit = 0;
        goto decode_num;
      }
      tokadd(nondigit);
      seen_e++;
      is_float++;
      tokadd(c);
      nondigit = (c == '-' || c == '+') ? c : 0;
      break;

    case '_':   /* `_' in number just ignored */
      if (nondigit) goto decode_num;
      nondigit = c;
      break;

    default:
      goto decode_num;

There are a couple of interesting things happening in that function. First off, if we see a . or an e or E, we know we're looking at a float. Ruby records this with is_float++ (since 0 is false in C, any non-zero value counts as true). If Ruby sees an e or E, it reads the next character to see if it's a +, a -, or a digit. If it's none of those, we decode the number (which will generate an error at that point, since the e-form needs a number after it).

There's a case we're not handling for regular numbers. In Ruby, a number can have _ in it. Ruby ignores these underscores when parsing numbers, but only a single _ at a time: while 1_2 is a valid number, 1__2 is not. Before we get to floats, let's modify our definition of a number to handle underscores. If we change program.rb to say:

x = 5_1  

Then our output is:

$ ./ruby program.rb 
ID(x)  
EQUAL  
NUMBER(5)  
ID(_1)  

We can fix this by modifying our rule for NUMBER:

[0-9](_[0-9]|[0-9])* { VTYPE("NUMBER", yytext); }

Now we'll get the output:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5_1)  

Great! It would be nice if we could also get rid of the underscores inside the number itself, but we'll handle that in the future.
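As a small preview of how that might work, one option is to copy yytext into a scratch buffer while skipping the underscores. This is just a sketch of a hypothetical helper, not something we're wiring in yet:

/* Hypothetical helper: copy src into dst, dropping any '_' characters.
   dst must be at least as large as src. */
void strip_underscores(char *dst, const char *src) {
  while (*src) {
    if (*src != '_') {
      *dst++ = *src;
    }
    src++;
  }
  *dst = '\0';
}

We'd call something like this from the NUMBER action before printing, or later on when we convert the text into an actual integer.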

Moving forward, having a way to reference our NUMBER pattern will be useful. We'll want to make use of it in our FLOAT rule, so we're going to pull the pattern out into a definition. Definitions are reusable patterns that we can reference in Flex rules by wrapping their names in braces, like {DEFINITION}. Definitions go at the top of our Flex file along with any directives. So right below:

%option noyywrap 

we'll add:

NUMBER [0-9](_[0-9]|[0-9])*  

So here we've created a definition with the name NUMBER that we can reference in our rules. Let's change our current number rule to use the definition we wrote:

{NUMBER} { VTYPE("NUMBER", yytext); }

Now that we've pulled out a definition for numbers, we can use it to write a rule for handling floats. First, let's add all the forms of floats we can come up with to our program.rb:

1.0  
1.01  
1_0.0  
1_0.0_1  
1e1  
1.0e1  
1.0e+1  
1.0e-1  
1.0E1  
1.0E+1  
1.0E-1  
1_1_1.0_1_4E+1_4_0  
10  
1_0  

This should give us a fairly good indication of whether our float rule covers all of our cases. A couple of regular numbers are included to make sure we're still capturing those correctly. Let's add the following rule right above our number rule:

{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { VTYPE("FLOAT", yytext); }

Let's run that against our program:

$ ./ruby program.rb
FLOAT(1.0)  
FLOAT(1.01)  
FLOAT(1_0.0)  
FLOAT(1_0.0_1)  
FLOAT(1e1)  
FLOAT(1.0e1)  
FLOAT(1.0e+1)  
FLOAT(1.0e-1)  
FLOAT(1.0E1)  
FLOAT(1.0E+1)  
FLOAT(1.0E-1)  
FLOAT(1_1_1.0_1_4E+1_4_0)  
NUMBER(10)  
NUMBER(1_0)  

Great! It looks like everything is in order for floats. Now that we have floats working, we can add the rules for strings. Let's start with the rule:

\"[^"]*\" { VTYPE("STRING", yytext); }

This rule says we're looking for a double quote, followed by zero or more characters that are not a double quote, followed by a closing double quote. That will work for strings like "Hello World", but what about a string like "Hello \"World\""? The general rule for \ as an escape is that any character can follow it. So let's modify our rule just a little to reflect this:

\"([^"]|\\.)*\" { VTYPE("STRING", yytext); }

Now we're saying that a string contains any character that is not a double quote, or any character preceded by a \. The last thing we'll need to do is repeat the same thing for single quotes:

\'([^']|\\.)*\' { VTYPE("STRING", yytext); }
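If we want to convince ourselves the escape handling works, we can put a string with escaped quotes into program.rb:

"Hello \"World\""

and the lexer should report the whole thing as a single token:

$ ./ruby program.rb
STRING("Hello \"World\"")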

In the future, we'll need to take into account the fact that single-quoted strings are not interpolated while double-quoted strings are. Additionally, there's a bunch of string madness around %, but we're going to ignore that for now.

Let's move on to constants. Constants are just like our identifiers except they start with a capital letter:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); }

Alright, now that we have our rules, let's try them out against a larger Ruby program. Let's change our program.rb to the following:

class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

g = Greeter.new("world")

g.salute  

If we run our lexer against it we get the following:

./ruby program.rb 
ID(class)  
CONSTANT(Greeter)  
ID(def)  
ID(initialize)  
LPAREN  
ID(name)  
RPAREN  
AT  
ID(name)  
EQUAL  
ID(name)  
DOT  
ID(capitalize)  
ID(end)  
ID(def)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(end)  
ID(end)  
ID(g)  
EQUAL  
CONSTANT(Greeter)  
DOT  
ID(new)  
LPAREN  
STRING("world")  
RPAREN  
ID(g)  
DOT  
ID(salute)  

Let's compare that to ripper's output:

irb  
> require 'pp'
=> true
> require 'ripper'
=> true
> pp Ripper.lex(File.read("program.rb"))
[[[1, 0], :on_kw, "class"],
 [[1, 5], :on_sp, " "],
 [[1, 6], :on_const, "Greeter"],
 [[1, 13], :on_nl, "\n"],
 [[2, 0], :on_sp, "  "],
 [[2, 2], :on_kw, "def"],
 [[2, 5], :on_sp, " "],
 [[2, 6], :on_ident, "initialize"],
 [[2, 16], :on_lparen, "("],
 [[2, 17], :on_ident, "name"],
 [[2, 21], :on_rparen, ")"],
 [[2, 22], :on_ignored_nl, "\n"],
 [[3, 0], :on_sp, "    "],
 [[3, 4], :on_ivar, "@name"],
 [[3, 9], :on_sp, " "],
 [[3, 10], :on_op, "="],
 [[3, 11], :on_sp, " "],
 [[3, 12], :on_ident, "name"],
 [[3, 16], :on_period, "."],
 [[3, 17], :on_ident, "capitalize"],
 [[3, 27], :on_nl, "\n"],
 [[4, 0], :on_sp, "  "],
 [[4, 2], :on_kw, "end"],
 [[4, 5], :on_nl, "\n"],
 [[5, 0], :on_ignored_nl, "\n"],
 [[6, 0], :on_sp, "  "],
 [[6, 2], :on_kw, "def"],
 [[6, 5], :on_sp, " "],
 [[6, 6], :on_ident, "salute"],
 [[6, 12], :on_nl, "\n"],
 [[7, 0], :on_sp, "    "],
 [[7, 4], :on_ident, "puts"],
 [[7, 8], :on_sp, " "],
 [[7, 9], :on_tstring_beg, "\""],
 [[7, 10], :on_tstring_content, "Hello "],
 [[7, 16], :on_embexpr_beg, "\#{"],
 [[7, 18], :on_ivar, "@name"],
 [[7, 23], :on_embexpr_end, "}"],
 [[7, 24], :on_tstring_content, "!"],
 [[7, 25], :on_tstring_end, "\""],
 [[7, 26], :on_nl, "\n"],
 [[8, 0], :on_sp, "  "],
 [[8, 2], :on_kw, "end"],
 [[8, 5], :on_nl, "\n"],
 [[9, 0], :on_kw, "end"],
 [[9, 3], :on_nl, "\n"],
 [[10, 0], :on_ignored_nl, "\n"],
 [[11, 0], :on_ident, "g"],
 [[11, 1], :on_sp, " "],
 [[11, 2], :on_op, "="],
 [[11, 3], :on_sp, " "],
 [[11, 4], :on_const, "Greeter"],
 [[11, 11], :on_period, "."],
 [[11, 12], :on_ident, "new"],
 [[11, 15], :on_lparen, "("],
 [[11, 16], :on_tstring_beg, "\""],
 [[11, 17], :on_tstring_content, "world"],
 [[11, 22], :on_tstring_end, "\""],
 [[11, 23], :on_rparen, ")"],
 [[11, 24], :on_nl, "\n"],
 [[12, 0], :on_ignored_nl, "\n"],
 [[13, 0], :on_ident, "g"],
 [[13, 1], :on_period, "."],
 [[13, 2], :on_ident, "salute"],
 [[13, 8], :on_nl, "\n"]]

Ripper is part of Ruby's stdlib, or standard library, and it provides a lot of insight into how Ruby processes our code. While there are some notable differences in how strings, keywords, and ivars are handled, we'll cross those bridges when we get to them.

One last rule I'd like to add is for handling comments. Let's add this to the top of our rules:

#.*$ {}

Now comments will be ignored.
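For example, a hypothetical line like:

x = 5 # assign five to x

should now lex as if the comment weren't there:

$ ./ruby program.rb
ID(x)
EQUAL
NUMBER(5)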

From here, our lexer is in a great state to start building our parser. In the next post we'll start writing the grammar for our parser.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 2: Files and Errors

Please read Part 1 of this series if you haven't already.

Before we continue, it's worth mentioning that while we're writing a lexer using Flex, Ruby doesn't use Flex. Instead, Ruby implements a giant function that handles all of its tokenizing. I've chosen to use Flex because I find it more approachable than implementing a lexer from scratch, and it requires writing far less code. If possible, I'd like to avoid writing a 1265-line-long function. It would also take a large chunk of time that we could spend implementing more interesting concepts.

There are some likely reasons why Ruby didn't use a tool like Flex and chose to implement its own lexer. Lexers are not terribly difficult to implement, and a hand-written lexer allows for optimizations that you wouldn't be able to achieve with a general solution like Flex. Since I'm not worried about that marginal performance gain, we'll continue to use Flex.

When we left off, our program was able to read content from STDIN and identify numbers. We're now going to allow our program to also accept input in the form of a file. If we had an input file named program.rb that looked like:

4  
8  
15  
16  
23  
42  

Then we want to be able to do this:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

Changing Flex to use a file as input is surprisingly easy. Flex reads its input from a global variable called yyin. Initially yyin is set to read from STDIN. All we need to do is change it to point to our file. Let's update our main function to look like this:

int main(int argc, char *argv[]) {  
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
  }

  yylex();

  return EXIT_SUCCESS;
}

In the code above, we're checking if any arguments were provided. If an argument was provided, we'll treat that as a filename, open that file for reading, and point yyin to it. If we run make and ./ruby program.rb we will see the expected output!
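One thing we're glossing over: if fopen can't open the file (a typo in the filename, say), it returns NULL, and handing a NULL yyin to the lexer isn't going to end well. We'll keep the happy path for now, but a slightly more defensive sketch of main, assuming the same includes we already have, might look like:

int main(int argc, char *argv[]) {
  if (argc > 1) {
    yyin = fopen(argv[1], "r");

    if (!yyin) {
      /* Bail out early instead of handing flex a NULL stream. */
      fprintf(stderr, "Could not open %s\n", argv[1]);
      return EXIT_FAILURE;
    }
  }

  yylex();

  return EXIT_SUCCESS;
}

Either way, running the program gives us our numbers back.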

Notice, however, that our program prints a newline between each line of output:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

It might not be obvious where these newlines are coming from, but let's take a look at the patterns page of Flex again. If you look at what . matches, you'll notice:

‘.’ any character (byte) except newline

Each number in program.rb is followed by a newline. Flex isn't matching the newline, so it gets printed to the screen. Let's modify our rules slightly to handle this situation. We'll add the following rule above the . rule:

\n {}

This rule will match newlines and then do nothing with them. Let's run make again and then ./ruby program.rb. Voila! Much better!

$ ./ruby program.rb
NUMBER: 4  
NUMBER: 8  
NUMBER: 15  
NUMBER: 16  
NUMBER: 23  
NUMBER: 42  

Currently, we're ignoring invalid tokens. Future work will be easier if we print them out with a message instead. Let's change our . rule from:

. {}

to:

. { fprintf(stderr, "Unknown token %s\n", yytext); }

Now, if we hit a character we don't have a rule for, we'll see an error message like:

Unknown token #  

Our program currently contains both C code and Flex directives. The Flex directives define the lexer, and the C code instantiates it and runs our program. Those seem like separate concerns, so let's split this code into two separate files. First, let's change our Makefile to add a dependency for main.c:

all: ruby                   

ruby: main.c lex.yy.c  
    cc -o ruby main.c lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm -rf lex.yy.c ruby

Next, we'll create our main.c file:

#include <stdio.h>                
#include <stdlib.h>               

int main(int argc, char *argv[]) {  
  if (argc > 1) {                 
    yyin = fopen(argv[1], "r");   
  }                               
  yylex();                        

  return EXIT_SUCCESS;            
}                                 

Your ruby.l should now look like this:

%{                                        
  #include <stdio.h>
%}

%option noyywrap

%%

[0-9]+ { printf("NUMBER: %s\n", yytext); }
\n {}
. { fprintf(stderr, "Unknown token %s\n", yytext); }

Make sure to remove stdlib.h from the includes in ruby.l, since nothing there needs it anymore. You can also remove the last %% because the user-code section is gone.

It might look like we're all set, but when we try to run make we hit a small snag:

$ make
flex ruby.l  
cc -o ruby main.c lex.yy.c  
main.c:6:5: error: use of undeclared identifier 'yyin'  
    yyin = fopen(argv[1], "r");
    ^
main.c:8:3: warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
  yylex();
  ^

We're seeing this error and warning because yyin and yylex are defined inside of lex.yy.c; our main.c doesn't know anything about them.

We can easily fix this by using the extern keyword from C. Above our main function we'll add the following:

extern FILE* yyin;  
extern int yylex(void);  

extern tells the compiler that these identifiers are defined in another file, so it doesn't need to complain about them here. If we run make again, everything compiles cleanly.
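Putting it all together, main.c should now look something like:

#include <stdio.h>
#include <stdlib.h>

/* Defined in lex.yy.c, which flex generates for us. */
extern FILE* yyin;
extern int yylex(void);

int main(int argc, char *argv[]) {
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
  }

  yylex();

  return EXIT_SUCCESS;
}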

In the next post we'll start adding more rules to cover the rest of Ruby's syntax.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left it below!

Update: Read Part 3 of this series.

Reconstructing Ruby, Part 1: Our First Lexer

Please read my introduction to this series to find out more about it.

The first step in writing our implementation of Ruby is to create a lexer. What is a lexer? The job of a lexer is to read your code and identify what we call tokens. Tokens are simply named pieces of text. An example of this would be to take the program:

class Wombat  
  def run
    puts "The wombat is running"
  end
end  

and break it down into the following:

keyword("class")  
constant("Wombat")  
newline  
keyword("def")  
identifier("run")  
newline  
identifier("puts")  
string("The wombat is running")  
newline  
keyword("end")  
newline  
keyword("end")  
newline  

We'll be using a program called Flex to tokenize our code. Flex is a lexical analyzer generator: we describe the tokens we care about, and it generates a lexer for us. The lexer we create will be relatively simple at first and will only identify numbers.

Typically, a Flex file will have this structure:

definitions and directives  
%%
rules  
%%
user code  

The %% are part of Flex's syntax and separate the sections that we'll define.

Since we're writing a very simple lexer, we'll skip definitions for now. However, we'll be adding a couple of directives. Directives are specific instructions we give to Flex. The first directive lets us embed C code that is copied verbatim to the beginning of our generated lexer, which is where we'll pull in the C includes we need, like stdio.h. It looks like:

%{
  #include <stdlib.h>
  #include <stdio.h>
%}

The second directive we'll add tells Flex that we will not be implementing a yywrap function. When the lexer gets to the end of its input, it calls yywrap to check whether there is an additional input file to continue reading tokens from. If yywrap returns 0, that indicates yyin has been set up to point to a new source of input. If yywrap returns 1, we're telling Flex that there is no additional input. Because we're only reading from a single input source, we can use the noyywrap option to have Flex generate a version of that function that always returns 1:

%option noyywrap
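If we didn't use this option, the alternative would be defining the function ourselves. Just for reference, a minimal version would look something like the following; the option simply saves us this boilerplate:

int yywrap(void) {
  return 1;  /* no additional input files to read */
}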

After specifying the directives, we’ll follow up with some rules. Rules are essentially a pattern and an associated action. When Flex is parsing text, it will try to match that text against each of the patterns. When it finds a match, it will execute the corresponding action. You can find out more about what patterns Flex supports by visiting the patterns page.

The format for a single rule would look like:

PATTERN { ACTION }  

Since we're only going to recognize numbers with our lexer, we'll want to ignore any characters that aren't numbers. To do this, we'll need two rules:

[0-9]+ { printf("NUMBER: %s\n", yytext); }
. {}

The first rule references yytext. yytext is a character array that contains the text Flex has matched; in our case it will contain some number. Each rule uses regular-expression-like syntax to describe what it will match. For instance, our first rule matches one or more digits from 0 to 9. The second rule matches any single character and executes no code on a match. It's important to note that Flex always prefers the longest match, and when two rules match text of the same length, the rule listed earlier wins, which is why our catch-all second rule won't swallow any digits.

The last part of a Flex file is the user code. We'll add a simple C main function to our file. This main function calls yylex, which starts the lexer; by default, it reads its input from STDIN.

You may have noticed the extensive use of yy-prefixed names in Flex. The reason is that Flex was designed to be an open source alternative to Lex (which at the time was proprietary software). Lex interfaced with a program called Yacc, which namespaced everything with yy.

Let’s create a file called ruby.l and add the following content:

%{
  #include <stdlib.h>
  #include <stdio.h>
%}

%option noyywrap

%%
[0-9]+ { printf("NUMBER: %s\n", yytext); }
. {}
%%

int main(int argc, char *argv[]) {  
  yylex();
  return EXIT_SUCCESS;
}

Now to compile and run this do the following:

$ flex ruby.l

This will generate a single C source file called lex.yy.c. Flex took our lexer definition and compiled it into C code, which we can then compile into an executable program:

$ cc -o ruby lex.yy.c

Now we have an executable named ruby. Let's run it:

$ ./ruby

Then try the following: type 1 and press enter, then type 42 and press enter. You should see the following:

1  
NUMBER: 1  
42  
NUMBER: 42  

If you type things that aren't numbers, they'll just be ignored. You can exit our program by pressing ctrl-d. Pressing ctrl-d signals end-of-file on STDIN, causing our program to acknowledge that there is no more content to read, and thus it terminates.

Huzzah! Our initial lexer is complete! While it doesn't do too much yet, it does provide us with the foundation for our next steps. In the future, we'll add more rules that will match all of the tokens Ruby has to offer.

Before we go any further, let’s automate the compilation process of our Ruby by using the make program. make allows us to define tasks, which we can use to compile our code, in a file named Makefile. Create a file named Makefile, with a capital M, and add the following:

all: ruby

ruby: lex.yy.c  
    cc -o ruby lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm ruby lex.yy.c

All indentation in the Makefile is supposed to be a tab. Make sure you've typed a tab or the make program won't work. You'll most likely see *** missing separator. Stop. if you've used spaces instead of a tab.

Now we can just run make when we want to compile our version of ruby. We'll modify this file as our implementation of ruby grows. To verify that everything is wired up, run make clean followed by make.
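Since make echoes each command as it runs, that round trip should look roughly like:

$ make clean
rm ruby lex.yy.c
$ make
flex ruby.l
cc -o ruby lex.yy.c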

If you're having any troubles with the code you can check out the reference implementation on GitHub. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Of course, in Ruby we can evaluate files as well as reading from STDIN. In the next post we'll modify our lexer to also take files as input.

Update: Read Part 2 of this series.