Reconstructing Ruby, Part 2: Files and Errors

Please read Part 1 of this series if you haven't already.

Before we continue, it's worth mentioning that while we're writing our lexer with Flex, Ruby itself doesn't use Flex. Instead, Ruby implements one giant function that handles all of its tokenizing. I've chosen Flex because I find it more approachable than implementing a lexer from scratch, and it requires far less code. If possible, I'd like to avoid writing a 1265-line method. Doing so would also eat a large chunk of time that we could spend implementing more interesting concepts.

There are some likely reasons why Ruby implements its own lexer instead of using a tool like Flex. Lexers are not terribly difficult to implement, and a custom lexer allows optimizations that a general solution like Flex can't offer. Since I'm not worried about that marginal performance gain, we'll continue to use Flex.

When we left off, our program was able to read content from STDIN and identify numbers. We're now going to allow our program to also accept input in the form of a file. If we had an input file named program.rb that looked like:

4  
8  
15  
16  
23  
42  

Then we want to be able to do this:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

Changing Flex to use a file as input is surprisingly easy. Flex reads its input from a global variable called yyin. Initially yyin is set to read from STDIN. All we need to do is change it to point to our file. Let's update our main function to look like this:

int main(int argc, char *argv[]) {  
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
  }

  yylex();

  return EXIT_SUCCESS;
}

In the code above, we're checking if any arguments were provided. If an argument was provided, we'll treat that as a filename, open that file for reading, and point yyin to it. If we run make and ./ruby program.rb we will see the expected output!
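One thing the main function above doesn't handle is a filename that can't be opened: fopen returns NULL in that case. Here is a minimal sketch of how the check might look. The helper name open_input is my own invention for illustration and is not part of the code in this series.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the series' code): choose the input
   stream for the lexer. Returns stdin when no filename was given, the
   opened file when one was, or NULL if the file could not be opened. */
FILE *open_input(int argc, char *argv[]) {
  if (argc <= 1) {
    return stdin;
  }

  FILE *file = fopen(argv[1], "r");
  if (file == NULL) {
    /* Print the filename followed by the system's reason for the failure. */
    perror(argv[1]);
  }
  return file;
}
```

With a helper like this, main could assign the result to yyin and return EXIT_FAILURE when it is NULL instead of calling yylex().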

Notice, however, that our program prints a newline between each line of output:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

It might not be obvious where these newlines are coming from, so let's take another look at the patterns page of the Flex documentation. Here is what it says . matches:

‘.’ any character (byte) except newline

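Each number in program.rb is followed by a newline. None of our rules match the newline, so Flex falls back to its default rule, which copies unmatched input straight to the output. That's why the newline gets printed to the screen. Let's modify our rules slightly to handle this situation. We'll add the following rule above the . rule: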

\n {}

This rule will match newlines and then do nothing with them. Let's run make again and then ./ruby program.rb. Voilà! Much better!

$ ./ruby program.rb
NUMBER: 4  
NUMBER: 8  
NUMBER: 15  
NUMBER: 16  
NUMBER: 23  
NUMBER: 42  

Currently, we're silently ignoring invalid tokens. Future work will be easier if we print them out with a message instead. Let's change our . rule from:

. {}

to:

. { fprintf(stderr, "Unknown token %s\n", yytext); }

Now, if the input contains a character we don't have a rule for, we'll see an error message like:

Unknown token #  
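To make the behavior of our three rules concrete, here is a rough hand-written sketch of the same logic in plain C. This is not what Flex generates and it is not how Ruby's lexer works. The function name scan_tokens and the idea of collecting messages in a buffer are assumptions made purely for illustration. Unlike our Flex rules, which send numbers to stdout and errors to stderr, this sketch puts every message in the same buffer.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical hand-written equivalent of the three Flex rules.
   Appends one message per token to out (a buffer of out_size bytes, which
   must be at least 1) and returns how many unknown characters were seen. */
int scan_tokens(const char *src, char *out, size_t out_size) {
  size_t used = 0;
  size_t i = 0;
  int unknown = 0;

  out[0] = '\0';
  while (src[i] != '\0') {
    int written;

    if (isdigit((unsigned char)src[i])) {
      /* [0-9]+ : consume the longest run of digits */
      size_t start = i;
      while (isdigit((unsigned char)src[i])) {
        i++;
      }
      written = snprintf(out + used, out_size - used, "NUMBER: %.*s\n",
                         (int)(i - start), src + start);
    } else if (src[i] == '\n') {
      /* \n : match the newline and do nothing with it */
      i++;
      continue;
    } else {
      /* . : report any other single character */
      written = snprintf(out + used, out_size - used,
                         "Unknown token %c\n", src[i]);
      unknown++;
      i++;
    }

    if (written < 0 || (size_t)written >= out_size - used) {
      break; /* buffer full: stop rather than overflow */
    }
    used += (size_t)written;
  }
  return unknown;
}
```

Even for three rules this is noticeably more code than the Flex version, which is part of why we're sticking with Flex.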

Our program currently contains both C code and Flex directives. The Flex directives define the lexer, and the C code instantiates it and runs our program. Those seem like separate concerns, so let's split this code into two separate files. First, let's change our Makefile to add a dependency on main.c (remember that make requires each recipe line to start with a tab character, not spaces):

all: ruby                   

ruby: main.c lex.yy.c  
    cc -o ruby main.c lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm -rf lex.yy.c ruby

Next, we'll create our main.c file:

#include <stdio.h>                
#include <stdlib.h>               

int main(int argc, char *argv[]) {  
  if (argc > 1) {                 
    yyin = fopen(argv[1], "r");   
  }                               
  yylex();                        

  return EXIT_SUCCESS;            
}                                 

Your ruby.l should now look like this:

%{                                        
  #include <stdio.h>
%}

%option noyywrap

%%

[0-9]+ { printf("NUMBER: %s\n", yytext); }
\n {}
. { fprintf(stderr, "Unknown token %s\n", yytext); }

Make sure to remove stdlib.h from the includes because it's no longer a dependency in ruby.l. You can also remove the last %% in ruby.l because the user-code section is gone.

It might look like we're all set, but when we try to run make we hit a small snag:

$ make
flex ruby.l  
cc -o ruby main.c lex.yy.c  
main.c:6:5: error: use of undeclared identifier 'yyin'  
    yyin = fopen(argv[1], "r");
    ^
main.c:8:3: warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
  yylex();
  ^

We're seeing this error and this warning because yyin and yylex are defined inside of lex.yy.c. Our main.c doesn't know anything about them.

We can easily fix this by using the extern keyword from C. Above our main function we'll add the following:

extern FILE* yyin;  
extern int yylex(void);  

extern tells the compiler that these identifiers are defined in another file, so it can compile main.c without complaining and leave it to the linker to connect them to the definitions in lex.yy.c. If we run make again, everything compiles cleanly.
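If the difference between declaring and defining is new to you, here is a small single-file sketch of the same idea. The names demo_in, demo_lex, and run_demo are hypothetical stand-ins for yyin and yylex. In our real program the definitions at the bottom live in lex.yy.c rather than in the same file.

```c
#include <stdio.h>

/* Declarations, as they would appear at the top of main.c. They promise
   the compiler that these names are defined somewhere. */
extern FILE *demo_in;
extern int demo_lex(void);

/* Code that uses the names before any definition is visible. */
int run_demo(void) {
  if (demo_in == NULL) {
    demo_in = stdin;
  }
  return demo_lex();
}

/* Definitions: hypothetical stand-ins for what lex.yy.c provides for
   yyin and yylex. In the real program these live in a separate file. */
FILE *demo_in = NULL;

int demo_lex(void) {
  return 0;
}
```

The compiler is happy with run_demo because the declarations tell it the types involved. Finding the actual definitions is a separate step.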

In the next post we'll start adding more rules to cover the rest of Ruby's syntax.

If you're having any problems, you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback, I'd greatly appreciate it if you left them below!

Update: Read Part 3 of this series.