Reconstructing Ruby, Part 2: Files and Errors

Please read Part 1 of this series if you haven't already.

Before we continue, it's worth mentioning that while we're writing a lexer using Flex, Ruby doesn't use Flex. Instead Ruby implements a giant function for handling all of its tokenizing. I've chosen to use Flex because I find it to be more approachable than implementing a lexer from scratch. Using Flex also requires writing far less code. If possible, I'd like to avoid writing a 1265 line long method. Also, it would take a large chunk of time we can spend implementing more interesting concepts.

There are some likely reasons why Ruby didn't use a tool like Flex and chose to implement its own lexer. Lexers are not terribly difficult to implement and using a custom lexer allows for possible optimizations that you would not be able to achieve with a general solution like Flex. Since I'm not worried about the marginal increase we'll continue to use Flex.

When we left off, our program was able to read content from STDIN and identify numbers. We're now going to allow our program to also accept input in the form of a file. If we had an input file named program.rb that looked like:

4  
8  
15  
16  
23  
42  

Then we want to be able to do this:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

Changing Flex to use a file as input is surprisingly easy. Flex reads its input from a global variable called yyin. Initially yyin is set to read from STDIN. All we need to do is change it to point to our file. Let's update our main function to look like this:

int main(int argc, char *argv[]) {  
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
  }

  yylex();

  return EXIT_SUCCESS;
}

In the code above, we're checking if any arguments were provided. If an argument was provided, we'll treat that as a filename, open that file for reading, and point yyin to it. If we run make and ./ruby program.rb we will see the expected output!

Notice, however, that our program prints a newline between each line of output:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

It might not be obvious where these newlines are coming from, but let's take a look at the patterns page of Flex again. If you look at what . matches you'll notice

‘.’ any character (byte) except newline

Each number in program.rb is followed by a newline. Flex isn't matching the newline, so it gets printed to the screen. Let's modify our rules slightly to handle this situation. We'll add the following rule above the . rule:

\n {}

This rule will match newlines and then do nothing with them. Let's make again and run ./ruby program.rb. Voila! Much better!

$ ./ruby program.rb
NUMBER: 4  
NUMBER: 8  
NUMBER: 15  
NUMBER: 16  
NUMBER: 23  
NUMBER: 42  

Currently, we're ignoring invalid tokens. Future work will be easier if we print them out with a message instead. Let's change our . rule from:

. {}

to:

. { fprintf(stderr, "Unknown token %s\n", yytext); }

Now, if we had an invalid character we don't have a rule for, we'll see an error message like:

Unknown token #  

Our program currently contains both C code and Flex directives. The Flex directives define the lexer, and the C code instantiates it and runs our program. Those seem like separate concerns, so let's split this code into two separate files. First, let's change our Makefile to add a dependency for main.c:

all: ruby                   

ruby: main.c lex.yy.c  
    cc -o ruby main.c lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm -rf lex.yy.c ruby

Next, we'll create our main.c file:

#include <stdio.h>                
#include <stdlib.h>               

int main(int argc, char *argv[]) {  
  if (argc > 1) {                 
    yyin = fopen(argv[1], "r");   
  }                               
  yylex();                        

  return EXIT_SUCCESS;            
}                                 

Your ruby.l should now look like this:

%{                                        
  #include <stdio.h>
%}

%option noyywrap

%%

[0-9]+ { printf("NUMBER: %s\n", yytext); }
\n {}
. { fprintf(stderr, "Unknown token %s\n", yytext); }

Make sure to remove stdlib.h from the includes because it's no longer a dependency in ruby.l. You can also remove the last %% in ruby.l because the user-code section is gone.

It might look like we're all set, but when we try and run make we hit a small snag:

$ make
flex ruby.l  
cc -o ruby main.c lex.yy.c  
main.c:6:5: error: use of undeclared identifier 'yyin'  
    yyin = fopen(argv[1], "r");
    ^
main.c:8:3: warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
  yylex();
  ^

We're seeing these warnings because yyin and yylex are defined inside of lex.yy.c. Our main.c doesn't know anything about them.

We can easily fix this by using the extern keyword from C. Above our main function we'll add the following:

extern FILE* yyin;  
extern int yylex(void);  

extern lets the compiler know that it doesn't need to throw any warnings because it can expect these identifiers to be defined in another file. If we run make again everything compiles cleanly.

In the next post we'll start adding more rules to cover the rest of Ruby's syntax.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left it below!

Reconstructing Ruby, Part 1: Our First Lexer

Please read my introduction to this series to find out more about it.

The first step in writing our implementation of Ruby is to create a lexer. What is a lexer? The job of a lexer is to read your code and identify what we call tokens. Tokens are simply named pieces of text. An example of this would be to take the program:

class Wombat  
  def run
    puts "The wombat is running"
  end
end  

and breaking it down into the following:

keyword("class")  
constant("Wombat")  
newline  
keyword("def")  
identifier("run")  
newline  
identifier("puts")  
string("The wombat is running")  
newline  
keyword("end")  
newline  
keyword("end")  
newline  

We’ll be using a program called Flex to tokenize our code. Flex is a lexical analyzer designed to create a lexer. The lexer we will create will be relatively simple at first and will only identify numbers.

Typically, a Flex file will have this structure:

definitions and directives  
%%
rules  
%%
user code  

The %% are part of Flex's syntax and separate the sections that we'll define.

Since we're writing a very simple lexer, we’ll ignore adding any definitions for now. However, we’ll be adding a couple of directives. Directives are specific instructions we'll give the Flex compiler. The first directive allows us to execute code before the lexer. This will allow us to load up some necessary C includes like stdio.h. This code is copied to the beginning of our generated lexer. It looks like:

%{
  #include <stdlib.h>
  #include <stdio.h>
%}

The second directive we’ll add tells Flex that we will not be implementing a yywrap method. When the lexer gets to the end of its input, it runs yywrap in order to check if there is an additional input file to continue reading tokens from. If yywrap returns 0, that indicates that yyin has been setup to point to a new set of input. If yywrap returns 1 then we're telling flex that there is no additional input. Because we're only reading from a single input source, we can use the noyywrap option to have flex generate a version of that method that will always return 1:

%option noyywrap

After specifying the directives, we’ll follow up with some rules. Rules are essentially a pattern and an associated action. When Flex is parsing text, it will try to match that text against each of the patterns. When it finds a match, it will execute the corresponding action. You can find out more about what patterns Flex supports by visiting the patterns page.

The format for a single rule would look like:

PATTERN { ACTION }  

Since we're only going recognize numbers with our lexer we’ll want to ignore any characters that aren’t numbers. To do this, we’ll need two rules:

[0-9]+ { printf("NUMBER: %s\n", yytext); }
. {}

The first rule references yytext. yytext is a character array that contains the text that Flex has matched. In our case it will be a character array containing some number. Each rule uses regular-expression-like syntax for describing what it will match. For instance, our first rule matches any number between 0 and 9 one or more times. The second rule matches any single character and executes no code on a match. It's important to note here that the patterns will be matched in order which is why our second rule won't capture any numbers.

The last part of a Flex file is the user code. We’ll add a simple C main function to our file. This main function will call yylex which triggers the initial Flex process based on what it reads from STDIN.

You may have noticed the extensive use of yy-something in Flex. The reason behind this is that Flex was designed to be an open source alternative to Lex (which at the time was proprietary software). Lex interfaced with a program called Yacc which had namespaced everything with yy.

Let’s create a file called ruby.l and add the following content:

%{
  #include <stdlib.h>
  #include <stdio.h>
%}

%option noyywrap

%%
[0-9]+ { printf("NUMBER: %s\n", yytext); }
. {}
%%

int main(int argc, char *argv[]) {  
  yylex();
  return EXIT_SUCCESS;
}

Now to compile and run this do the following:

$ flex ruby.l

This will generate a single C source file called lex.yy.c. Flex took our lexer definition and compiled it into C code, which we can then compile into an executable program:

$ cc -o ruby lex.yy.c

Now we have an executable name ruby. Let's run it:

$ ./ruby

Then try the following, type 1 and press enter, then press 42 and press enter. You should see the following:

1  
NUMBER: 1  
42  
NUMBER: 42  

If you type things that aren't numbers, they'll just be ignored. You can exit our program by pressing ctrl-d. ctrl-d flushes input on STDIN causing our program to acknowledge that there is no more content to read and thus it terminates.

Huzzah! Our initial lexer is complete! While it doesn't do too much yet, it does provide us with the foundation for our next steps. In the future, we'll add more rules that will match all of the tokens Ruby has to offer.

Before we go any further, let’s automate the compilation process of our Ruby by using the make program. make allows us to define tasks, which we can use to compile our code, in a file named Makefile. Create a file named Makefile, with a capital M, and add the following:

all: ruby

ruby: lex.yy.c  
    cc -o ruby lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm ruby lex.yy.c

All indentation in the Makefile is supposed to be a tab. Make sure you've typed a tab or the make program won't work. You'll most likely see *** missing separator. Stop. if you've used spaces instead of a tab.

Now we can just run make when we want to compile our version of ruby. To verify this, run make clean followed by make. We’ll modify this file as our implementation of ruby grows.

If you're having any troubles with the code you can check out the reference implementation on GitHub. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Of course, in Ruby we can evaluate files as well as reading from STDIN. In the next post we'll modify our lexer to also take files as input.

Update: Read Part 2 of this series.

Reconstructing Ruby: An Introduction

Over the past few years, I've been interested in the design of programming languages. I paired up with Caleb Thompson on a very opinionated programming language not too long ago. We wrote the whole language in Ruby and it was mostly an excercise in design and understanding how languages are constructed. We called the language "The Esoteric Order of Dagon", or "Dagon" for short, and you can see our work on GitHub.

The culmination of that project and reading Ruby Under a Microscope paved the way for me wanting to try to implement Ruby, from scratch, using C. This presented some unique challenges as I don't have an extensive background in C, but what better way to learn it than to challenge myself with something exceedingly harder than my usual toy applications.

I initially started this process by trying to write a book, but I had never done that either. So on top of trying to learn C and the art of writing a programming language, I was also trying to learn LaTeX and how to structure a technical book. This was a huge mistake, nothing slows your progress more than having a ton of things you have to learn all at once.

I've decided to slow my ambitions and focus on the important part of my adventure. Instead of writing a book, I've decided to break down the chapters I've already written into a series of blog posts describing the steps I've taken, but at the same time opening my work up to the community to get feedback and ideas. Afterwards I'll continue my work until I have a bare-bones implementation of Ruby's core (probably sans Threads, but who knows).

My plan is to release the first post today and follow up with a new post every few weeks for as long as I and the community remains interested.

Fair warning: I'm not an artist so it's likely I won't have many images to clarify things.

Update: Read Part 1 Our First Lexer

Error-Driven Development

UPDATE: Here is a blog post where one of the metis students talks about their experiences with Error-Driven Development

I'm currently teaching the second cohort at metis, a Ruby on Rails bootcamp, and a technique I've been using to teach students Rails is something I've dubbed Error Driven Development.

The main idea behind Error-Driven Development is that since Rails provides such wonderful error pages, we can teach students how to use Rails by getting them to generate the simplest error to push their applications forward. This allows Rails to act as a guiding force to help students understand what Rails does and what the next step is.

An example of this would be to create a page that shows a list of galleries. The first step is to have the students visit "/galleries" which throws a routing error. After adding the route, Rails will complain that no GalleriesController exists. After adding the controller, Rails will complain that there is no index action. Finally, after adding the action, Rails will complain that there is no index.html.erb template.

One of the main benefits I've found to using this technique, is that it closely mirrors the process of Test-Driven Development where the developer is trying to create a failing test case and then performs the minimal amount of work to make the test pass before moving on to the next failure. I've found that by teaching Rails in this way we get students already thinking in a test-first kind of mentality which allows them to more quickly pick up Test-Driven Development on their own outside of the bootcamp.

As an instructor, seeing this result has been extremely gratifying. I remember when I first started learning Test-Driven Development using cucumber around 5 years ago that I found it very difficult. It was hard to undo my process for writing software that I had created over the prior 12 years before I was introduced to Test-Driven Development. By introducing a very similar process for those first entering into the world of programming we created a kind of cushion for making the transition much easier than the one I faced.

Outside of learning how to program, I wouldn't suggest Error-Driven Development as a means of writing actual software. I'd much rather write software via Test-Driven Development, since it will provide the benefit of possibly preventing the breaking of my application, but I've definitely added it to my toolbox for teaching others how to write software.