Reconstructing Ruby, Part 4: Our First Parser

Read Part 3 in case you missed it.

Now that our lexer is in pretty good shape, we can start working on our parser. We'll be using Bison. Bison works very similarly to Flex, but instead of generating a lexer, it generates a parser.

Bison is an improvement on the program Yacc. It's an LALR parser generator. LALR stands for Look-Ahead LR, where LR in turn means a Left-to-right scan producing a Rightmost derivation (in reverse), which is a mouthful. I'm not even going to try to explain the theory here.

In programming, a grammar describes which tokens are allowed to appear next to each other, as well as which sequences of tokens, taken together, represent a higher-level concept.

First let's describe a program. In Ruby, a program is just a list of expressions. In programming terms, an expression is a piece of code that returns a value. In Ruby, pretty much everything returns a value, so we can consider everything an expression.

Other languages have an alternative to expressions called statements. Statements are like expressions but do not have a return value. Ruby technically has no statements, as everything returns a value.
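We can see this directly in Ruby: even an if is an expression that produces a value, so it can sit on the right-hand side of an assignment:

```ruby
# In Ruby, control structures are expressions: the `if` below evaluates
# to one of its branch values, which we can assign to a variable.
x = if 1 > 0
      "positive"
    else
      "non-positive"
    end

puts x # => positive
```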

To start our grammar, we'll say a program is just a list of expressions:

program: expressions  

program and expressions are both known as non-terminals. You can think of a non-terminal as "not a token". The tokens on the other hand we'll refer to as terminals.

The main idea here is that every non-terminal must eventually resolve to terminals. That makes sense: our program is built entirely out of tokens, so eventually we need to be talking about those tokens. One way to generalize a parser is to say it takes an array of tokens and builds from them a tree-like structure known as an AST, or Abstract Syntax Tree. You can find out more about ASTs on this Wikipedia page.
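As a rough sketch of the idea in Ruby (the Node struct and the token names here are made up for illustration; they're not part of what we'll build with Bison):

```ruby
# A parser's job, pictured in miniature: a flat token stream goes in,
# a tree of nodes (the AST) comes out. Imagine the input "1 + 3".
Node = Struct.new(:type, :children)

# The lexer's output: a flat list of [type, value] token pairs.
tokens = [[:tNUMBER, 1], [:tPLUS, nil], [:tNUMBER, 3]]

# The parser's output: an addition node with two number leaves.
ast = Node.new(:add, [Node.new(:number, [1]), Node.new(:number, [3])])

puts ast.type          # => add
puts ast.children.size # => 2
```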

An interesting part of defining a grammar is dealing with recursion. For instance, we've described a program as a list of expressions, but what is a list of expressions? The way we'd describe this in Bison looks something like this:

expressions: expressions expression  
           | expression

The | is kind of like an or. So in this case, expressions is either a single expression, or something previously determined to be expressions followed by another expression. Here's how this gets simplified. Imagine we have 3 expressions in a row (I've added numbers to the end for clarity):

expression#1 expression#2 expression#3

When we parse this we're always trying to work our way back to our starting non-terminal, program, so we simplify the first expression to expressions:

expressions expression#2 expression#3

Since expressions expression is an acceptable form of expressions, we can collapse them together:

expressions expression#3

Then we have a similar situation where we can simplify again to just:

expressions

Now, since we can't simplify any further, expressions becomes program:

program
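The whole sequence of reductions can be sketched as a toy Ruby loop (purely an illustration of the steps above, not how Bison works internally):

```ruby
# Repeatedly apply our two rules until nothing more can be simplified:
#   expression             -> expressions
#   expressions expression -> expressions
# If we end with a lone `expressions`, it becomes `program`.
def reduce(symbols)
  loop do
    if symbols.first == :expression
      symbols[0] = :expressions
    elsif symbols[0, 2] == [:expressions, :expression]
      symbols[0, 2] = [:expressions]
    else
      break
    end
  end
  symbols == [:expressions] ? :program : symbols
end

p reduce([:expression, :expression, :expression]) # => :program
```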

Bison files end with a .y extension, a holdover from Yacc. The format is very similar to that of our Flex file. The Bison manual defines it as:

%{
  Prologue
%}

Bison declarations

%%
Grammar rules  
%%

Epilogue  

To start our grammar, let's work downwards from the program definition. Make a file called parse.y and put the following code into it:

%{
    #include <stdio.h>
%}

%token tNUMBER
%start program

%%

program: expressions

expressions: expressions expression  
           | expression

expression: tNUMBER { printf("PARSED(%d)\n", $1); }  

Here we've defined a token tNUMBER, which is a terminal. A common idiom is to prefix your tokens with the letter t. If a tNUMBER token is matched, the action prints $1, which refers back to tNUMBER's value because it's in the first position. If we had a larger rule like expressions expression, then the value for expressions would be stored in $1 and the value for expression in $2, according to their positions. We'll see more $ variables as our patterns get larger.

We've also defined the non-terminals program, expressions and expression. All of these non-terminals will eventually be expressible in tokens. A program contains expressions, which is a list of expression, each of which must be a tNUMBER.

Let's modify our Makefile to compile this now:

SRC=main.c parse.tab.c lex.yy.c

all: ruby

ruby: ${SRC}
	cc -o ruby ${SRC}

lex.yy.c: ruby.l
	flex ruby.l

parse.tab.c: parse.y
	bison -d parse.y

clean:
	rm -rf ruby lex.yy.c parse.tab.c parse.tab.h

We've introduced SRC to make adding new files easier. Now, as the list of files we need to compile grows, we can specify them in one location.

The -d flag for Bison produces the header file parse.tab.h. This header includes definitions for our tokens, like tNUMBER, which is useful for our lexer since all of the tokens are defined in our parser.

If you run make you'll probably get this noise:

$ make
bison -d parse.y  
flex ruby.l  
cc -o ruby main.c lex.yy.c parse.tab.c  
parse.tab.c:1257:16: warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
      yychar = YYLEX;
               ^
parse.tab.c:601:16: note: expanded from macro 'YYLEX'  
# define YYLEX yylex ()
               ^
parse.tab.c:1385:7: warning: implicit declaration of function 'yyerror' is invalid in C99 [-Wimplicit-function-declaration]  
      yyerror (YY_("syntax error"));
      ^
2 warnings generated.  
Undefined symbols for architecture x86_64:  
  "_yyerror", referenced from:
      _yyparse in parse-d9f59c.o
ld: symbol(s) not found for architecture x86_64  
clang: error: linker command failed with exit code 1 (use -v to see invocation)  
make: *** [ruby] Error 1  

This is happening because Bison expects two functions to be defined: yylex and yyerror. We'll make use of extern again to reference yylex, and we'll define a yyerror ourselves. At the top of our parse.y file add the following:

%{
  #include <stdio.h>
  extern int yylex(void);
  void yyerror(char const *s) { fprintf(stderr, "%s\n", s); }
%}

Now we'll need to have our main.c file call yyparse instead of yylex. At the top of our main.c add:

#include "parse.tab.h"

Remove:

extern int yylex(void);  

And replace the call to yylex with yyparse. If we change program.rb to contain just the number 1, then compile and run our program, we'll see:

$ ./ruby program.rb
NUMBER(1)  
syntax error  

Oops, we'll need to change our lexer to return the correct token. Open ruby.l and replace:

{NUMBER} { VTYPE("NUMBER", yytext); }

with:

{NUMBER} { yylval = atoi(yytext); return tNUMBER; }

yylval holds the value of our token, which is currently the integer value of yytext thanks to atoi. Next we return the token type tNUMBER.
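Ruby's to_i gives a rough feel for what atoi does here, with the caveat that to_i is more lenient (it understands the underscore separators we'll meet later, where atoi would stop):

```ruby
# to_i, like atoi, reads leading digits and ignores trailing junk.
p "42".to_i   # => 42
p "5abc".to_i # => 5
# Unlike atoi, Ruby's to_i also skips underscores between digits:
p "5_1".to_i  # => 51 (C's atoi("5_1") would return 5)
```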

The last change we need to make to ruby.l is to add:

#include "parse.tab.h"

Right under:

#include <stdio.h>

This lets our lexer know about the tokens we've defined in our parser. When we compile and run our program now we should see:

./ruby program.rb
PARSED(1)  

From this point we're ready to expand our grammar, define more tokens, and refine our lexer to utilize them. Let's start simple. Let's change program.rb to contain:

1 + 3  

And when we run the program we want the output to be 4. Running the program currently outputs:

PARSED(1)  
PLUS  
PARSED(3)  

Notice that while PLUS is printed it wasn't parsed. It was instead printed from our lexer. We'll fix that by adding a token for PLUS that we can return from our lexer. Add:

%token tPLUS

Below the tNUMBER token. Then change:

"+" { TYPE("PLUS"); }  

to:

"+" { return tPLUS; }  

If you run make and run the program now, we'll end up with a syntax error. The reason is that our grammar doesn't yet describe a valid "sentence" with plus in it. So let's extend our expression grammar a little. We're going to change:

expression: tNUMBER { printf("PARSED(%d)\n", $1); }  

to:

expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }

Running the program now produces the output 4. When the parser encounters a tNUMBER, it simply passes that value along (currently an integer). So when it sees an expression of the form expression tPLUS expression, that's the same as 1 tPLUS 3, in which case $1 = 1 and $3 = 3. If our program.rb contained 2 + 4, then $1 = 2 and $3 = 4, and the output would be 6.
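The computation the semantic action performs can be mimicked in a few lines of Ruby (a toy stand-in for the parser, handling only the `number + number` form):

```ruby
# Split "a + b" into the two values that would land in $1 and $3,
# then add them, just like `printf("%d\n", $1 + $3)` does.
def eval_plus(source)
  left, right = source.split("+").map { |part| Integer(part.strip) }
  left + right
end

puts eval_plus("1 + 3") # => 4
puts eval_plus("2 + 4") # => 6
```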

At this point we're ready to extend our grammar to cover more of the Ruby language. In the next post we'll define some more complex expressions in our grammar so that:

# The Greeter class
class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

# Create a new object
g = Greeter.new("world")

# Output "Hello World!"
g.salute  

Will be considered a valid program by our language.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 3: More Tokens

Please check out Part 2 if you missed it!

In this post we're going to be adding a lot of tokens to our lexer. Most of these tokens will be straightforward, but a few will be rather interesting. Let's start by making it easier for us to print tokens. Add the following C macros in the directives section of our Flex file, right below #include <stdio.h>:

#define VTYPE(type, value) printf("%s(%s)\n", type, value)
#define TYPE(type) printf("%s\n", type)

We're using VTYPE to mean a token type that has a value (e.g. NUMBER(4), STRING("Hello Wombat"), ID(wombat)). A token using TYPE will not have an associated value (e.g. =). Let's update our number rule to make use of VTYPE:

[0-9]+ { VTYPE("NUMBER", yytext); }

Compiling and running our program should produce:

$ ./ruby program.rb 
NUMBER(4)  
NUMBER(8)  
NUMBER(15)  
NUMBER(16)  
NUMBER(23)  
NUMBER(42)  

Let's change our program.rb file to contain:

x = 5  

Running our program produces:

$ ./ruby program.rb 
Unknown token x  
Unknown token  
Unknown token =  
Unknown token  
NUMBER(5)  

Since spaces are a bit difficult to see, let's make one small modification to our Unknown token string:

. { fprintf(stderr, "Unknown token '%s'\n", yytext); }

Now when we run our program we get:

$ ./ruby program.rb 
Unknown token 'x'  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  

Let's fix each of these unknown tokens in order. First, in Ruby, x is what we call an identifier. An identifier is something Ruby needs to look up in order to determine whether it's a local variable, a method, or undefined. An identifier in Ruby starts with a lowercase letter or an underscore, and the rest of the name can contain lowercase and uppercase letters, digits, and underscores. We'll write a rule to capture this:

[a-z_][a-zA-Z0-9_]* { VTYPE("ID", yytext); }

Running the program again we get:

$ ./ruby program.rb 
ID(x)  
Unknown token ' '  
Unknown token '='  
Unknown token ' '  
NUMBER(5)  
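The identifier pattern is easy to sanity-check from Ruby itself with an equivalent regular expression (a throwaway check, not part of our lexer):

```ruby
# Same pattern as the Flex rule: a lowercase letter or underscore,
# followed by letters, digits, or underscores.
ID = /\A[a-z_][a-zA-Z0-9_]*\z/

p "x".match?(ID)         # => true
p "_wombat42".match?(ID) # => true
p "Greeter".match?(ID)   # => false (starts with a capital letter)
p "9lives".match?(ID)    # => false (can't start with a digit)
```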

We can handle whitespace with:

[\t ] {}

Lastly we'll add a rule for the =:

"=" { TYPE("EQUAL"); }  

Now we get:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5)  

Fantastic! Our lexer recognizes our entire program! Let's add some additional TYPE token rules:

">" { TYPE("GT"); }  
"<" { TYPE("LT"); }  
">=" { TYPE("GTE"); }  
"<=" { TYPE("LTE"); }  
"!=" { TYPE("NEQUAL"); }  
"+" { TYPE("PLUS"); }  
"-" { TYPE("MINUS"); }  
"*" { TYPE("MULT"); }  
"/" { TYPE("DIV"); }  
"%" { TYPE("MOD"); }  
"!" { TYPE("EMARK"); }  
"?" { TYPE("QMARK"); }  
"&" { TYPE("AND"); }  
"|" { TYPE("OR"); }  
"[" { TYPE("LSBRACE"); }  
"]" { TYPE("RSBRACE"); }  
"(" { TYPE("LPAREN"); }  
")" { TYPE("RPAREN"); }  
"{" { TYPE("LBRACE"); }  
"}" { TYPE("RBRACE"); }  
"@" { TYPE("AT"); }  
"." { TYPE("DOT"); }  
"," { TYPE("COMMA"); }  
":" { TYPE("COLON"); }  

Now that we have a bunch of TYPE tokens, let's write some VTYPE token rules. We'll start with floats. Floats are interesting because they have multiple forms. You can see here how Ruby determines if a number is a float (pardon the C-explosion):

for (;;) {  
  switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
      nondigit = 0;
      tokadd(c);
      break;

    case '.':
      if (nondigit) goto trailing_uc;
      if (seen_point || seen_e) {
        goto decode_num;
      } else {
        int c0 = nextc();
        if (c0 == -1 || !ISDIGIT(c0)) {
          pushback(c0);
          goto decode_num;
        }
        c = c0;
      }
      seen_point = toklen();
      tokadd('.');
      tokadd(c);
      is_float++;
      nondigit = 0;
      break;

    case 'e':
    case 'E':
      if (nondigit) {
        pushback(c);
        c = nondigit;
        goto decode_num;
      }
      if (seen_e) {
        goto decode_num;
      }
      nondigit = c;
      c = nextc();
      if (c != '-' && c != '+' && !ISDIGIT(c)) {
        pushback(c);
        nondigit = 0;
        goto decode_num;
      }
      tokadd(nondigit);
      seen_e++;
      is_float++;
      tokadd(c);
      nondigit = (c == '-' || c == '+') ? c : 0;
      break;

    case '_':   /* `_' in number just ignored */
      if (nondigit) goto decode_num;
      nondigit = c;
      break;

    default:
      goto decode_num;
  }
}

There are a couple of interesting things happening in that function. First off, if we see a . or an e or E, we know we're talking about a float. Ruby designates this with is_float++ (since 0 is false in C). If Ruby sees an e or E, it reads the next character to see if it's a +, a -, or a digit. If it's none of those, we decode the number (which will generate an error at that point, since the e-form needs a number afterwards).

There's a case we're not handling for regular numbers. In Ruby, a number can have _ in it. Ruby ignores this underscore when parsing numbers (but only a single _ at a time; while 1_2 is a valid number, 1__2 is not). Before we get to floats, let's modify our definition of number to handle the case with underscores. If we change program.rb to say:

x = 5_1  

Then our output is:

$ ./ruby program.rb 
ID(x)  
EQUAL  
NUMBER(5)  
ID(_1)  

We can fix this by modifying our rule for NUMBER:

[0-9](_[0-9]|[0-9])* { VTYPE("NUMBER", yytext); }

Now we'll get the output:

$ ./ruby program.rb
ID(x)  
EQUAL  
NUMBER(5_1)  

Great! It would also be nice if we could get rid of the underscore in our number's value, but we'll handle that in the future.
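As a preview of that future step, stripping the underscores before converting is all it takes (shown in Ruby here; our lexer will eventually need the equivalent in C):

```ruby
# Ruby itself ignores underscores in numeric literals; deleting them
# before converting mimics that behavior.
p "5_1".delete("_").to_i # => 51
p 5_1                    # => 51 (Ruby's own literal syntax agrees)
```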

Moving forward, having a way to reference our NUMBER pattern will be useful. We'll want to make use of it in our FLOAT rule, so we're going to pull the pattern out into a definition. Definitions are reusable patterns that we can reference in Flex by wrapping their name in braces, like {DEFINITION}. Definitions go at the top of our Flex file along with any directives. So right below:

%option noyywrap 

we'll add:

NUMBER [0-9](_[0-9]|[0-9])*  

Here we've created a definition named NUMBER that we can reference in our rules. Let's change our current number rule to use the definition we wrote:

{NUMBER} { VTYPE("NUMBER", yytext); }

Now that we've pulled out a definition for numbers, we can use it to write a rule for handling floats. First, let's add all the forms of floats we can come up with to our program.rb:

1.0  
1.01  
1_0.0  
1_0.0_1  
1e1  
1.0e1  
1.0e+1  
1.0e-1  
1.0E1  
1.0E+1  
1.0E-1  
1_1_1.0_1_4E+1_4_0  
10  
1_0  

This should give us a fairly good indication of whether our float rule covers all of our cases. A couple of regular numbers were added to make sure we're still capturing regular numbers correctly. Let's add the following rule right above our number rule:

{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { VTYPE("FLOAT", yytext); }

Let's run that against our program:

$ ./ruby program.rb
FLOAT(1.0)  
FLOAT(1.01)  
FLOAT(1_0.0)  
FLOAT(1_0.0_1)  
FLOAT(1e1)  
FLOAT(1.0e1)  
FLOAT(1.0e+1)  
FLOAT(1.0e-1)  
FLOAT(1.0E1)  
FLOAT(1.0E+1)  
FLOAT(1.0E-1)  
FLOAT(1_1_1.0_1_4E+1_4_0)  
NUMBER(10)  
NUMBER(1_0)  

Great! It looks like everything is in order for floats. Next we can add the rules for strings. Let's start with the rule:

\"[^"]*\" { VTYPE("STRING", yytext); }

This rule says we're looking for a double quote, followed by zero or more characters that are not a double quote, followed by a double quote. That will work for strings like "Hello World", but what about a string like "Hello \"World\""? The general rule about using \ as an escape is that any character can follow it. So let's modify our rule just a little to reflect this:

\"([^"]|\\.)*\" { VTYPE("STRING", yytext); }

Now we're saying that a string contains any character that is not a double quote, or any character preceded by a \. The last thing we'll need to do is repeat the same thing for single quotes:

\'([^']|\\.)*\' { VTYPE("STRING", yytext); }

In the future, we'll need to take into account the fact that single-quoted strings are not interpolated while double-quoted strings are. Additionally, there's a bunch of string madness around %, but we're going to ignore that for now.
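The double-quote rule also translates directly into a Ruby regular expression for a quick sanity check (the examples use single-quoted Ruby strings so the backslashes stay literal):

```ruby
# Same shape as the Flex rule: quote, then non-quote characters or
# backslash-escaped characters, then a closing quote.
DSTRING = /\A"([^"]|\\.)*"\z/

p '"Hello World"'.match?(DSTRING)     # => true
p '"Hello \"World\""'.match?(DSTRING) # => true
p '"unterminated'.match?(DSTRING)     # => false
```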

Let's move on to constants. Constants are just like our ids except they'll start with a capital letter:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); }

Alright, now that we have our rules, let's try them out against a large ruby program. Let's change our program.rb to the following:

class Greeter  
  def initialize(name)
    @name = name.capitalize
  end

  def salute
    puts "Hello #{@name}!"
  end
end

g = Greeter.new("world")

g.salute  

If we run our lexer against it we get the following:

./ruby program.rb 
ID(class)  
CONSTANT(Greeter)  
ID(def)  
ID(initialize)  
LPAREN  
ID(name)  
RPAREN  
AT  
ID(name)  
EQUAL  
ID(name)  
DOT  
ID(capitalize)  
ID(end)  
ID(def)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(end)  
ID(end)  
ID(g)  
EQUAL  
CONSTANT(Greeter)  
DOT  
ID(new)  
LPAREN  
STRING("world")  
RPAREN  
ID(g)  
DOT  
ID(salute)  

Let's compare that to ripper's output:

irb  
> require 'pp'
=> true
> require 'ripper'
=> true
> pp Ripper.lex(File.read("program.rb"))
[[[1, 0], :on_kw, "class"],
 [[1, 5], :on_sp, " "],
 [[1, 6], :on_const, "Greeter"],
 [[1, 13], :on_nl, "\n"],
 [[2, 0], :on_sp, "  "],
 [[2, 2], :on_kw, "def"],
 [[2, 5], :on_sp, " "],
 [[2, 6], :on_ident, "initialize"],
 [[2, 16], :on_lparen, "("],
 [[2, 17], :on_ident, "name"],
 [[2, 21], :on_rparen, ")"],
 [[2, 22], :on_ignored_nl, "\n"],
 [[3, 0], :on_sp, "    "],
 [[3, 4], :on_ivar, "@name"],
 [[3, 9], :on_sp, " "],
 [[3, 10], :on_op, "="],
 [[3, 11], :on_sp, " "],
 [[3, 12], :on_ident, "name"],
 [[3, 16], :on_period, "."],
 [[3, 17], :on_ident, "capitalize"],
 [[3, 27], :on_nl, "\n"],
 [[4, 0], :on_sp, "  "],
 [[4, 2], :on_kw, "end"],
 [[4, 5], :on_nl, "\n"],
 [[5, 0], :on_ignored_nl, "\n"],
 [[6, 0], :on_sp, "  "],
 [[6, 2], :on_kw, "def"],
 [[6, 5], :on_sp, " "],
 [[6, 6], :on_ident, "salute"],
 [[6, 12], :on_nl, "\n"],
 [[7, 0], :on_sp, "    "],
 [[7, 4], :on_ident, "puts"],
 [[7, 8], :on_sp, " "],
 [[7, 9], :on_tstring_beg, "\""],
 [[7, 10], :on_tstring_content, "Hello "],
 [[7, 16], :on_embexpr_beg, "\#{"],
 [[7, 18], :on_ivar, "@name"],
 [[7, 23], :on_embexpr_end, "}"],
 [[7, 24], :on_tstring_content, "!"],
 [[7, 25], :on_tstring_end, "\""],
 [[7, 26], :on_nl, "\n"],
 [[8, 0], :on_sp, "  "],
 [[8, 2], :on_kw, "end"],
 [[8, 5], :on_nl, "\n"],
 [[9, 0], :on_kw, "end"],
 [[9, 3], :on_nl, "\n"],
 [[10, 0], :on_ignored_nl, "\n"],
 [[11, 0], :on_ident, "g"],
 [[11, 1], :on_sp, " "],
 [[11, 2], :on_op, "="],
 [[11, 3], :on_sp, " "],
 [[11, 4], :on_const, "Greeter"],
 [[11, 11], :on_period, "."],
 [[11, 12], :on_ident, "new"],
 [[11, 15], :on_lparen, "("],
 [[11, 16], :on_tstring_beg, "\""],
 [[11, 17], :on_tstring_content, "world"],
 [[11, 22], :on_tstring_end, "\""],
 [[11, 23], :on_rparen, ")"],
 [[11, 24], :on_nl, "\n"],
 [[12, 0], :on_ignored_nl, "\n"],
 [[13, 0], :on_ident, "g"],
 [[13, 1], :on_period, "."],
 [[13, 2], :on_ident, "salute"],
 [[13, 8], :on_nl, "\n"]]

Ripper is part of Ruby's stdlib, or standard library, and it provides a lot of insight into how Ruby processes our code. While there are some large differences in how strings, keywords, and ivars are handled, we'll cross those bridges when we get to them.

One last rule I'd like to add is for handling comments. Let's add this to the top of our rules:

#.*$ {}

Now comments will be ignored.

From here, our lexer is in a great state for us to start on our parser. In the next post we'll begin building out our grammar.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 4 of this series.

Reconstructing Ruby, Part 2: Files and Errors

Please read Part 1 of this series if you haven't already.

Before we continue, it's worth mentioning that while we're writing a lexer using Flex, Ruby doesn't use Flex. Instead, Ruby implements a giant function that handles all of its tokenizing. I've chosen to use Flex because I find it more approachable than implementing a lexer from scratch, and it requires writing far less code. If possible, I'd like to avoid writing a 1265-line-long method. It would also take a large chunk of time that we could spend implementing more interesting concepts.

There are some likely reasons why Ruby didn't use a tool like Flex and chose to implement its own lexer: lexers are not terribly difficult to implement, and a custom lexer allows for optimizations that a general solution like Flex can't achieve. Since I'm not worried about that marginal performance gain, we'll continue to use Flex.

When we left off, our program was able to read content from STDIN and identify numbers. We're now going to allow our program to also accept input in the form of a file. If we had an input file named program.rb that looked like:

4  
8  
15  
16  
23  
42  

Then we want to be able to do this:

$ ./ruby program.rb
NUMBER: 4
NUMBER: 8
NUMBER: 15
NUMBER: 16
NUMBER: 23
NUMBER: 42

Changing Flex to use a file as input is surprisingly easy. Flex reads its input from a global variable called yyin. Initially yyin is set to read from STDIN. All we need to do is change it to point to our file. Let's update our main function to look like this:

int main(int argc, char *argv[]) {  
  if (argc > 1) {
    yyin = fopen(argv[1], "r");
  }

  yylex();

  return EXIT_SUCCESS;
}

In the code above, we're checking if any arguments were provided. If an argument was provided, we'll treat that as a filename, open that file for reading, and point yyin to it. If we run make and ./ruby program.rb we will see the expected output!

Notice, however, that our program prints a newline between each line of output:

$ ./ruby program.rb
NUMBER: 4

NUMBER: 8

NUMBER: 15

NUMBER: 16

NUMBER: 23

NUMBER: 42  

It might not be obvious where these newlines are coming from, but let's take a look at the patterns page of Flex again. If you look at what . matches, you'll notice:

‘.’ any character (byte) except newline

Each number in program.rb is followed by a newline. Flex isn't matching the newline, so it gets printed to the screen. Let's modify our rules slightly to handle this situation. We'll add the following rule above the . rule:

\n {}

This rule will match newlines and then do nothing with them. Let's make again and run ./ruby program.rb. Voila! Much better!

$ ./ruby program.rb
NUMBER: 4  
NUMBER: 8  
NUMBER: 15  
NUMBER: 16  
NUMBER: 23  
NUMBER: 42  

Currently, we're ignoring invalid tokens. Future work will be easier if we print them out with a message instead. Let's change our . rule from:

. {}

to:

. { fprintf(stderr, "Unknown token %s\n", yytext); }

Now, if we hit an invalid character that we don't have a rule for, we'll see an error message like:

Unknown token #  

Our program currently contains both C code and Flex directives. The Flex directives define the lexer, and the C code instantiates it and runs our program. Those seem like separate concerns, so let's split this code into two separate files. First, let's change our Makefile to add a dependency for main.c:

all: ruby                   

ruby: main.c lex.yy.c  
    cc -o ruby main.c lex.yy.c

lex.yy.c: ruby.l  
    flex ruby.l

clean:  
    rm -rf lex.yy.c ruby

Next, we'll create our main.c file:

#include <stdio.h>                
#include <stdlib.h>               

int main(int argc, char *argv[]) {  
  if (argc > 1) {                 
    yyin = fopen(argv[1], "r");   
  }                               
  yylex();                        

  return EXIT_SUCCESS;            
}                                 

Your ruby.l should now look like this:

%{                                        
  #include <stdio.h>
%}

%option noyywrap

%%

[0-9]+ { printf("NUMBER: %s\n", yytext); }
\n {}
. { fprintf(stderr, "Unknown token %s\n", yytext); }

Make sure to remove stdlib.h from the includes because it's no longer a dependency in ruby.l. You can also remove the last %% in ruby.l because the user-code section is gone.

It might look like we're all set, but when we try to run make we hit a small snag:

$ make
flex ruby.l  
cc -o ruby main.c lex.yy.c  
main.c:6:5: error: use of undeclared identifier 'yyin'  
    yyin = fopen(argv[1], "r");
    ^
main.c:8:3: warning: implicit declaration of function 'yylex' is invalid in C99 [-Wimplicit-function-declaration]  
  yylex();
  ^

We're seeing this error and warning because yyin and yylex are defined inside of lex.yy.c. Our main.c doesn't know anything about them.

We can easily fix this by using the extern keyword from C. Above our main function we'll add the following:

extern FILE* yyin;  
extern int yylex(void);  

extern lets the compiler know that it doesn't need to complain, because it can expect these identifiers to be defined in another file. If we run make again, everything compiles cleanly.

In the next post we'll start adding more rules to cover the rest of Ruby's syntax.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left it below!

Update: Read Part 3 of this series.