Reconstructing Ruby, Part 9: Memory leaks

Read Part 8 in case you missed it.

Right now our ruby program is leaking memory and we should do something about it. In order to figure out what to do, we should use a tool to help us diagnose where we have leaks. One such tool for the job is Valgrind, but unfortunately for Mac users, Valgrind won't work.

That's were Docker is going to come in. Docker will allow us to run one-off commands in a virtual machine. So the idea here will be to setup a Docker image with Valgrind and then run our program through it. The first step is to install Docker from the page above and then if you're on OSX install boot2docker. After installing boot2docker, make sure you run the following commands:

$ boot2docker init
$ boot2docker start
$ $(boot2docker shellinit)

Now we're ready to create our Docker image. First create a file called Dockerfile and put the following into it:

FROM debian
RUN apt-get update && apt-get install --no-install-recommends -y build-essential flex bison valgrind
WORKDIR /usr/src/ruby

That will provide us with all the tools we need. Now to build the image, run the following:

$ docker build -t ruby .

Now we can run Valgrind. To make it easy, add the following to the Makefile:

check:
     docker run -v `pwd`:/usr/src/ruby ruby bash -c "make clean && make && valgrind --leak-check=full --show-reachable=yes ./ruby program.rb"

For the purposes of these memory checks, your program.rb should only contain the digit 1 and nothing else. We'll add more later to catch some additional memory leaks.

This command mounts our current directory on the Docker image and then runs make clean, make, and Valgrind. If we run make check now, we see the following important lines:

definitely lost: 11 bytes in 1 blocks
...
still reachable: 17,026 bytes in 4 blocks

Let's fix the "definitely lost" ones first. If we look at the output you should see something like this:

==25== 11 bytes in 1 blocks are definitely lost in loss record 2 of 5
==25==    at 0x4C28BED: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25==    by 0x4EAF8E1: strdup (strdup.c:43)
==25==    by 0x400C42: main (in /usr/src/ruby/ruby)

Here we can see that strdup was called from main and it's value is never freed, but it would be nice to have more debugging information. Let's change our Makefile to help us out a bit. Replace:

cc -o ruby ${SRC}

with:

cc -O0 -g -o ruby ${SRC}

The two flags help provide debug information (the second flag is the letter O and the number 0):

   -O0 Means "no optimization": this level compiles the fastest and generates the most debuggable code.
 -g  Generate debug information.  Note that Clang debug information works best at -O0.

Using the -g flag will dump a folder of debug information into ruby.dSYM. You'll want to add the following to your .gitignore:

ruby.dSYM

Now when we run make check we get much more helpful output:

==25== 11 bytes in 1 blocks are definitely lost in loss record 2 of 5
==25==    at 0x4C28BED: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25==    by 0x4EAF8E1: strdup (strdup.c:43)
==25==    by 0x400C42: main (main.c:14)

We now know our line leaking memory is on main.c:14

state.source_file = strdup(argv[1]);  

In this case, the strdup is unnecessary and can be removed.

The next memory leak we'll look at is this one:

==26== 568 bytes in 1 blocks are still reachable in loss record 3 of 4
==26==    at 0x4C28BED: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==26==    by 0x4E972AA: __fopen_internal (iofopen.c:76)
==26==    by 0x400C53: main (main.c:15)

And our violating line is:

yyin = fopen(argv[1], "r");  

We've opened a file, but we've failed to close it. We'll fix this by adding the following to main.c after yyparse is called:

if(argc > 1) {  
  fclose(yyin);
}

The next three errors are a bit trickier. Let's take a look at one of them:

==25== 8 bytes in 1 blocks are still reachable in loss record 1 of 3
==25==    at 0x4C28BED: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25==    by 0x40312E: yyalloc (lex.yy.c:2020)
==25==    by 0x402C3B: yyensure_buffer_stack (lex.yy.c:1717)
==25==    by 0x401404: yylex (lex.yy.c:758)
==25==    by 0x400F46: yyparse (parse.tab.c:1364)
==25==    by 0x400CB6: main (main.c:18)

Here we can see that after we call yyparse from main, it calls yylex. The meaning of this is that we haven't freed up some of the memory the lexer has used. In order to fix this, we need to tell the lexer we're done. We'll do this by adding two things. With our other externs in main, add the following:

extern int yylex_destroy(void);  

and then right before the return add this:

yylex_destroy();  

Now when we run make check we come back all clear. But we are leaking memory elsewhere. Let's change our program.rb to cover all of our cases. Make you program.rb look like this:

class Foo  
  def initialize(arg1)
    x = arg1
    x.length          
  end
end

Foo.new("bar")  
Foo.new('bar')  

When we run make check again we see 4 more memory leaks. If you look in ruby.l you'll notice we have 4 calls to strdup. We'll need to free this memory. This is going to take a bit of work, which we'll end up undoing later, but it's a good exercise for the time being. Every line we have a tID, tCONSTANT, and tSTRING and free it. For example, change this line:

| tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN

to:

| tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN {
  free($<sval>1);
  free($<sval>3);
  free($<sval>5);
}

This is a way of typecasting our values. There is however an easier way of handling this. If we define our tokens to have a particular type then we won't need to typecast it later. Up where we've defined our tokens extract out the following:

%token <sval> tSTRING tCONSTANT tID

In which case we can write our previous line as follows:

| tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN { 
  free($1);
  free($3);
  free($5);
}

If we do the same to all the other declarations with a tSTRING, tCONSTANT, and tID and then run make check we'll see we've eliminated all other memory issues. We'll want to run our make check frequently to make sure we don't introduce any memory leaks in our implementation.

At this point our next steps will be to generate an AST of nodes from our grammar.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 8: Make our parser reentrant

Read Part 7 in case you missed it.

Reentrantcy basically means our parser doesn't use any global state. This will eventually allow us to write threaded programs. Since Ruby supports threading (kind of) we'll want to make our parser reentrant. We'll talk more about threading once we start working on the actual implementation, but for now, we'll just make the necessary changes to our parser.

Luckily bison provides a convenient directive for generating a reentrant parser. All we need to do is add the following to our parser:

%pure-parser

This makes it so yylval, yylloc, and other internal variables are no longer global variables but local. Since they are all now local variables we're we could potentially run yyparse in multiple threads without consequence.

At this point, running make will generate a bunch of errors. We'll tackle this set first:

parse.tab.c:1368:16: error: too many arguments to function  
  call, expected 0, have 1
      yychar = YYLEX;
               ^~~<del>
parse.tab.c:713:23: note: expanded from macro 'YYLEX'  
# define YYLEX yylex (&yylval)
               </del>~<del>  ^</del>~~~~
parse.y:29:3: note: 'yylex' declared here  
  extern int yylex(void);
  ^
1 error generated.  

The first error is telling us that yylex now takes an argument. In this case, yylval needs to be passed in. We'll need to change our extern to reflect this. Move this line:

  extern int yylex(void);

below our %union directive and change void to YYSTYPE *yylval like so:

%union {
  int ival;
  float fval;
  char *sval;
}

%{
  extern int yylex(YYSTYPE *yylval);
%}

We need to have this declaration below %union because the union directive will define the type YYSTYPE.

To fix the next error:

ruby.l:21:3: error: use of undeclared identifier 'yylval'  
{ yylval.sval = strdup(yytext); TOKEN(STRING); }

We'll need to change the definition of yylex in our lexer. It should now accept yylval as an argument. We'll do this by definining a YY_DECL macro in our ruby.l file like so:

%{
  #include <stdio.h>
  #include <string.h>
  #include "node.h"
  #include "parse.tab.h"
  #define YY_DECL int yylex(YYSTYPE *yylval)
  #define TOKEN(id) return t##id
  #define KEYWORD(id) return k##id
%}

If you run make now, you'll see it complains about yylval being a pointer and that we should use -> instead of .. Let's make those changes to our lexer rules:

\"([^"]|\\.)*\" { yylval->sval = strdup(yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { yylval->sval = strdup(yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) { yylval->fval = atof(yytext); TOKEN(FLOAT); }
{NUMBER} { yylval->ival = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* { yylval->sval = strdup(yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* { yylval->sval = strdup(yytext); TOKEN(CONSTANT); }

If we run make one last time everything compiles correctly. We now have a reentrant parser!

One other thing that will be useful to setup now is to change our parser to take a parameter. This parameter will be a struct that holds information about the current parser status. We'll start simple by only storing the filename and line number for now.

First let's create the file node.h where we'll store our struct definition. In the future, we'll use this file to define some nodes for our Abstract Syntax Tree or AST. Inside of the node.h file, add the following:

#ifndef _NODE_H_
#define _NODE_H_

typedef struct parser_state {  
    char *source_file;
    int source_line;
} parser_state;

#endif

In the actual ruby implementation, this struct is known as parser_params. source_file and source_line are both parser_ruby_sourcefile and parser_ruby_sourceline respectively.

Now that we've defined our struct we'll need to include node.h before we include parse.tab.h anywhere. You'll need to make this change in both main.c and ruby.l. You'll also want to include node.h at the top of parse.y.

Now right below %pure-parser add the following lines:

%parse-param { parser_state *state }
%lex-param { state }

The first line changes yyparseto take in an argument to manage it's state. The second line also passes this state to the lexer. We'll need to make some changes to yylex as well. First we'll need to change our forward declaration of yylex in parse.y to the following:

%{                                                       
  extern int yylex(YYSTYPE *yylval, parser_state *state);
%}                   

Then you'll change this line in ruby.l:

#define YY_DECL int yylex(YYSTYPE *yylval)

to:

#define YY_DECL int yylex(YYSTYPE *yylval, parser_state *state)

Then in order to pull this all together we'll need to change main so that it passes in a parser_state, and additionally, we'll set the filename:

int main(int argc, char *argv[]) {  
  parser_state state = { NULL, 0 };

  if(argc > 1) {
    state.source_file = strdup(argv[1]);
    yyin = fopen(argv[1], "r");
  }

  yyparse(&state);
  return EXIT_SUCCESS;
}

You'll also need to add #include <string.h> to the top of your main.c in order to support strdup. Lastly, in order to see our parser state working correctly change the program rule in parse.y to the following:

program: expressions { printf("%s\n", state->source_file); }  

Now let's try it out:

$ make
$ ./ruby program.rb
program.rb  

Nice it works!

One problem we currently have in our implementation is that we're leaking memory. In the next post I'll discuss setting up Valgrind, a memory profiler, using Docker (since OSX doesn't support Valgrind very well) in order to catch and fix our memory leaks.

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Reconstructing Ruby, Part 7: Expanding our grammar

Read Part 6 in case you missed it.

When we left off, we were getting a syntax error from running ./ruby program.rb. If you're up to date, then you should see the following:

ID(class)  
syntax error. Unexpected "class" on line 2  

We can see that the lexer is identifying class as an ID and that our grammar doesn't expect to see a token of typeID. Ideally, our lexer would treat this as a keyword since class is something special in Ruby.

The typical definition for a class in ruby looks something like:

class Foo  
  # expressions
end  

Since defining a class requires two keywords class and endlet's add them bost as keywords. In order to do this, we'll add the following rules to the top of rules in ruby.l:

class { KEYWORD(CLASS); }  
end { KEYWORD(END); }  

And then add the following macro below our TOKEN macro:

#define KEYWORD(id) return k##id

In our parse.y file we'll want to add tokens for our keywords:

%token kCLASS kEND

Now we'll add a rule to our grammar for defining classes. Modify expressions to contain the following:

expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND

Now when we run make && ./ruby program.rb we get the following:

CONSTANT(Greeter)  
ID(def)  
syntax error. Unexpected "def" on line 3  

For def, we'll follow the same process we did for class. The grammar rules will look a little different:

| kDEF tID expressions kEND
| kDEF tID tLPAREN tID tRPAREN expressions kEND

While this doesn't handle all of the cases yet, it handles both of our method definitions in our example app. Now when we run make && ./ruby program.rb we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
syntax error. Unexpected "@" on line 4  

Let's define rules in our grammar for variable:

variable: tID  
        | tAT tID

Then expand expression to contain:

| variable

Now our error message is:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
syntax error. Unexpected "=" on line 4  

So let's add assignment to our grammar.

assignment: variable tEQUAL expression  
expression: tNUMBER  
          | expression tPLUS expression { printf("%d\n", $1 + $3); }
          | kCLASS tCONSTANT expressions kEND
          | kDEF tID expressions kEND
          | kDEF tID tLPAREN tID tRPAREN expressions kEND
          | variable
          | assignment

If we run make && ./ruby program.rb we see the dreaded shift/reduce...

parse.y: conflicts: 1 shift/reduce  

This one is fairly easy to resolve. In the case of equality we want everything to the right of the equal sign to be resolved first so we're going to add the following to parse.y:

%right tEQUAL

Now when we run make && ./ruby program.rb we don't see a shift reduce, but instead get the following error:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
syntax error. Unexpected "." on line 4  

Now we'll write grammar rules for all of the method calls in our example program:

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

And add method_call to our expression rule:

| method_call

Again these are very specific to our program, but we'll extend it later on. Now when we run our program we see the following:

CONSTANT(Greeter)  
ID(initialize)  
ID(name)  
ID(name)  
ID(name)  
ID(capitalize)  
ID(salute)  
ID(puts)  
STRING("Hello #{@name}!")  
ID(g)  
CONSTANT(Greeter)  
ID(new)  
STRING("world")  
ID(g)  
ID(salute)  

So, we no longer have any syntax errors, but we shouldn't be printing the token type and value. It would be great if we could both identify the token and pass the value like we did with numbers:

{NUMBER} { yylval = atoi(yytext); TOKEN(NUMBER); }

Let's change CONSTANT to do exactly this. Change:

[A-Z][a-zA-Z0-9_]* { VTYPE("CONSTANT", yytext); TOKEN(CONSTANT); }

to:

[A-Z][a-zA-Z0-9_]* { yylval = strdup(yytext); TOKEN(CONSTANT); }

At the top of ruby.l with the other #include directives, you'll also want to add:

#include <string.h>

Note: We want to duplicate yytext because it's not a value we control and it could be freed while we are using it. If you're running this on mingw32 you might not have strdup and will need to define your own.

If you run make you'll see the following warning:

ruby.l:24:10: warning: incompatible pointer to integer conversion  
assigning to 'YYSTYPE' (aka 'int') from 'char *' [-Wint-conversion]  

This is because we want to store a char * in yylval, but it's type is int. We ideally want to be able to store both types, depending on the token. Additionally, we have a float we'd like to store as well.

In order to do this we need to change yylval into a union. Let's add the following to parse.y above the %left precedence declaration:

%union {
  int ival;
  float fval;
  char *sval;
}

Then, we'll need to change all the references of yylval to use the union values. Let's remove all the VTYPE calls as well as the VTYPE directive. You should have the following:

\"([^"]|\\.)*\" { yylval.sval = strdup(yytext); TOKEN(STRING); }
\'([^']|\\.)*\' { yylval.sval = strdup(yytext); TOKEN(STRING); }
{NUMBER}(\.{NUMBER}|(\.{NUMBER})?[eE][+-]?{NUMBER}) {
  yylval.fval = atof(yytext); TOKEN(FLOAT); }
{NUMBER} { yylval.ival = atoi(yytext); TOKEN(NUMBER); }
[a-z_][a-zA-Z0-9_]* {yylval.sval = strdup(yytext); TOKEN(ID); }
[A-Z][a-zA-Z0-9_]* {
  yylval.sval = strdup(yytext); TOKEN(CONSTANT); }

Just to make things easier in the future, let's refactor our parser grammar to be easier to read:

program: expressions

expressions: expressions expression  
           | expression

expression: class_definition  
          | binary_expression
          | method_definition
          | variable
          | assignment
          | method_call
          | value

binary_expression: expression tPLUS expression

assignment: variable tEQUAL expression

class_definition: kCLASS tCONSTANT expressions kEND

method_definition: kDEF tID expressions kEND  
                 | kDEF tID tLPAREN tID tRPAREN expressions kEND

method_call: variable tDOT tID  
           | tCONSTANT tDOT tID tLPAREN tSTRING tRPAREN
           | tID tSTRING

variable: tID  
        | tAT tID

value: tNUMBER  

If you were to run make now, you'd see the following error:

parse.y:37.65-66: $1 of ‘binary_expression’ has no declared type  
parse.y:37.70-71: $3 of ‘binary_expression’ has no declared type  

This is happening because we have now defined yylval as a union and each rule needs to reduce to an acceptable values for each rule the terminal is used in. We haven't defined what $1 and $3 are supposed to be in:

binary_expression: expression tPLUS expression {  
  printf("%d\n", $1 + $3); }

For now, just remove the { printf("%d\n", $1 + $3); } part. Running make && ./ruby program.rb should produce no compile or runtime errors.

In the next post, we'll make our parser reentrant!

If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!

Update: read Part 8 of this series.