This is the Wiki page for Compilers101, a Yahoo Groups mailing list for people writing their first compiler.

We recently lost our wiki in a disk crash, but we are now in the process of restoring the previous data. For now the pages will not be editable, and some links make be broken; please have patience, we'll have it back as soon as we can.

1. Coding Competition!

I'm told that sushi chefs are never told by their instructors when they are ready to work on their own - what they do is watch the instructor until they know they are ready, and then they simply start making their own sushi.

This coding competition is sushi graduation day for compiler writers. Once you are confident that you are ready to write your own compiler, implement the language described here and upload it to the group site. Consider it like an "apprentice piece" from the days of old.

I am hoping that we will over time collect several implementations of the same language, all done in different styles using different techniques and tools. Hopefully each one will be instructive and elegant in its own way.

Perhaps one day we may even invite some 'celebrity compiler writers' to show us their stuff. The language we've chosen is very simple, and an experienced compiler writer may even be able to implement it in a few hours; the rest of us should probably put aside a weekend to do it properly.

The idea for this came when Peter Gray dropped in on the compilers101 group to announce that he had released a port of Jack Crenshaw's compiler in C. Crenshaw's compiler was written as a tutorial for people who had not written a compiler before - very much in the spirit of this group, except that in compilers101 we try to start where Crenshaw leaves off. Crenshaw's compiler is much in the style of Wirth's early Pascals - top-down procedural recursive descent with a simple stack machine for code generation. It will be interesting to compare against our efforts.

Enter here for the competition rules.

2. Older stuff

Members of the group should add their own content - I'm going to start it off by documenting my own demo code below:

The compiler in the page linked to above is pretty simple, but it has enough structure in place to be a good starting point for a first compiler project. At the moment, it compiles source like this:

var i,j,k,l
j = 44
k = 22
loop
  if ((((j/2) == k) && (j != 44)) || ((j+k) == 100)) i = (2 + 3) * ((4 - 5) / 6) else break
  j = j+1
endloop

and generates output like the listing below. Note, it is very heavy in diagnostic output. Although this was added primarily because this is a teaching tool, I strongly recommend that you leave it in your own compiler, because it greatly simplifies debugging both language development and code generation.

#include <stdio.h>
int main(int argc, char **argv) {

int _t0; int _t1; int _t2; int _t3; int _t4; int _t5; int _t6;
int _t7; 
// other temporaries deleted for brevity...

The first phase of compiling is simply to generate an Abstract Syntax Tree (AST). This is similar to a parse tree, but boiled down to the bare essentials.

// --------------------- PARSE PHASE ----------------------
// 0: SEQUENCE -1 -1
// 3: LINE 1: var i,j,k,l
// 6: TAG i[@Stringpool 0]
// 9: TAG j[@Stringpool 2]
// 12: TAG k[@Stringpool 4]
// 15: TAG l[@Stringpool 6]
// 18: VAR l[@AST 15] TYPE=INT
// 21: DECLARE [@AST 18]
// 23: VAR k[@AST 12] TYPE=INT
// 26: DECLARE [@AST 23]
// 28: SEQUENCE 26 21
// 31: VAR j[@AST 9] TYPE=INT
// 34: DECLARE [@AST 31]
// 36: SEQUENCE 34 28
// 39: VAR i[@AST 6] TYPE=INT
// 42: DECLARE [@AST 39]
// 44: SEQUENCE 42 36
// 47: SEQUENCE -1 -1
    //                   ;
    //                  / \
    //                 /   \
    //                /     \
    //               /       \
    //              /         ;
    //             /         / \
    //            /         /   \
    //           /         /     \
    //          /         /       \
    //         /         /         ;
    //        /         /         / \
    //       /         /         /   \
    //      DECLARE   /         /     \
    //     /         DECLARE   /       \
    //    i         /         DECLARE   DECLARE
    //             j         /         /
    //                      k         l
// 50: LINE 2: j = 44
// 53: CONST INT 44
// 56: ASSIGN [declared @AST 31] [value @AST 53]
// 59: SEQUENCE -1 -1
    //      =
    //     / \
    //    j   44
// 62: LINE 3: k = 22
// 65: CONST INT 22
// 68: ASSIGN [declared @AST 23] [value @AST 65]
// 71: SEQUENCE -1 -1
    //      =
    //     / \
    //    k   22
// 74: LINE 4: loop
// 77: LOOP 1 2
// 80: SEQUENCE -1 -1
    //    LOOP

// 83: LINE 5:   if ((((j/2) == k) && (j != 44)) || ((j+k) == 100)) i = (2 + 3) * ((4 - 5) / 6) else break
// 86: CONST INT 2
// 89: DIV [@AST 31] [@AST 86]
// 92: [@AST 89] EQ [@AST 23]
// 95: CONST INT 44
// 98: [@AST 31] NE [@AST 95]
// 101: AND [@AST 92] [@AST 98]
// 104: ADD [@AST 31] [@AST 23]
// 107: CONST INT 100
// 110: [@AST 104] EQ [@AST 107]
// 113: OR [@AST 101] [@AST 110]
// 116: CONST INT 2
// 119: CONST INT 3
// 122: ADD [@AST 116] [@AST 119]
// 125: CONST INT 4
// 128: CONST INT 5
// 131: SUB [@AST 125] [@AST 128]
// 134: CONST INT 6
// 137: DIV [@AST 131] [@AST 134]
// 140: MUL [@AST 122] [@AST 137]
// 143: ASSIGN [declared @AST 39] [value @AST 140]
// 146: BREAK 1 2
// 149: IFTHENELSE [condition @AST 113] [then-statement @AST 143] [else-statements @AST 146]
// 153: SEQUENCE -1 -1

    //                             if (...) then ... else ...
    //                                         |
    //                      |-------------------------------------------------------|
    //                      |                                    |                  |
    //                     ||                                   =                 BREAK
    //                      |                                    |
    //              |--------------------|          |--------------|
    //              |                    |          |              |
    //             &&                   ==          i             '*'
    //              |                    |                         |
    //        |------------|          |-------|             |-----------|
    //        |            |          |       |             |           |
    //       ==           !=        '+'      100          '+'          '/'
    //        |            |          |                     |           |
    //      |------|    |-----|    |----|                |----|       |------|
    //      |      |    |     |    |    |                |    |       |      |
    //    '/'      k    j    44    j    k                2    3     '-'      6
    //      |                                                         |
    //   |----|                                                    |----|
    //   |    |                                                    |    |
    //   j    2                                                    4    5
// 156: LINE 6:   j = j+1
// 159: CONST INT 1
// 162: ADD [@AST 31] [@AST 159]
// 165: ASSIGN [declared @AST 31] [value @AST 162]
// 168: SEQUENCE -1 -1
    //        =
    //       / \
    //      /  '+'
    //     /   / \
    //    j   j   1
// 171: LINE 7: endloop
// 174: ENDLOOP 1 2
// 177: SEQUENCE -1 -1

    //    ENDLOOP

The next phase is to walk the data structure and generate code. In this demonstration we generate simple 3-address code using C as an intermediate language. A real compiler would generate assembly language or binary directly, but that's actually pretty easy to add.

// --------------------- CODEGEN PHASE ----------------------

// 3: LINE 1: var i,j,k,l
// 44: SEQUENCE 42 36

    //                   ;
    //                  / \
    //                 /   \
    //                /     \
    //               /       \
    //              /         ;
    //             /         / \
    //            /         /   \
    //           /         /     \
    //          /         /       \
    //         /         /         ;
    //        /         /         / \
    //       /         /         /   \
    //      DECLARE   /         /     \
    //     /         DECLARE   /       \
    //    i         /         DECLARE   DECLARE
    //             j         /         /
    //                      k         l
    int i;
    int j;
    int k;
    int l;

// 50: LINE 2: j = 44
// 56: ASSIGN [declared @AST 31] [value @AST 53]
    //      =
    //     / \
    //    j   44
    _t53 = 44;
    j = _t53;

// 62: LINE 3: k = 22
// 68: ASSIGN [declared @AST 23] [value @AST 65]
    //      =
    //     / \
    //    k   22
    _t65 = 22;
    k = _t65;

// 74: LINE 4: loop
// 77: LOOP 1 2

    //    LOOP

L01:

// 83: LINE 5:   if ((((j/2) == k) && (j != 44)) || ((j+k) == 100)) i = (2 + 3) * ((4 - 5) / 6) else break
// 149: IFTHENELSE [condition @AST 113] [then-statement @AST 143] [else-statements @AST 146]
    //                             if (...) then ... else ...
    //                                         |
    //                      |-------------------------------------------------------|
    //                      |                                    |                  |
    //                     ||                                   =                 BREAK
    //                      |                                    |
    //              |--------------------|          |--------------|
    //              |                    |          |              |
    //             &&                   ==          i             '*'
    //              |                    |                         |
    //        |------------|          |-------|             |-----------|
    //        |            |          |       |             |           |
    //       ==           !=        '+'      100          '+'          '/'
    //        |            |          |                     |           |
    //      |------|    |-----|    |----|                |----|       |------|
    //      |      |    |     |    |    |                |    |       |      |
    //    '/'      k    j    44    j    k                2    3     '-'      6
    //      |                                                         |
    //   |----|                                                    |----|
    //   |    |                                                    |    |
    //   j    2                                                    4    5
    _t31 = j;
    _t86 = 2;
    _t89 = (_t31 / _t86);
    _t23 = k;
    _t92 = (_t89 == _t23);
    if (!_t92) goto L06;
L07:
    _t31 = j;
    _t95 = 44;
    _t98 = (_t31 != _t95);
    if (_t98) goto L03;
L06:
    _t31 = j;
    _t23 = k;
    _t104 = (_t31 + _t23);
    _t107 = 100;
    _t110 = (_t104 == _t107);
    if (!_t110) goto L05;
L03:
    _t116 = 2;
    _t119 = 3;
    _t122 = (_t116 + _t119);
    _t125 = 4;
    _t128 = 5;
    _t131 = (_t125 - _t128);
    _t134 = 6;
    _t137 = (_t131 / _t134);
    _t140 = (_t122 * _t137);
    i = _t140;
    goto L04;
L05:
    goto L02;
L04:

// 156: LINE 6:   j = j+1
// 165: ASSIGN [declared @AST 31] [value @AST 162]
    //        =
    //       / \
    //      /  '+'
    //     /   / \
    //    j   j   1
    _t31 = j;
    _t159 = 1;
    _t162 = (_t31 + _t159);
    j = _t162;

// 171: LINE 7: endloop
// 174: ENDLOOP 1 2

    //    ENDLOOP

    goto L01;
L02:

    exit(0);
    return(0);
}

So, how did we do that? Here you go, the next two sections are the guts of the compiler...

Let's start with the parser. Any parser tool will do, this particular one happens to be something I wrote myself for another project. You should probably use Yacc yourself.

statement:
   <eof>          {
                    int i;
                    exit_flag = TRUE;
                    $$ = ERROR;
                  }
| "#.*"           { /* comment */ $$ = ERROR; }
| "var" <sp> <newname> <sp> <declist> {
                    if ($5 != ERROR) {
                      $$ = entrypoint = make_binary_tuple(SEQUENCE, make_unary_tuple(DECLARE, make_binary_tuple(VAR, $3, INT)), $5);
                    } else {
                      $$ = entrypoint = make_unary_tuple(DECLARE, make_binary_tuple(VAR, $3, INT));
                    }
                  }
| <simpleclause>  {
                    /* handled in simpleclause */
                    $$ = entrypoint = $1;
                  }
| "input" <sp> <name> {
                    $$ = entrypoint = make_unary_tuple(INPUT, $3);
                  }
| "print" <sp> <expr> {
                    $$ = entrypoint = make_unary_tuple(PRINT, $3);
                  }
| "if" <sp> "\(" <expr> "\)" <sp> <simpleclause> <optelse> {
                    if ($8 == ERROR) {
                      $$ = entrypoint = make_binary_tuple(IFTHEN, $4, $7);
                    } else {
                      $$ = entrypoint = make_nary_tuple(IFTHENELSE, $4, $7, $8);
                    }
                  }
| "loop"         {
                    int loopstart, exitlab;
                    loopstart = ++nextlab;
                    push_block(LOOP, loopstart);
                    exitlab = ++nextlab;
                    push_block(ENDLOOP, exitlab);
                    $$ = entrypoint = make_binary_tuple(LOOP, loopstart, exitlab);
                  }
| "endloop"        {
                    int loopstart, exitlab;
                    // this should be in codegen, not parser
                    pop_block(ENDLOOP, &exitlab);
                    pop_block(LOOP, &loopstart);
                    //put_label(loopstart);
                    //put_label(exitlab);
                    //put_goto(loopstart);
                    $$ = entrypoint = make_binary_tuple(ENDLOOP, loopstart, exitlab);
                  }
| <!multi> ".*" "$"        { // what happens when we have an error
                             // in a line that does contain a ';'???
                    fprintf(stderr, "Syntax: %s\n\n", @1.text);
                    $$ = ERROR;
                  }
;

simpleclause: <assignment> {
                    $$ = $1;
                  }
| "break"         {
                    int loopstart, exitlab;
                    pop_block(ENDLOOP, &exitlab);
                    pop_block(LOOP, &loopstart);
                    push_block(LOOP, loopstart);
                    push_block(ENDLOOP, exitlab);
                    // put_label(exitlab);
                    $$ = make_binary_tuple(BREAK, loopstart, exitlab);
                    // TO DO: take sequence following LOOP, hook it into
                    // slot in loop, so that SEQUENCE following LOOP
                    // is the code *after* the loop.
                  }
;

optelse: <sp> "else" <sp> <simpleclause> {
                    $$ = $4;
                  }
| "" { $$ = ERROR; };

assignment: <name> "[ \t]*=[ \t]*" <expr> {
                    $$ = make_binary_tuple(ASSIGN, $1, $3);
                  };

multi: ".*;" { };

declist: "," <sp> <newname> <sp> <declist> {
                    if ($5 != ERROR) {
                      $$ = make_binary_tuple(SEQUENCE, make_unary_tuple(DECLARE, make_binary_tuple(VAR, $3, INT)), $5);
                    } else {
                      $$ = make_unary_tuple(DECLARE, make_binary_tuple(VAR, $3, INT));
                    }
                  }
| "" { $$ = ERROR; }
;

termin:  ";[ \t]*" { } | "[ \t]*$" { };

sp: "[ \t]*" { };

expr: <term> "[ \t]*" <op> "[ \t]*" <expr> <sp>
                     {
                       $$ = make_binary_tuple(/* op */$3, /*left*/$1, /*right*/$5);
                     }
|     <term>         { $$ = $1; }
;

monop: "-" { $$ = NEG; } | "~" { $$ = NOT; /* "\\" didn't work - parser bug */};

op: "\*" { $$ = MUL; }  /* no pretence about operator precedence yet! */
|   "+" { $$ = ADD; }
|   "-" { $$ = SUB; }
|   "/" { $$ = DIV; }
|   "\&\&" { $$ = AND; }
|   "\|\|" { $$ = OR; }
|   "==" { $$ = EQ; }
|   "<=" { $$ = LE; }
|   "<" { $$ = LT; }
|   ">=" { $$ = GE; }
|   ">" { $$ = GT; }
|   "!=" { $$ = NE; }
;

term: <monop> <simpleterm> { /* -fred is OK, --fred is not */
                         $$ = make_unary_tuple($1, $2);
                       }
| <simpleterm> {
                 $$ = $1;
               }
;

simpleterm: <number>       { $$ = $1; }
|     <name>         { $$ = $1; }
|     "\(" <expr> "\)" { $$ = $2; }
;

number: "[0-9][0-9]*" { $$ = make_int_const(INT, @1.text); };

name: "[a-z][a-z0-9]*" { $$ = getvar(@1.text); };
newname: "[a-z][a-z0-9]*" {
                         /* One of the advantages of not tokenising first
                            is that we can have context-sensitive actions
                            for the same token.  Here we enter variables
                            into the namelist on the fly as we parse.

                            However if we were stricter about making an AST
                            first then compiling from the finished parse
                            tree, then this would not be helpful. */
                         $$ = newtag(@1.text);
};

The next section is the code generator. That's it - it's really that simple.

/* This is the type of objects passed around as $n in the tacc parser  */
#define USERTYPE TRIP

static TRIP entrypoint;  // temporary hack until code is restructured

// nested blocks/control structures...
// not much used yet in the demo language
// purpose is to find enclosing loops for break,
// make sure something that starts as a for loop doesn't finish
// as an until, etc :-)

// This is most needed when compiling statement at a time rather
// than recursing in the parser to compile a whole block.

#define MAX_NEST 40
static int control_block[MAX_NEST];
static int type_check[MAX_NEST];
static int next_nest = 0;

void push_block(int typecheck, int lab)
{
  // when we actually have more types of nested structure, we'll
  // go to the bother of checking that starts and finishes match up...
  control_block[next_nest++] = lab;
}

void pop_block(int typecheck, int *lab)
{
  *lab = control_block[--next_nest];
}

static int nextlab = 0;

// Codegen is the guts of the compiler, which effectively serialises
// the AST into sequentially executable statements.  A further phase
// is required to actually generate executable code.

// (However... if you just want to interpret, you can just about
// get away with doing that directly in the AST without this stage)


void codegen(TRIP root, int thenlab, int elselab, int donelab) {
  // the labels are only used in short-circuit conditionals - it's
  // easier to handle that here than have a complex structure of
  // mutually recursive routines.  When not needed they will be -1.

  TRIP cond, thenpart, elsepart;

  switch (opsym(root)) {

  case CONST:
    loadconst("_t%d = %d;\n", root, rightchild(root));
    break;

  case DECLARE:
    root = leftchild(root);
    declare(stringpool+rightchild(leftchild(root)));
    break;

  case VAR:
    loadvar("_t%d = %s;\n", root, nameof(root));
    break;

  case IFTHENELSE:
    /* We originally used two trips to simulate a three-way branch, but
       now that we have n-ary tuples, this code has been changed to make
       the new IFTHENELSE opcode use a quad instead.

       Note how the use of this simple integer-base AST compares to one
       using structs; instead of root->leftchild we have leftchild(root) etc.
       And we have an arbitrary number of 'struct elements' using the
       'nthchild()' macro.  This saves considerable space (or complexity)
       over most schemes for ASTs that extend the format from plain triples. */

    cond = leftchild(root); thenpart = rightchild(root); elsepart = nthchild(root, 3);
    {
      int thenlab = ++nextlab;
      int donelab = ++nextlab;
      int elselab = ++nextlab;
      codegen(cond, thenlab, elselab, thenlab);
      put_label(thenlab);
      codegen(thenpart, -1, -1, -1);
      put_goto(donelab);
      put_label(elselab);
      codegen(elsepart, -1, -1, -1);
      put_label(donelab);
    }
    break;

  case IFTHEN:
    /* Simple if ... then ... */
    cond = leftchild(root); thenpart = rightchild(root);
    {
      int thenlab = ++nextlab;
      int donelab = ++nextlab;
      codegen(cond, thenlab, donelab, thenlab);
      put_label(thenlab);
      codegen(thenpart, -1, -1, -1);
      put_label(donelab);
    }
    break;

  case ASSIGN:
    codegen(rightchild(root), -1, -1, -1);
    store("%s = _t%d;\n", nameof(leftchild(root)), rightchild(root));
    break;

  case LOOP:
    put_label(leftchild(root));
    break;

  case ENDLOOP:
    put_goto(leftchild(root));
    put_label(rightchild(root));
    break;

  case BREAK:
    put_goto(rightchild(root));
    break;

  case SEQUENCE:
    codegen(leftchild(root), -1, -1, -1);
    codegen(rightchild(root), -1, -1, -1);
    break;

  // handling short-circuit evaluation of booleans is a bit tricky
  // and I don't guarantee that this code is 100% the right way
  // to do it - actually it depends a lot on the language being
  // compiled.  Algol differs from Pascal differs from C, for example.

  case AND:
  case OR:
    {
    int truelab = thenlab, falselab = elselab, dropthrough = donelab;
    int thenlab, elselab, donelab;

    int nexttestlab = ++nextlab;
    if (opsym(root) == AND) {
      thenlab = nexttestlab; elselab = falselab;
    } else {
      thenlab = truelab; elselab = nexttestlab;
    }
    donelab = nexttestlab;
    codegen(leftchild(root), thenlab, elselab, donelab);
    /* if the first one was true, drop through to next part of && and check it too */
    put_label(nexttestlab);
    codegen(rightchild(root), truelab, falselab, dropthrough);
    /* drop through may be to truelab; if not (because we're nested), jump to truelab */

    }
    break;

  case INPUT:
    input(root);
    break;

  case PRINT:
    codegen(leftchild(root), -1, -1, -1);
    print(root);
    break;

  case NEG:
  case NOT:
    codegen(leftchild(root), thenlab, elselab, donelab);
    monoperate("_t%d = %s_t%d;\n", root, c_infix_op[opsym(root)], leftchild(root));
    break;

  default:
    /* Be careful not to default anything other than binary operators! */
    if (arity[opsym(root)] != 3) {
      fprintf(stdout, "*** Not Implemented: codegen(%s)\n", name[opsym(root)]);
      break;
    }

    codegen(leftchild(root), thenlab, elselab, donelab);
    codegen(rightchild(root), thenlab, elselab, donelab);
    operate("_t%d = (_t%d %s _t%d);\n", root,
            leftchild(root), c_infix_op[opsym(root)], rightchild(root));
    {int op;
      // maybe a case statement would be more efficient here, or a range test
      if (((op = opsym(root)) == EQ) || (op == NE) || (op == LT) || (op == GT) || (op == LE) || (op == GE)) {
        if (thenlab != donelab) put_ifgoto(root, thenlab, TRUE);
        if (elselab != donelab) put_ifgoto(root, elselab, FALSE);
      }
    }
    break;
  }
}

Keywords: Let's build a compiler/How to write a compiler/How to write a simple compiler/Writing a simple compiler/Let's write a compiler/My first compiler/A trivial compiler/Compiler-writing tutorial/Compiler-writer primer/A primer on how to write a compiler.