LET'S BUILD A COMPILER!

                                By

                     Jack W. Crenshaw, Ph.D.

                           2 April 1989


                  Part VIII: A LITTLE PHILOSOPHY


*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************


INTRODUCTION

This is going to be a  different  kind of session than the others
in our series on  parsing  and  compiler  construction.  For this
session, there won't be  any  experiments to do or code to write.
This  once,  I'd  like  to  just  talk  with  you  for  a  while.
Mercifully, it will be a short  session,  and then we can take up
where we left off, hopefully with renewed vigor.

When  I  was  in college, I found that I could  always  follow  a
prof's lecture a lot better if I knew where he was going with it.
I'll bet you were the same.

So I thought maybe it's about  time  I told you where we're going
with this series: what's coming up in future installments, and in
general what all  this  is  about.   I'll also share some general
thoughts concerning the usefulness of what we've been doing.


THE ROAD HOME

So far, we've  covered  the parsing and translation of arithmetic
expressions,  Boolean expressions, and combinations connected  by
relational  operators.    We've also done the  same  for  control
constructs.    In  all of this we've leaned heavily on the use of
top-down, recursive  descent  parsing,  BNF  definitions  of  the
syntax, and direct generation of assembly-language code.  We also
learned the value of  such  tricks  as single-character tokens to
help  us  see  the  forest  for  the  trees.    In  the  last
installment  we dealt with lexical scanning,  and  I  showed  you
simple but powerful ways to remove the single-character barriers.

Throughout the whole study, I've emphasized  the  KISS philosophy
... Keep It Simple, Sidney ... and I hope by now  you've realized
just  how  simple  this stuff can really be.  While there are for
sure areas of compiler  theory  that  are truly intimidating, the
ultimate message of this series is that in practice you  can just
politely  sidestep   many  of  these  areas.    If  the  language
definition  cooperates  or,  as in this series, if you can define
the language as you go, it's possible to write down  the language
definition in BNF with reasonable ease.  And, as we've  seen, you
can crank out parse procedures from the BNF just about as fast as
you can type.
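
To make that claim a little more concrete, here is a minimal sketch,
written in Python rather than the Pascal used elsewhere in this
series. It assumes a tiny grammar with single-digit factors:

     <expression> ::= <term> { ('+' | '-') <term> }
     <term>       ::= <factor> { '*' <factor> }
     <factor>     ::= <digit> | '(' <expression> ')'

The names Parser and compile_expression are made up for this
illustration, and the 68000-style output only imitates the flavor of
the code generated in earlier installments. Each BNF rule turns into
one procedure, more or less line for line:

```python
class Parser:
    """Recursive-descent parser for the three-rule grammar above."""

    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.code = []          # generated assembly lines

    @property
    def look(self):
        # One character of lookahead; "" means end of input.
        return self.source[self.pos] if self.pos < len(self.source) else ""

    def match(self, ch):
        if self.look != ch:
            raise SyntaxError(f"'{ch}' expected at position {self.pos}")
        self.pos += 1

    def emit(self, line):
        self.code.append(line)

    def factor(self):
        # <factor> ::= <digit> | '(' <expression> ')'
        if self.look == "(":
            self.match("(")
            self.expression()
            self.match(")")
        elif self.look.isdigit():
            self.emit(f"MOVE #{self.look},D0")
            self.pos += 1
        else:
            raise SyntaxError(f"factor expected at position {self.pos}")

    def term(self):
        # <term> ::= <factor> { '*' <factor> }
        self.factor()
        while self.look == "*":
            self.emit("MOVE D0,-(SP)")
            self.match("*")
            self.factor()
            self.emit("MULS (SP)+,D0")

    def expression(self):
        # <expression> ::= <term> { ('+' | '-') <term> }
        self.term()
        while self.look in ("+", "-"):
            op = self.look
            self.emit("MOVE D0,-(SP)")
            self.match(op)
            self.term()
            if op == "+":
                self.emit("ADD (SP)+,D0")
            else:
                self.emit("SUB (SP)+,D0")
                self.emit("NEG D0")


def compile_expression(source):
    """Parse a whole string and return the list of emitted lines."""
    parser = Parser(source)
    parser.expression()
    if parser.look:
        raise SyntaxError(f"unexpected '{parser.look}' at position {parser.pos}")
    return parser.code
```

Calling compile_expression("1+2*3") returns the emitted lines as a
list. Note that the sketch simply raises an error at the first thing
it doesn't like, which is the same halt-on-first-error policy our
compiler has followed all along.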

As our compiler has taken form, it's gotten more parts,  but each
part  is  quite small and simple, and  very  much  like  all  the
others.

At this point, we have many  of  the makings of a real, practical
compiler.  As a matter of  fact,  we  already have all we need to
build a toy  compiler  for  a  language as powerful as, say, Tiny
BASIC.  In the next couple of installments, we'll  go  ahead  and
define that language.

To round out  the  series,  we  still  have a few items to cover.
These include:

   o Procedure calls, with and without parameters

   o Local and global variables

   o Basic types, such as character and integer types

   o Arrays

   o Strings

   o User-defined types and structures

   o Tree-structured parsers and intermediate languages

   o Optimization

These will all be  covered  in  future  installments.  When we're
finished, you'll have all the tools you need to design  and build
your own languages, and the compilers to translate them.

I can't  design  those  languages  for  you,  but I can make some
comments  and  recommendations.    I've  already  sprinkled  some
throughout past installments.    You've  seen,  for  example, the
control constructs I prefer.

These constructs are going  to  be part of the languages I build.
I  have  three  languages in mind at this point, two of which you
will see in installments to come:

TINY - A  minimal,  but  usable  language  on the order  of  Tiny
       BASIC or Tiny C.  It won't be very practical, but  it will
       have enough power to let you write and  run  real programs
       that do something worthwhile.

KISS - The  language  I'm  building for my  own  use.    KISS  is
       intended to be  a  systems programming language.  It won't
       have strong typing  or  fancy data structures, but it will
       support most of  the  things  I  want to do with a higher-
       order language (HOL), except perhaps writing compilers.
                              
I've also  been  toying  for  years  with  the idea of a HOL-like
assembler,  with  structured  control  constructs   and  HOL-like
assignment statements.  That, in  fact, was the impetus behind my
original foray into the jungles of compiler theory.  This one may
never be built, simply  because  I've  learned that it's actually
easier to implement a language like KISS, that only uses a subset
of the CPU instructions.    As you know, assembly language can be
bizarre  and  irregular  in the extreme, and a language that maps
one-for-one onto it can be a real challenge.  Still,  I've always
felt that the syntax used  in conventional assemblers is dumb ...
why is

     MOVE.L A,B

better, or easier to translate, than

     B=A ?

I  think  it  would  be  an  interesting  exercise to  develop  a
"compiler" that  would give the programmer complete access to and
control over the full complement  of the CPU instruction set, and
would allow you to generate  programs  as  efficient  as assembly
language, without the pain  of  learning a set of mnemonics.  Can
it be done?  I don't  know.  The  real question may be, "Will the
resulting language be any  easier  to  write  than assembly?"  If
not, there's no point in it.  I think that it  can  be  done, but
I'm not completely sure yet how the syntax should look.

Perhaps you have some  comments  or suggestions on this one.  I'd
love to hear them.

You probably won't be surprised to learn that I've already worked
ahead in most  of the areas that we will cover.  I have some good
news:  Things  never  get  much  harder than they've been so far.
It's  possible  to  build a complete, working compiler for a real
language, using nothing  but  the same kinds of techniques you've
learned so far.  And THAT brings up some interesting questions.


WHY IS IT SO SIMPLE?

Before embarking  on this series, I always thought that compilers
were just naturally complex computer  programs  ...  the ultimate
challenge.  Yet the things we have done here have  usually turned
out to be quite simple, sometimes even trivial.

For a while, I thought it was simply because I hadn't yet gotten
into the meat  of  the  subject.    I had only covered the simple
parts.  I will freely admit  to  you  that, even when I began the
series,  I  wasn't  sure how far we would be able  to  go  before
things got too complex to deal with in the ways  we  have so far.
But at this point I've already  been  down the road far enough to
see the end of it.  Guess what?
                              

                     THERE ARE NO HARD PARTS!


Then, I thought maybe it was because we were not  generating very
good object  code.    Those  of  you  who have been following the
series and trying sample compiles know that, while the code works
and  is  rather  foolproof,  its  efficiency is pretty awful.   I
figured that if we were  concentrating on turning out tight code,
we would soon find all that missing complexity.

To  some  extent,  that one is true.  In particular, my first few
efforts at trying to improve efficiency introduced  complexity at
an alarming rate.  But since then I've been tinkering around with
some simple optimizations and I've found some that result in very
respectable code quality, WITHOUT adding a lot of complexity.

Finally, I thought that  perhaps  the  saving  grace was the "toy
compiler" nature of the study.   I  have made no pretense that we
were  ever  going  to be able to build a compiler to compete with
Borland and Microsoft.  And yet, again, as I get deeper into this
thing the differences are starting to fade away.

Just  to make sure you get the message here, let me state it flat
out:

   USING THE TECHNIQUES WE'VE USED  HERE,  IT  IS  POSSIBLE TO
   BUILD A PRODUCTION-QUALITY, WORKING COMPILER WITHOUT ADDING
   A LOT OF COMPLEXITY TO WHAT WE'VE ALREADY DONE.


Since  the series began I've received  some  comments  from  you.
Most of them echo my own thoughts:  "This is easy!    Why  do the
textbooks make it seem so hard?"  Good question.

Recently, I've gone back and looked at some of those texts again,
and even bought and read some new ones.  Each  time,  I come away
with the same feeling: These guys have made it seem too hard.

What's going on here?  Why does the whole thing seem difficult in
the texts, but easy to us?    Are  we that much smarter than Aho,
Ullman, Brinch Hansen, and all the rest?

Hardly.  But we  are  doing some things differently, and more and
more  I'm  starting  to appreciate the value of our approach, and
the way that  it  simplifies  things.    Aside  from  the obvious
shortcuts that I outlined in Part I, like single-character tokens
and console I/O, we have  made some implicit assumptions and done
some things differently from those who have designed compilers in
the past. As it turns out, our approach makes life a lot easier.

So why didn't all those other guys use it?

You have to remember the context of some of the  earlier compiler
development.  These people were working with very small computers
of  limited  capacity.      Memory  was  very  limited,  the  CPU
instruction  set  was  minimal, and programs ran  in  batch  mode
rather  than  interactively.   As it turns out, these caused some
key design decisions that have  really  complicated  the designs.
Until recently,  I hadn't realized how much of classical compiler
design was driven by the available hardware.

Even in cases where these  limitations  no  longer  apply, people
have  tended  to  structure their programs in the same way, since
that is the way they were taught to do it.

In  our case, we have started with a blank sheet of paper.  There
is a danger there, of course,  that  you will end up falling into
traps that other people have long since learned to avoid.  But it
also has allowed us to  take different approaches that, partly by
design  and partly by pure dumb luck, have  allowed  us  to  gain
simplicity.

Here are the areas that I think have  led  to  complexity  in the
past:

  o  Limited RAM Forcing Multiple Passes

     I  just  read  "Brinch  Hansen  on  Pascal   Compilers"  (an
     excellent book, BTW).  He  developed a Pascal compiler for a
     PC, but he started the effort in 1981 with a 64K system, and
     so almost every design decision  he made was aimed at making
     the compiler fit  into  RAM.    To do this, his compiler has
     three passes, one of which is the lexical scanner.  There is
     no way he could, for  example, use the distributed scanner I
     introduced  in  the last installment,  because  the  program
     structure wouldn't allow it.  He also required  not  one but
     two intermediate  languages,  to  provide  the communication
     between phases.

     All the early compiler writers  had to deal with this issue:
     Break the compiler up into enough parts so that it  will fit
     in memory.  When  you  have multiple passes, you need to add
     data structures to support the  information  that  each pass
     leaves behind for the next.   That adds complexity, and ends
     up driving the  design.    Lee's  book,  "The  Anatomy  of a
     Compiler,"  mentions a FORTRAN compiler developed for an IBM
     1401.  It had no fewer than 63 separate passes!  Needless to
     say,  in a compiler like this  the  separation  into  phases
     would dominate the design.

     Even in  situations  where  RAM  is  plentiful,  people have
     tended  to  use  the same techniques because  that  is  what
     they're familiar with.   It  wasn't  until Turbo Pascal came
     along that we found how simple a compiler could  be  if  you
     started with different assumptions.


  o  Batch Processing
                              
     In the early days, batch  processing was the only choice ...
     there was no interactive computing.   Even  today, compilers
     run in essentially batch mode.

     In a mainframe compiler as  well  as  many  micro compilers,
     considerable effort is expended on error recovery ... it can
     consume as much as 30-40%  of  the  compiler  and completely
     drive the design.  The idea is to avoid halting on the first
     error, but rather to keep going at all costs,  so  that  you
     can  tell  the  programmer about as many errors in the whole
     program as possible.

     All of that harks back to the days of the  early mainframes,
     where turnaround time was measured  in hours or days, and it
     was important to squeeze every last ounce of information out
     of each run.

     In this series, I've been very careful to avoid the issue of
     error recovery, and instead our compiler  simply  halts with
     an error message on  the  first error.  I will frankly admit
     that it was mostly because I wanted to take the easy way out
     and keep things simple.   But  this  approach,  pioneered by
     Borland in Turbo Pascal, has a lot going for it anyway.
     Aside from keeping the  compiler  simple,  it also fits very
     well  with   the  idea  of  an  interactive  system.    When
     compilation is  fast, and especially when you have an editor
     such as Borland's that  will  take you right to the point of
     the error, then it makes a  lot  of sense to stop there, and
     just restart the compilation after the error is fixed.


  o  Large Programs

     Early compilers were designed to handle  large  programs ...
     essentially infinite ones.    In those days there was little
     choice;  the  ideas  of  subroutine  libraries  and separate
     compilation  were  still  in  the  future.      Again,  this
     assumption led to  multi-pass designs and intermediate files
     to hold the results of partial processing.

     Brinch Hansen's  stated goal was that the compiler should be
     able to compile itself.   Again, because of his limited RAM,
     this drove him to a multi-pass design.  He needed  as little
     resident compiler code as possible,  so  that  the necessary
     tables and other data structures would fit into RAM.

     I haven't stated this one yet, because there  hasn't  been a
     need  ... we've always just read and  written  the  data  as
     streams, anyway.  But  for  the  record,  my plan has always
     been that, in  a  production compiler, the source and object
     data should all coexist  in  RAM with the compiler, a la the
     early Turbo Pascals.  That's why I've been  careful  to keep
     routines like GetChar  and  Emit  as  separate  routines, in
     spite of their small size.   It  will be easy to change them
     to read from and write to memory.
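
     As a rough sketch of why that separation pays off (in Python
     rather than Pascal): if the parser touches its input and output
     only through two small routines, then moving from streams to
     memory means swapping those two routines and nothing else. The
     class names StreamIO and MemoryIO and the toy load_digits
     routine are hypothetical, modeled only loosely on GetChar and
     Emit.

```python
import io


class StreamIO:
    """Reads source from one file-like object, writes code to another."""

    def __init__(self, infile, outfile):
        self.infile = infile
        self.outfile = outfile

    def get_char(self):
        return self.infile.read(1)       # "" at end of input

    def emit(self, line):
        self.outfile.write("\t" + line + "\n")


class MemoryIO:
    """Keeps the whole source and all of the object code in RAM."""

    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.output = []

    def get_char(self):
        if self.pos >= len(self.source):
            return ""                    # "" at end of input
        ch = self.source[self.pos]
        self.pos += 1
        return ch

    def emit(self, line):
        self.output.append("\t" + line)


def load_digits(io_layer):
    """A toy 'compiler': emit a load instruction for every digit read.

    It never asks where characters come from or where code goes.
    """
    ch = io_layer.get_char()
    while ch:
        if ch.isdigit():
            io_layer.emit(f"MOVE #{ch},D0")
        ch = io_layer.get_char()
```

     The same load_digits routine runs unchanged against either
     class, which is the whole point of keeping the two I/O
     routines separate.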


  o  Emphasis on Efficiency

     John  Backus has stated that, when  he  and  his  colleagues
     developed the original FORTRAN compiler, they KNEW that they
     had to make it produce tight code.  In those days, there was
     a strong sentiment against HOLs  and  in  favor  of assembly
     language, and  efficiency was the reason.  If FORTRAN didn't
     produce very good  code  by  assembly  standards,  the users
     would simply refuse to use it.  For the record, that FORTRAN
     compiler turned out to  be  one  of  the most efficient ever
     built, in terms of code quality.  But it WAS complex!

     Today,  we have CPU power and RAM size  to  spare,  so  code
     efficiency is not  so  much  of  an  issue.    By studiously
     ignoring this issue, we  have  indeed  been  able to Keep It
     Simple.    Ironically,  though, as I have said, I have found
     some optimizations that we can  add  to  the  basic compiler
     structure, without having to add a lot of complexity.  So in
     this  case we get to have our cake and eat it too:  we  will
     end up with reasonable code quality, anyway.


  o  Limited Instruction Sets

     The early computers had primitive instruction sets.   Things
     that  we  take  for granted, such as  stack  operations  and
     indirect addressing, came only with great difficulty.

     Example: In most compiler designs, there is a data structure
     called the literal pool.  The compiler  typically identifies
     all literals used in the program, and collects  them  into a
     single data structure.    All references to the literals are
     done  indirectly  to  this  pool.    At  the   end   of  the
     compilation, the  compiler  issues  commands  to  set  aside
     storage and initialize the literal pool.

     We haven't had to address that  issue  at all.  When we want
     to load a literal, we just do it, in line, as in

          MOVE #3,D0

     There is something to be said for the use of a literal pool,
     particularly on a machine like  the 8086 where data and code
     can  be separated.  Still, the whole  thing  adds  a  fairly
     large amount of complexity with little in return.

     Of course, without the stack we would be lost.  In  a micro,
     both  subroutine calls and temporary storage depend  heavily
     on the stack, and  we  have used it even more than necessary
     to ease expression parsing.
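
     The difference between the two ways of loading a literal can
     be sketched in a few lines of Python. The label scheme (LIT0,
     LIT1, ...) and the DC.W directive used to reserve the pool are
     assumptions chosen for illustration, not the output of any
     particular compiler.

```python
def load_inline(values):
    """Load each literal with an immediate operand, in line."""
    return [f"MOVE #{v},D0" for v in values]


def load_pooled(values):
    """Load each literal indirectly, through a pool emitted at the end."""
    pool = {}                            # literal value -> label
    code = []
    for v in values:
        if v not in pool:
            pool[v] = f"LIT{len(pool)}"  # first use: assign a new label
        code.append(f"MOVE {pool[v]},D0")
    # Storage for the pool is set aside only after all code is generated.
    for v, label in pool.items():
        code.append(f"{label}: DC.W {v}")
    return code
```

     For the literals 3, 7, 3 the inline version produces three
     MOVE instructions and is done. The pooled version needs a
     table, a labeling scheme, and a deferred pass to emit the
     storage, which is the extra bookkeeping we have been able to
     skip.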


  o  Desire for Generality

     Much of the content of the typical compiler text is taken up
     with issues we haven't addressed here at all ... things like
     automated  translation  of  grammars,  or generation of LALR
     parse tables.  This is not simply because  the  authors want
     to impress you.  There are good, practical  reasons  why the
     subjects are there.

     We have been concentrating on the use of a recursive-descent
     parser to parse a  deterministic  grammar,  i.e.,  a grammar
     that is not ambiguous and, moreover, can be parsed  with one
     level of lookahead.  I haven't made much of this limitation,
     but  the  fact  is  that  this represents a small subset  of
     possible grammars.  In fact,  there is an infinite number of
     grammars that we can't parse using our techniques.    The LR
     technique is a more powerful one, and can deal with grammars
     that we can't.

     In compiler theory, it's important  to know how to deal with
     these  other  grammars,  and  how  to  transform  them  into
     grammars  that  are  easier to deal with.  For example, many
     (but not all) ambiguous  grammars  can  be  transformed into
     unambiguous ones.  The way to do this is not always obvious,
     though, and so many people have devoted years to developing
     ways to transform them automatically.

     In practice, these  issues  turn out to be considerably less
     important.  Modern languages tend  to be designed to be easy
     to parse, anyway.   That  was a key motivation in the design
     of Pascal.   Sure,  there are pathological grammars that you
     would be hard pressed to write unambiguous BNF  for,  but in
     the  real  world  the best answer is probably to avoid those
     grammars!

     In  our  case,  of course, we have sneakily let the language
     evolve  as  we  go, so we haven't painted ourselves into any
     corners here.  You may not always have that luxury.   Still,
     with a little  care  you  should  be able to keep the parser
     simple without having to resort to automatic  translation of
     the grammar.
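
     Here is a small illustration of the kind of hand transformation
     involved, sketched in Python. The grammar is a made-up example.
     Written as

          <expr> ::= <expr> '-' <expr> | <digit>

     the rule is ambiguous: 9-3-2 could mean (9-3)-2 or 9-(3-2).
     Rewritten as

          <expr> ::= <digit> { '-' <digit> }

     it is unambiguous, groups from the left, and needs only one
     character of lookahead. The function name parse_subtraction is
     hypothetical, and the sketch evaluates the expression rather
     than generating code, just to make the grouping visible.

```python
def parse_subtraction(source):
    """Parse <expr> ::= <digit> { '-' <digit> } and return its value."""
    pos = 0

    def look():
        # One character of lookahead; "" means end of input.
        return source[pos] if pos < len(source) else ""

    def digit():
        nonlocal pos
        ch = look()
        if not ch.isdigit():
            raise SyntaxError(f"digit expected at position {pos}")
        pos += 1
        return int(ch)

    value = digit()
    while look() == "-":       # the lookahead alone decides whether to loop
        pos += 1
        value -= digit()       # subtract as we go: left-to-right grouping
    if look():
        raise SyntaxError(f"unexpected '{look()}' at position {pos}")
    return value
```

     With this version, parse_subtraction("9-3-2") yields 4, the
     left-grouped reading, and there is never a point at which the
     parser has to guess.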


We have taken  a  vastly  different  approach in this series.  We
started with a clean sheet  of  paper,  and  developed techniques
that work in the context that  we  are in; that is, a single-user
PC  with  rather  ample CPU power and RAM space.  We have limited
ourselves to reasonable grammars that  are easy to parse, we have
used the instruction set of the CPU to advantage, and we have not
concerned ourselves with efficiency.  THAT's why it's been easy.

Does this mean that we are forever doomed  to  be  able  to build
only toy compilers?   No, I don't think so.  As I've said, we can
add  certain   optimizations   without   changing   the  compiler
structure.  If we want to process large files, we can  always add
file  buffering  to do that.  These  things  do  not  affect  the
overall program design.

And I think  that's  a  key  factor.   By starting with small and
limited  cases,  we  have been able to concentrate on a structure
for  the  compiler  that is natural  for  the  job.    Since  the
structure naturally fits the job, it is almost bound to be simple
and transparent.   Adding  capability doesn't have to change that
basic  structure.    We  can  simply expand things like the  file
structure or add an optimization layer.  I guess  my  feeling  is
that, back when resources were tight, the structures people ended
up  with  were  artificially warped to make them work under those
conditions, and weren't optimum  structures  for  the  problem at
hand.


CONCLUSION

Anyway, that's my arm-waving  guess  as to how we've been able to
keep things simple.  We started with something simple and  let it
evolve  naturally,  without  trying  to   force   it   into  some
traditional mold.

We're going to  press on with this.  I've given you a list of the
areas  we'll  be  covering in future installments.    With  those
installments, you  should  be  able  to  build  complete, working
compilers for just about any occasion, and build them simply.  If
you REALLY want to build production-quality compilers,  you'll be
able to do that, too.

For those of you who are champing at the bit for more parser code,
I apologize for this digression.  I just thought  you'd  like  to
have things put  into  perspective  a  bit.  Next time, we'll get
back to the mainstream of the tutorial.

So far, we've only looked at pieces of compilers,  and  while  we
have  many  of  the  makings  of a complete language, we  haven't
talked about how to put  it  all  together.    That  will  be the
subject of our next  two  installments.  Then we'll press on into
the new subjects I listed at the beginning of this installment.

See you then.
