1. Design notes
In the late 1950s and early 1960s, compilers for Algol-derived languages had a phase before syntax analysis called 'line reconstruction'. Stropped keywords were reduced to lower case and variables to upper case; all spaces except those inside strings were removed; and in line-terminated languages, any statement broken over more than one line (usually with some sort of continuation character at the end of the broken line) was joined back into a single line.
For example the Algol statement:
'IF' a < b 'THEN' a := b
would be reduced to:
ifA<BthenA:=B
It was this line-reconstruction process which allowed these languages to do without a lexer or tokenizer: the top-down recursive-descent table-driven parsers would recognise both the grammar structure and the actual keywords, one character at a time.
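The transformation described above can be sketched in a few lines. This is a minimal, hypothetical reconstruction pass, assuming the conventions from the example: keywords are stropped with single quotes, identifiers are single lower-case letters, and string literals (if any) use double quotes. The real compilers of the era worked rather differently, as the parenthetical note below explains.

```python
def reconstruct(line: str) -> str:
    """Toy line reconstruction: unstrop keywords to lower case,
    raise variables to upper case, drop spaces outside strings."""
    out = []
    i = 0
    while i < len(line):
        c = line[i]
        if c == "'":                        # stropped keyword -> lower case
            j = line.index("'", i + 1)      # assumes the strop is closed
            out.append(line[i + 1:j].lower())
            i = j + 1
        elif c == '"':                      # string literal: copy verbatim
            j = line.index('"', i + 1)
            out.append(line[i:j + 1])
            i = j + 1
        elif c == ' ':                      # drop spaces outside strings
            i += 1
        elif c.isalpha():                   # variable -> upper case
            out.append(c.upper())
            i += 1
        else:                               # operators, digits, punctuation
            out.append(c)
            i += 1
    return ''.join(out)

print(reconstruct("'IF' a < b 'THEN' a := b"))  # ifA<BthenA:=B
```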
(Just for the record, the compilers I've used didn't actually save keywords as lower-case; in fact they set the top-bit as an indicator, since the character set was 7-bit ISO. This allowed mixed-case significance for identifiers, if needed.)
One of the advantages of line reconstruction on a stropped language is that the grammar need not mention spaces between tokens, while still avoiding the need for a tokenizer. Unfortunately we can't take advantage of that, so our grammar has to allow spaces between tokens explicitly. The simplest way to guarantee that a space is allowed at every point in the source is to insert them in the grammar only before literals. In fact, a smarter parser generator could do that for us transparently.
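To make the "spaces only before literals" idea concrete, here is a hypothetical sketch of a tiny recursive-descent matcher in which the single routine that matches literal text first skips any optional blanks, so no other rule ever has to mention spaces. The class and method names are mine, not the compiler's.

```python
class Src:
    """Minimal source cursor; literal() is the one place spaces are eaten."""
    def __init__(self, text: str):
        self.text, self.pos = text, 0

    def literal(self, s: str) -> bool:
        # Optional spaces are allowed before every literal, and only here.
        while self.pos < len(self.text) and self.text[self.pos] == ' ':
            self.pos += 1
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

src = Src('if  a then')
print(src.literal('if'), src.literal('a'), src.literal('then'))  # True True True
```

Because every token in the grammar is ultimately matched through `literal()`, a space is implicitly permitted at every token boundary without the grammar saying so anywhere else.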
Also, since this Tiny language is of the C-/Pascal-style tokenized variety, with reserved keywords, we cannot use a classic line-reconstruction technique; yet in order to use a top-down recursive-descent (TDRD) parser, we need some sort of line reconstruction.
So what I have chosen to do in this compiler is to perform the line reconstruction in the pre-processor, and to remove extraneous spaces so that where the grammar expects a space, we never need to parse more than one. In order to preserve accurate line numbers for diagnostics, lines which are removed from the source because of preprocessor conditions are output as blank lines, and blank lines will be explicitly skipped in the grammar.
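A sketch of that pre-processor behaviour, under the two rules just stated: lines suppressed by a preprocessor condition become blank lines (so later diagnostics keep accurate line numbers), and runs of spaces collapse to one. Here `condition_active` is a hypothetical stand-in for the real conditional-evaluation logic.

```python
import re

def preprocess(lines, condition_active):
    """condition_active(lineno) says whether this line survives
    the preprocessor conditions (illustrative placeholder)."""
    out = []
    for lineno, line in enumerate(lines, start=1):
        if not condition_active(lineno):
            out.append('')                        # removed line -> blank line
        else:
            out.append(re.sub(' +', ' ', line.rstrip()))
    return out

src = ["x  :=  1", "debug_only :=   2", "y := x"]
print(preprocess(src, lambda n: n != 2))
# ['x := 1', '', 'y := x']
```

Line 2 is gone but still occupies a line, so an error reported later on line 3 really does point at `y := x` in the original source.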
I have a pre-processor from an earlier project which I can adapt to use here. You can read up on its cool use of this OperatorPrecedenceHack.
The next trick, which may be new to some readers, is that we do not encode the precedence in the grammar. Nor do we distinguish between boolean expressions and integer expressions at parse time. We simply parse anything that could remotely look like an expression, and if parts of it are not appropriate in the current context, an error can be generated by examining the annotated parse tree. Not only does this mean easier parsing with less backtracking, it also allows for more meaningful error messages.
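The parse-first, check-later idea can be illustrated with a toy tree walk. Everything here is hypothetical: the tuple-shaped AST, the operator sets, and the type rules are placeholders for whatever the real compiler annotates its parse tree with. The point is that the type complaint comes from inspecting the tree, not from the parser rejecting the input.

```python
# AST nodes are (op, left, right) tuples; leaves are ('int', n).
def check(node, want):
    """Return an error message if `node` doesn't fit context `want`
    ('bool' or 'int'), else None."""
    op = node[0]
    if op in ('<', '='):                     # comparisons yield booleans
        if want != 'bool':
            return f"comparison '{op}' used where an integer is expected"
        return check(node[1], 'int') or check(node[2], 'int')
    if op in ('+', '*'):                     # arithmetic yields integers
        if want != 'int':
            return f"operator '{op}' used where a boolean is expected"
        return check(node[1], 'int') or check(node[2], 'int')
    if op == 'int':
        return None if want == 'int' else "integer used where a boolean is expected"
    return None

# 'if (a + b) ...' parses fine; the checker flags it afterwards:
tree = ('+', ('int', 1), ('int', 2))
print(check(tree, 'bool'))  # operator '+' used where a boolean is expected
```

Because the whole expression was parsed successfully, the message can name the exact operator that is wrong, rather than reporting a generic syntax error at whatever character the parser happened to choke on.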
Another closely related trick is that procedure calls, assignments, and array accesses all start with the same parsing structure, and the determination of what we are looking at is made after parsing.
I may make the '=' of assignment an operator and not have a specific grammar entry for assignment - just for <expr> with a check that there's a '=' at the top of the tree.
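A sketch of that idea: parse a statement as a plain expression, with assignment as just another (lowest-precedence) operator, then classify the result by looking at the root of the tree. The mini precedence-climbing parser, the precedence values, and the function names below are all illustrative scaffolding, not the compiler's real grammar.

```python
import re

PREC = {'=': 1, '<': 2, '+': 3}             # assignment binds loosest

def parse(tokens, min_prec=1):
    """Toy precedence climbing over a flat token list."""
    left = tokens.pop(0)                    # operand: identifier or number
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PREC[op] + 1)
        left = (op, left, right)
    return left

def classify(tree):
    # The statement is an assignment iff '=' ended up at the top of the tree.
    return 'assignment' if isinstance(tree, tuple) and tree[0] == '=' else 'expression'

stmt = parse(re.split(r'\s+', 'a = b + 1'))
print(classify(stmt))  # assignment
```

Note that this toy version makes `=` left-associative; a real compiler would want it right-associative, but that does not affect the top-of-tree check.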
In fact I've been toying with an idea that Rainer and I discussed a couple of years ago, which is that *everything* is treated as an operator, and we use the same operator-precedence algorithm to restructure the AST tree for things like postfix array accesses as we do for arithmetic and boolean operators.
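A speculative sketch of that "everything is an operator" idea: give '[' a very high precedence and let the same precedence-climbing loop that folds '+' and '<' also fold 'a[i]' into an index node. The token spellings, precedence values, and node names are mine, purely for illustration.

```python
PREC = {'<': 1, '+': 2, '[': 9}             # '[' binds tightest, like a postfix op

def parse(tokens, min_prec=1):
    left = tokens.pop(0)                       # operand
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        if op == '[':                          # postfix: name [ expr ]
            inner = parse(tokens, 1)           # subscript resets precedence
            assert tokens.pop(0) == ']'        # consume the closing bracket
            left = ('index', left, inner)
        else:
            right = parse(tokens, PREC[op] + 1)
            left = (op, left, right)
    return left

print(parse(['a', '[', 'i', '+', '1', ']', '<', 'n']))
# ('<', ('index', 'a', ('+', 'i', '1')), 'n')
```

The appeal is uniformity: one restructuring loop handles subscripts, arithmetic, and comparisons alike, instead of array access needing its own grammar production.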