This parser actually builds and uses multiple different trees. The grammar for a language is described in the .g grammar file.

The first tree built is entirely internal: uparse.c initially creates an analysis record in an array named CST[]. The user never sees this tree and it is thrown away as soon as the user-level concrete syntax tree is built. The concrete syntax tree is built from the CST[] array using G_mktuple() and is a surface-level tree whose data is only the literal text of the source file, described using text descriptors - pairs of indexes marking the start and end of each string in the source text. We refer to the tree built using G_* symbols as the 'CST', and I apologise for the confusion anyone reading the source code may find in the fact that the array this tree is built *from* is also named CST[]!

The .g grammar file can optionally contain C code associated with each parser rule, which is executed after a successful parse, as the parser (using the generated code in lang-ast.h) converts the CST[] array to the concrete version of the AST using G_mktuple() calls. This tree is what the user's code will walk when they wish to perform actions based on the concrete syntax tree. For many problems that require a grammar - a source code formatter, for example - this tree may be all that is needed.

Next, a default tree-builder is supplied as a template, which converts this tree (whose entries are indexed by G_* constants) to a default AST, in a 1:1 mapping of the G_* rules. This new tree is built using P_mktuple() and its tuples are indexed by P_* constants. (This tree is sometimes referred to as the CAST rather than the AST, because it is an AST that contains only the concrete syntax.)
This phase doesn't have to do much processing, but some surface-level processing is usually wanted here, such as converting numbers from ASCII text to an internal binary value, or standardising the case of variable names if the language is not case-sensitive. These are actions that the user has to perform, and they will usually be carried out by the C code that is embedded in the grammar for each rule.

Although some applications may find working with this P_* tree more convenient than working with the G_* tree, in most cases the two trees are so similar that the G_* tree is sufficient. The reason to take this extra step is when you want to take advantage of the C code embedded in the grammar. You can modify the tuples that each rule returns so that instead of returning the default P_* tuple, which simply reflects the source syntax, you substitute an AST_* tuple of your own devising that contains data more appropriate to the internals of the language you are parsing. For example, in a programming language, rather than returning a tuple that corresponds to the declaration "int i, j, k;" as a single declaration tuple whose only field is a linked list of the variable names, you might prefer to return a sequence of three separate declaration tuples, one for each variable, as if the source code had contained "int i; int j; int k;". It is up to the user to define the AST_* constants that index these user-defined tuples.

If all parser rules return application-specific tuples in this fashion, then the resulting tree is what would usually be considered the AST in most parsers. At this point it's more appropriate to talk about the 'AST' rather than the 'CAST', as it is now a properly abstract syntax tree.
HOWEVER, a significant advantage of the uparse system is that the actual tree nodes returned by G_mktuple and P_mktuple, or a user-defined AST_mktuple, are all stored in the same data structure and can be identified at run-time simply by examining the operator type field of the tuple (the ranges of the G_*, P_*, and AST_* constants are arranged to be non-overlapping).

(The name of the array in the parser source files containing these trees is AST[]. There are only two C-style arrays in the code: CST[] and AST[]. The subtleties of nomenclature that caused the confusion mentioned above didn't arise until late in the development of the code. In the documentation, it's best to ignore the names of the arrays in the C source and just talk about the functionality of the CST/CAST/AST, which are all stored in the AST[] array - or, less ambiguously, about G_*, P_*, and AST_* tuples.)

This allows you to mix concrete syntax (G_* tuples, whose contents are just text and have no semantic tagging) with P_* tuples (which reflect the syntax of the language being parsed) and AST_* tuples (which reflect the semantics of the language being parsed). This mixing of levels allows for incremental development of your language processor, and often run-time optimisation as well: for example in C, if you have a statement "if (constant condition that is always false) { ... }", you could avoid the G_* to P_* to AST_* processing for everything in the "{ ... }" block, as it would never be executed. This mixed-level tree structure allows the user to write a utility that walks the returned P_* tuple tree and incrementally converts nodes to an application-specific AST, accessed via user-created AST_* symbols.
A multi-pass compiler might optimise the translated code in this manner, with the intermediate stages holding tuples that represent, for example, LLVM internal code, and the final rewritten tree nodes containing tuples that correspond to the actual binary machine instructions of the target architecture.

-------

There are some implementation subtleties that do have to be explained for people working with this code. They're well enough commented in the uparse sources, but I think I need to mention some of them here so that new users don't have to read all the source code before they can use the parser. Read this after you've read the main project README, and maybe after you've tried writing an initial small test grammar.

By the way, your initial test should be a grammar file that contains only rules, not C code - building this with the default options will create a syntax checker for your language. I strongly recommend getting that to work fully before you start adding C code to the grammar to actually do something useful with the files you are parsing. Your first test with added C code might be to simply re-output your input file but with re-created indentation. *Then* you can go on to making a proper application with it.

Most arrays in the parser source that can hold arbitrary amounts of data are 'flex' arrays, i.e. they grow as needed. To enable this, I had to take some liberties with the source code conventions. To write to an element of a flex array, you need to use a form of the macro that is specific to the LHS of an assignment, e.g.:

    _AST(new_index) = expression;

whereas accessing an existing element is simpler:

    variable = AST(expression);

The _xxx(idx) form expands the array if the index is outside the current bounds (filling any gaps with a pattern that can be used to detect access to an unassigned variable), but the unadorned xxx(idx) form issues an error message about the index being outside the current array bounds.
This error detection can be extremely helpful! Don't turn it off unless you have to. The arrays may also be defined to allow arbitrary lower bounds; they don't have to be based at 0. Note that these arrays may be relocated in store when they are extended. Because of this, it is important that you don't store pointers in the array - use an index value instead. This restriction has never caused any difficulty, and is often actually a better practice than storing a true pointer would have been.

(Aside: although I come from a programming background where arrays were written as xxx(idx) rather than xxx[idx], I do appreciate that having to follow this convention may be a bitter pill for C programmers to swallow. So I have built a variation of the cpp preprocessor which allows the use of [] brackets in parameters instead of () ones, for example:

    cpp:
        #define xxx(dimen,idx) xxx[RANGECHECK(dimen,0,MAX_DIMEN)][RANGECHECK(idx,-10,42)+10]
        var = xxx(0,-4);

    gtcpp:
        $define xxx[dimen][idx] xxx[RANGECHECK(dimen,0,MAX_DIMEN)][RANGECHECK(idx,-10,42)+10]
        var = xxx[0][-4];

This preprocessor is not yet used in the uparse system.)

The AST array/tree is an array of plain integers! It is *not* an array of structs. Although superficially you might expect a little loss of consistency checking when accessing a tree node, with the risk of accessing an incorrect type of node, it is easy enough to add internal consistency checking to application code that this too has never been a problem - and the simplicity it adds to the code is a major win.

-----------

Parsing: this is all explained elsewhere, so a Cliff Notes summary only. The C code embedded in the parser is only executed after a successful parse. This makes things like making parsing decisions on the fly rather difficult - for example, C typedef statements effectively change the syntax of the language: some constructs can be parsed two ways depending on whether a particular token is a variable name or a typedef name.
To work around these problems (which are thankfully rather few) there is a mechanism in the parser which allows C code to be executed *while parsing*. This is a tricky construct to get right and best avoided. It's explained more fully in the main README.

Escaping of literal strings in the parser's .g file is tricky, as the escaping applies to what the parser generator reads. So double-escaping is sometimes necessary if the parsed string ends up embedded in some other piece of C code. You may need a bit of trial and error to get those right.

The grammar file describes a two-level parser: it handles lex-style token parsing using regular expressions. The regular expressions are enclosed in French-style chevrons (« »). This was chosen to avoid having to add yet another level of quote escaping, but it does make the .g files rather hard to edit. Sorry. Also, exactly what the regular expressions accept is not 100% clear - there is a little ambiguity in how line ends are handled. Best to crib from my examples rather than try to work out what the heck is going on. Note that the regular expression code has been modified to fully support Unicode.

The parser was originally designed to be able to parse files connected in virtual memory, without having to read them into RAM first. This was the primary reason that tokens returned by the parser consist of a pair of indexes, to the start (inclusive) and end (exclusive) of the token within that file. This was great until I tried to parse Algol 60, a language in which keywords are stropped by surrounding them with 'quote' characters. The stropping details mean that 'GO TO', 'GOTO', and 'GO' 'TO' are all considered the same. (Incidentally, because spaces are ignored in languages like that, these would also be allowed: 'G' 'O' 'T' 'O', 'G' 'OTO', etc.)
This is only a problem for modern parsing systems like the combination of yacc+lex: older parsers took it in their stride, because they performed a pre-processing phase called 'line reconstruction' on their input, which would remove comments and spaces, and reduce stropped keywords to a canonical representation (usually the upper-case text in 7-bit ASCII, with the top bit set...).

To accommodate this in uparse, the source code is no longer accessible as a memory-mapped file - it is first read into an actual array in RAM, with any pre-processing needed being done by user-supplied code embedded in the .g file as an initial block before the grammar rules. This has a knock-on effect on the regular expression lexer, which can no longer assume that the original input is ready to be accessed. Due to subtleties in mixed-language parsing, for example, this line-reconstruction pre-processing can actually change in mid-parse depending on context! So the lexer is not passed the data in advance - it has to request the data piece by piece as it lexes, via a callback to the parser, which in turn calls back to the line reconstruction code. Boy, is it complicated to follow :-( Anyway, it's in there, it's not well documented, and most people won't ever need it - but you had better not muck about with it unless you fully understand it, or something critical will break.

The regular expression lexer, incidentally, optimises lexing by pre-compiling the regular expressions. It does not go as far as memo-izing *calls* to the regular expression code, but could do so if efficiency considerations warranted it. (The key for memo-izing would be the combination of the regex and the offset into the source array, coupled with the multiple results of the regular expression call.)

The mechanism of this parser (as explained in the main document) is to use a set of ordered choices in a left-recursive top-down recursive-descent parse.
For compactness and efficiency the parser is table-driven, but the parsing algorithm would work just as well from a parser generator which creates procedures to parse each rule. (I have an older parser generator called 'tacc' which does exactly that. Table-driven is better.)

A hint: several of my own projects simply manipulate their input language, whether as a formatter or a source-to-source translator. Projects like this have to parse a language, but they also have to handle - and by that I mean retain, and possibly re-output in a modified form - white space and comments, which regular compilers tend to throw away. I've found that the simplest way to handle these extra items is to make the tokens that the parser finds include both the actual token and any preceding white space. For most applications, three pointers will do the trick: the first is an index to the start of the preceding white space, the second is an index to the first character of the actual token, and the third is the index of the byte following the final character of the actual token (because range pointers should always be lower-bound inclusive, upper-bound exclusive - trust me on this even if it's not obvious to you why at this point).

However, there are some languages where you are allowed white space - and even comments - *inside* a single token. For example, Algol 60 allows spaces in variable names, and Imp77 allows comments anywhere between {} brackets; these are entirely removed before the compiler ever sees the token, as in:

    Algol60:  This is a variable := 1;
    Imp77:    This is a variable {but with an embedded comment!} too = 1

If you find yourself in this situation, you'll need to modify the token code to record A) the whitespace and comments before the token; B) the token including the white space and comments inside it; and most importantly, C) the canonicalised token itself without whitespace or embedded comments; and maybe even D) the preferred representation of the token, with the spaces and capitalisation that should be used when generating source that references that token. (E.g. if a program refers to 'List', 'list', and 'LIST' in various places but they are all the same variable, you might want to examine all the occurrences and output only the most common form in generated output.)