This parser actually builds and uses multiple different trees. The grammar for a language is described in the .g grammar file.

The first tree built is entirely internal: uparse.c initially creates an analysis record in an array named CST[]. The user never sees this tree and it is thrown away as soon as the user-level concrete syntax tree is built. The concrete syntax tree is built from the CST[] array using G_mktuple() and is a surface-level tree whose data is only the literal text of the source file, described using text descriptors - pairs of indexes marking the start and end of each string in the source text. We refer to the tree built using G_* symbols as the 'CST', and I apologise for the confusion anyone reading the source code may find in the fact that the array this tree is built *from* is also named CST[]!

The .g grammar file can optionally contain C code associated with each parser rule, which is executed after a successful parse, as the parser (using the generated code in lang-ast.h) converts the CST[] array to the concrete version of the AST using G_mktuple() calls. This tree is what the user's code will walk when they wish to perform actions based on the concrete syntax tree. For many problems that require a grammar - a source code formatter, for example - this tree may be all that is needed.

Next, a default tree-builder is supplied as a template, which converts this tree (whose entries are indexed by G_* constants) to a default AST, in a 1:1 mapping of the G_* rules. This new tree is built using P_mktuple() and its tuples are indexed by P_* constants. (This tree is sometimes referred to as the CAST rather than the AST, because it is an AST that contains only the concrete syntax.)
This phase doesn't have to do much processing, but some surface-level processing is usually wanted here, such as converting numbers from ASCII text to an internal binary value, or standardising the case of variable names if the language is not case-sensitive. These are actions that the user has to perform, and they will usually be carried out by the C code that is embedded in the grammar for each rule.

Although some applications may find working with this P_* tree more convenient than working with the G_* tree, in most cases the two trees are so similar that the G_* tree is sufficient. The reason to take this extra step is when you want to take advantage of the C code embedded in the grammar. You can modify the tuples that each rule returns so that instead of returning the default P_* tuple, which simply reflects the source syntax, you substitute an AST_* tuple of your own devising that contains data more appropriate to the internals of the language you are parsing. For example, in a programming language, rather than returning a tuple that corresponds to the declaration "int i, j, k;" as a single declaration tuple whose only field is a linked list of the variable names, you might prefer to return a sequence of three separate declaration tuples, one for each variable, as if the source code had contained "int i; int j; int k;". It is up to the user to define the AST_* constants that index these user-defined tuples.

If all parser rules return application-specific tuples in this fashion, then the resulting tree is what would usually be considered the AST in most parsers. At this point it's more appropriate to talk about the 'AST' rather than the 'CAST', as it is now a properly abstract syntax tree.
HOWEVER, a significant advantage of the uparse system is that the actual tree nodes returned by G_mktuple and P_mktuple, or a user-defined AST_mktuple, are all stored in the same data structure and can be identified at run-time simply by examining the operator type field of the tuple (the ranges of the G_*, P_*, and AST_* constants are arranged to be non-overlapping).

(The name of the array in the parser source files containing these trees is AST[]. There are only two C-style arrays in the code: CST[] and AST[]. The subtleties of nomenclature that caused the confusion mentioned above didn't arise until late in the development of the code. In the documentation, it's best to ignore the names of the arrays in the C source and just talk about the functionality of the CST/CAST/AST, which are all stored in the AST[] array - or, less ambiguously, about G_*, P_*, and AST_* tuples.)

This allows you to mix concrete syntax (G_* tuples, whose contents are just text and have no semantic tagging) with P_* tuples (which reflect the syntax of the language being parsed) and AST_* tuples (which reflect the semantics of the language being parsed). This mixing of levels allows for incremental development of your language processor, and often run-time optimisation as well: for example in C, if you have a statement "if (constant condition that is always false) { ... }", you could avoid the G_* to P_* to AST_* processing for everything in the "{ ... }" block, as it would never be executed. This mixed-level tree structure allows the user to write a utility that walks the returned P_* tuple tree and incrementally converts nodes to an application-specific AST, accessed via user-created AST_* symbols.
A multi-pass compiler might optimise the translated code in this manner, with the intermediate stages holding tuples that represent, for example, LLVM internal code, and the final rewritten tree nodes containing tuples that correspond to the actual binary machine instructions of the target architecture.

-------

There are some implementation subtleties that do have to be explained for people working with this code. They're well enough commented in the uparse sources, but I think I need to mention some of them here so that new users don't have to read all the source code before they can use the parser. Read this after you've read the main project README, and maybe after you've tried writing an initial small test grammar.

By the way, your initial test should be a grammar file that contains only rules, not C code - building this with the default options will create a syntax checker for your language. I strongly recommend getting that to work fully before you start adding C code to the grammar to actually do something useful with the files you are parsing. Your first test with added C code might be to simply re-output your input file but with re-created indentation. *Then* you can go on to making a proper application with it.

Most arrays in the parser source that can hold arbitrary amounts of data are 'flex' arrays, i.e. they grow as needed. To enable this, I had to take some liberties with the source code conventions. To write to an element of a flex array, you need to use a form of the macro that is specific to the LHS of an assignment, e.g.:

    _AST(new_index) = expression;

whereas accessing an existing element is simpler:

    variable = AST(expression);

The _xxx(idx) form expands the array if the index is outside the current bounds (filling any gaps with a pattern that can be used to detect access to an unassigned variable), but the unadorned xxx(idx) form issues an error message about the index being outside the current array bounds.
This error detection can be extremely helpful! Don't turn it off unless you have to. The arrays may also be defined to allow arbitrary lower bounds; they don't have to be based at 0. Note that these arrays may be relocated in store when they are extended. Because of this, it is important that you don't store pointers in the array - use an index value instead. This restriction has never caused any difficulty, and is often actually a better practice than storing a true pointer would have been.

(Aside: although I come from a programming background where arrays were written as xxx(idx) rather than xxx[idx], I do appreciate that having to follow this convention may be a bitter pill for C programmers to swallow. So I have built a variation of the cpp preprocessor which allows the use of [] brackets in parameters instead of () ones, for example:

    cpp:
        #define xxx(dimen,idx) xxx[RANGECHECK(dimen,0,MAX_DIMEN)][RANGECHECK(idx,-10,42)+10]
        var = xxx(0,-4);

    gtcpp:
        $define xxx[dimen][idx] xxx[RANGECHECK(dimen,0,MAX_DIMEN)][RANGECHECK(idx,-10,42)+10]
        var = xxx[0][-4];

This preprocessor is not yet used in the uparse system.)

The AST array/tree is an array of plain integers! It is *not* an array of structs. Although superficially you might expect a little loss of consistency checking when accessing a tree node, with the risk of accessing an incorrect type of node, it is easy enough to add internal consistency checking to application code that this too has never been a problem - and the simplicity it adds to the code is a major win.

-----------

Parsing: this is all explained elsewhere, so a Cliff Notes summary only. The C code embedded in the parser is only executed after a successful parse. This makes things like making parsing decisions on the fly rather difficult - for example, C typedef statements effectively change the syntax of the language: some constructs can be parsed two ways depending on whether a particular token is a variable name or a typedef name.
To work around these problems (which are thankfully rather few) there is a mechanism in the parser which allows C code to be executed *while parsing*. This is a tricky construct to get right and best avoided. It's explained more fully in the main README.

Escaping of literal strings in the parser's .g file is tricky, as the escaping applies to what the parser generator reads. So double-escaping is sometimes necessary if the parsed string ends up embedded in some other piece of C code. You may need a bit of trial and error to get those right.

The grammar file describes a two-level parser: it handles lex-style token parsing using regular expressions. The regular expressions are enclosed in French-style chevrons (« »). This was chosen to avoid having to add yet another level of quote escaping, but it does make the .g files rather hard to edit. Sorry. Also, exactly what the regular expressions accept is not 100% clear - there is a little ambiguity in how line ends are handled. Best to crib from my examples rather than try to work out what the heck is going on. Note that the regular expression code has been modified to fully support Unicode.

The parser was originally designed to be able to parse files connected in virtual memory, without having to read them into RAM first. This was the primary reason that tokens returned by the parser consist of a pair of indexes, to the start (inclusive) and end (exclusive) of the token within that file. This was great until I tried to parse Algol 60, a language in which keywords are stropped by surrounding them with 'quote' characters. The stropping details mean that 'GO TO', 'GOTO', and 'GO' 'TO' are all considered the same. (Incidentally, because spaces are ignored in languages like that, these would also be allowed: 'G' 'O' 'T' 'O', 'G' 'OTO', etc.)
This is only a problem for modern parsing systems like the combination of yacc+lex: older parsers took it in their stride, because they performed a pre-processing phase called 'line reconstruction' on their input, which would remove comments and spaces, and reduce stropped keywords to a canonical representation (usually the upper-case text in 7-bit ASCII, with the top bit set...).

To accommodate this in uparse, the source code is no longer accessible as a memory-mapped file - it is first read into an actual array in RAM, with any pre-processing needed being done by user-supplied code embedded in the .g file as an initial block before the grammar rules. This has a knock-on effect on the regular expression lexer, which can no longer assume that the original input is ready to be accessed. Due to subtleties in mixed-language parsing, for example, this line-reconstruction pre-processing can actually change in mid-parse depending on context! So the lexer is not passed the data in advance - it has to request the data piece by piece as it lexes, via a callback to the parser, which in turn calls back to the line reconstruction code. Boy, is it complicated to follow :-( Anyway, it's in there, it's not well documented, and most people won't ever need it - but you had better not muck about with it unless you fully understand it, or something critical will break.

The regular expression lexer, incidentally, optimises lexing by pre-compiling the regular expressions. It does not go as far as memo-izing *calls* to the regular expression code, but could do so if efficiency considerations warranted it. (The key for memo-izing would be the combination of the regex and the offset into the source array, coupled with the multiple results of the regular expression call.)

The mechanism of this parser (as explained in the main document) is to use a set of ordered choices in a left-recursive top-down recursive-descent parse.
For compactness and efficiency the parser is table-driven, but the parsing algorithm would work just as well from a parser generator which creates procedures to parse each rule. (I have an older parser generator called 'tacc' which does exactly that. Table-driven is better.)

A hint: several of my own projects simply manipulate their input language, whether as a formatter or a source-to-source translator. Projects like this have to parse a language, but they also have to handle - and by that I mean retain, and possibly re-output in a modified form - white space and comments, which regular compilers tend to throw away. I've found that the simplest way to handle these extra items is to make the tokens that the parser finds include both the actual token and any preceding white space. For most applications, three pointers will do the trick: the first is an index to the start of the preceding white space, the second is an index to the first character of the actual token, and the third is the index of the byte following the final character of the actual token (because range pointers should always be lower-bound inclusive, upper-bound exclusive - trust me on this even if it's not obvious to you why at this point).

However, there are some languages where you are allowed white space - and even comments - *inside* a single token. For example, Algol 60 allows spaces in variable names, and Imp77 allows comments anywhere between {} brackets; these are entirely removed before the compiler ever sees the token, as in:

    Algol60:  This is a variable := 1;
    Imp77:    This is a variable {but with an embedded comment!} too = 1

If you find yourself in this situation, you'll need to modify the token code to record A) the whitespace and comments before the token; B) the token including the white space and comments inside it; and most importantly, C) the canonicalised token itself without whitespace or embedded comments; and maybe even D) the preferred representation of the token, with the spaces and capitalisation that should be used when generating source that references that token. (E.g. if a program refers to 'List', 'list', and 'LIST' in various places but they are all the same variable, you might want to examine all the occurrences and output only the most common form in generated output.)