Scanner and Parser for "C"

(a.d. 2004, by DoDi)

This Delphi (4/7) project provides a scanner and parser for C source files, and a simple translator from C to Pascal. Information about scanning and parsing C, and about the usage of the supplied projects and source files, is spread over several related documents. Impatient readers will (hopefully ;-) find the most important information below.

The WinToPas Project

This Delphi project is used during the implementation and debugging of the CScan scanner and CParse parser for C source files. The most interesting application of WinToPas.exe currently is the import of C header files, so that C libraries can be used in Delphi or other (FreePascal...) development environments.

This is not an end-user project. The GUI is quite spartan, but you can improve it yourself and contribute your efforts to this project. I for my part concentrate on the technical support of the scanner, the parser, and the Pascal output module. The program is also not reentrant; you should exit and restart it before parsing the same or a different project.

The File Menu

"Select" lets you select the file to be processed. You can also enter the full file name into the File field.
"Preprocess" produces a preprocessed *.i text file from the selected file.
"Parse" parses the selected file and creates a *.txt intermediate file.
"Convert Types" creates a *.pas unit from the *.txt intermediate file.
"Save Metadata" saves the *.txt intermediate file.
-
"Stop" may be used to abort the scanner or parser, in case of problems.
"Exit" should just do that ;-)

The View Menu

This menu provides several dumps of the intermediate data, after parsing:

"Config" shows the *.def files that were just used.
"Macros" shows all defined macros.
"Scopes" shows the parsed modules, and the global symbol table.
"Symbols" shows all the macros and symbols.
"Types" shows all typedefs.

The Test Menu

This menu only contains debug commands:

"Scanner" invokes the scanner for scantest.c.
"Parser" will parse the file parsetest.c.

Quickstart

If you want to convert existing C header files into Delphi import units, please follow these steps:
  1. edit your main project file (the default file is wintest.c)
  2. run WinToPas and Select your file
  3. select Parse to create the intermediate file
  4. select Convert Types to produce the *.pas unit.
If you want to convert e.g. sample.h into sample.pas, create sample.c as your project file. The project file must contain at least the search path for the header files and a reference to the source file, e.g.:

#pragma Include "c:\mycompiler\include\"
#include "c:\myprojects\sampleproject\sample.h"

Unfortunately such a simple setup may not be sufficient, and you'll have to use a more detailed setup. In that case have a look at the supplied compiler specifications (e.g. BCB4.def, VC7.def). The VC7.def file may be used for various Microsoft C compilers. Once you have found an appropriate definition file, move the #pragma Include line into the user.def file, and #include the *.def file instead. E.g.:

#include "VC7.def"
#include "c:\myprojects\sampleproject\sample.h"

If this still doesn't work, please read the instructions and explanations in the following sections.

Of course a set of standard header files, shipped with the program, would simplify the use of the converter. Any contributions? ;-)

The Project Files

The parser must know which file(s) to process, and which compiler to emulate. All project files must be valid C source files, which are #included into the project main file. The supplied files can be used as templates for your own projects, and the modifications should not require too much knowledge of C, I hope. The file extensions should reflect the contents of the files, e.g.:
*.c for C source files
*.def for settings specific to a project, compiler or installation
*.<whatever you like> for other settings

The names of the created files are derived from the project file, which you Selected in the Open dialog or entered manually in the edit box at the top of the WinToPas main window. The following files can be created from e.g. sample.c:
sample.txt - the intermediate file
sample.pas - the converted Pascal unit
sample.i - the preprocessed source file (traditional mode)
sample.l - a more verbose listing of the preprocessed file

Let's start with wintest.c, the default project file. In the supplied version this file contains a list of compiler tests, similar to a Delphi project group file. Only one of these projects should be enabled at a time.
In the BCB4test.c file you'll find two #includes, the first specifying the compiler, and the second specifying the source file(s) to process.
The BCB4.def file contains the settings which I found required to parse Windows.h as supplied with the Borland C++ Builder version 4. The supplied *.def files should need no modifications, except for the user.def file:
The user.def file contains the search paths to your installed compiler(s). Please update all paths as appropriate for your system. The compiler need not actually be installed; it's sufficient that the directories with the compiler specific header files (*.h) exist somewhere on your disks.
The W32.def file contains some required settings for most Windows.h versions, regardless of any specific compiler.

If you want to add another compiler definition, please follow these guidelines for the corresponding file structure:
<compiler><version>.def contains the version specific settings, and then includes the <compiler>.def file.
<compiler>.def contains all the #defines that are assumed (preset) by the compiler.
Of course you can also specify all settings in a single file, and not worry about a separation into compiler family and version specific settings for now.
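
As an illustration of this two-file layout, the fragment below sketches a definition pair for a hypothetical compiler "MyCC". The file names MyCC5.def and MyCC.def and the macro __MYCC__ are invented for this example and are not among the supplied files; _WIN32 merely stands in for whatever the real compiler presets.

```c
/* ---- MyCC5.def (hypothetical): version specific settings ---- */
#define __MYCC__ 0x0500     /* invented version macro for this sketch */
#include "MyCC.def"         /* then include the compiler family file */

/* ---- MyCC.def (hypothetical): presets shared by all versions ---- */
#define _WIN32 1            /* example of a macro the compiler would preset */
```

A project file would then #include "MyCC5.def" in the same way as the supplied VC7.def.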

Pragmas

#pragma directives are reserved for compiler specific purposes. For the C parser I added a #pragma Include that specifies the search path for #include files. This pragma can have multiple arguments, where the first argument specifies an absolute path to the include directory, optionally followed by path fragments that refer to subdirectories, in the form "+<subdir>". Typically standard header files in subdirectories are referred to by #include <sys\types.h>, so that the file is searched in the given subdirectory relative to the directories in the include path, but sometimes such a subdirectory specifier may be missing. Then you must locate that header file on your disk, and add the directory to the search path. If the directory is a subdirectory of a directory already specified in a #pragma Include, you can use the short syntax to append the subdirectory to that pragma.
More #pragma lines can be added for different directories (cumulative).
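
The effect of the "+<subdir>" fragments can be modelled with a few lines of C. This is only a sketch of the behaviour described above, not the code used in the scanner: the function name expand_include_args, the limits, and the assumption that the base path ends in a backslash are all invented for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_DIRS     8
#define MAX_PATH_LEN 260

/* Expand the arguments of one pragma into a list of directories.
   args[0] is the absolute base path (assumed to end in '\\'); every
   further argument of the form "+<subdir>" adds <base><subdir>\.
   Returns the number of directories written to out. */
int expand_include_args(const char *const args[], int nargs,
                        char out[][MAX_PATH_LEN])
{
    int count = 0;
    assert(nargs >= 1 && nargs <= MAX_DIRS);
    snprintf(out[count++], MAX_PATH_LEN, "%s", args[0]);
    for (int i = 1; i < nargs; i++) {
        if (args[i][0] != '+')
            continue;   /* this sketch only understands "+<subdir>" */
        snprintf(out[count++], MAX_PATH_LEN, "%s%s\\", args[0], args[i] + 1);
    }
    return count;
}
```

With the arguments "c:\mycompiler\include\" and "+sys", the model yields the base directory itself plus its sys subdirectory.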

Some Notes

Be careful with the VC version number: newer VC versions (>600) are quite incompatible with the C standards! Unless really required, I don't want to bloat the scanner and parser with compiler specific extensions.

I have not been very successful with a set of Linux header files (gnutest.c); some header files seem to be missing. But apart from the missing #defines and type declarations, the parser also seems to digest the gcc specific constructs and deviations from the C standards. The parser really should be tested on a working Linux system...

Now you can run WinToPas and Parse the selected file. Be patient, parsing Windows.h and all further #included files can take some minutes, and the created wintest.txt file can require more than 2 MB on your hard disk. [ToDo: turn off progress log] Fortunately this lengthy operation is required only once, in most cases. In case of problems you may Stop the parser, edit the definitions in wintest.c, and restart the parser.

Once a file with all declarations has been produced, you can invoke the type importer with Convert Types. The declarations in wintest.txt are then converted into a wintest.pas unit, in a few seconds.

Now you can use the created *.pas file in your own projects. Some circumstances may require editing of this file, because some differences between C and Pascal cannot be resolved automagically, or at least would require much more conversion code.

Known Problems

Some of the detected problems are flagged with {???} or other comments in the created .pas file. The compiler may issue error messages around these places, and the comments may give you an idea of what is wrong. Then it's up to you to cure the problems, or to comment out the offending lines.

AFAIK Delphi 4 cannot call C procedures with a variable number of arguments, indicated in the C code by "..." or "va_list". Newer versions seem to support this construct.
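
For readers less familiar with C, the following sketch shows the kind of declaration meant here. The function sum_ints is invented for illustration; only the "..." in the parameter list and the va_list handling matter.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums 'count' int arguments that follow the fixed parameter. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;
    assert(count >= 0);
    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);   /* fetch the next unnamed argument */
    va_end(args);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) passes a different number of arguments than sum_ints(0), which is exactly what a fixed Pascal parameter list cannot express.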

C bitfields also have no equivalent in Delphi. An intended solution is a change of the Record type into Object, with added properties and methods to emulate the access to bitfields.
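
What such emulating methods would have to do can be sketched in C itself. The struct and function names below are invented, and the bit positions (ready in bit 0, mode in bits 1..3) are an assumption: the real positions depend on the C compiler that produced the layout.

```c
#include <assert.h>
#include <stdint.h>

/* The kind of declaration a C header might contain. */
struct flags_c {
    unsigned ready : 1;
    unsigned mode  : 3;
};

/* Emulation: one storage word, accessed with shift and mask. */
struct flags_emulated {
    uint32_t bits;
};

#define MODE_SHIFT 1u
#define MODE_MASK  0x7u

unsigned get_mode(const struct flags_emulated *f)
{
    return (f->bits >> MODE_SHIFT) & MODE_MASK;
}

void set_mode(struct flags_emulated *f, unsigned value)
{
    assert(value <= MODE_MASK);
    /* clear the old field, then store the new value in its place */
    f->bits = (f->bits & ~(MODE_MASK << MODE_SHIFT)) | (value << MODE_SHIFT);
}
```

In a Pascal Object the two accessor functions would become the read and write methods of a property.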

In earlier C standards nested structs and unions had to be given explicit names. This seems to be no longer required nowadays(?), and in some situations the use of variant records is compatible with the omission of such member names. But at least two situations require more knowledge or modifications of the code:

Delphi only allows unions (variant record parts) at the end of a Record. Currently unnamed unions in the middle of a Record, followed by further fields, produce compiler errors. In these cases the misplaced unions not only have to be given a name and an explicit Record type, but these names also have to be inserted into cross-compiled code on every access to these nested fields.
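
The following C sketch shows the problematic case and the named variant that the conversion needs. All type and member names are invented for illustration. Note how the access path changes from m->as_int to m->value.as_int once the union has a name.

```c
#include <assert.h>

/* Unnamed union in the middle of a struct (standard since C11,
   a common compiler extension before that). */
struct msg_anon {
    int kind;
    union {
        int   as_int;
        float as_float;
    };
    int trailer;            /* a field after the union: the Delphi problem */
};

/* The same data with an explicit union type and member name. */
union msg_value {
    int   as_int;
    float as_float;
};

struct msg_named {
    int kind;
    union msg_value value;
    int trailer;
};

int read_anon(const struct msg_anon *m)   { return m->as_int; }
int read_named(const struct msg_named *m) { return m->value.as_int; }
```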

One general problem can occur with the alignment in complex data structures. Such mismatches require a careful study of the original C code and the intended compiler; it's impossible to create properly aligned record layouts without knowledge about the struct layout as produced by the C compiler. No such problems can occur when the whole C code is translated into Pascal code, since then the Pascal compiler will produce a unique layout of the Record types.
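
A small C sketch illustrates how the same field list can end up with two different layouts. The struct names are invented; #pragma pack is a compiler extension, though one supported by the common Windows and GNU compilers. The default offset of 'value' is typically 4, but that depends on the compiler and its settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler may insert padding after 'tag'. */
struct rec_default {
    char    tag;
    int32_t value;
};

/* Byte-packed layout: no padding between the fields. */
#pragma pack(push, 1)
struct rec_packed {
    char    tag;
    int32_t value;
};
#pragma pack(pop)

size_t default_value_offset(void) { return offsetof(struct rec_default, value); }
size_t packed_value_offset(void)  { return offsetof(struct rec_packed, value); }
```

A converted Record must reproduce whichever of these layouts the C library was actually compiled with.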

But C allows not only the omission of member names, but also the omission of type declarations, specifically of struct declarations. In such cases the undefined structures should only be passed as untyped references (Const/Var parameters), and no references to the unknown Record fields are possible.
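
In C such an undefined structure typically looks like the sketch below; the names session and session_is_open are invented for illustration. The struct is only ever handled through a pointer, so its fields never need to be known.

```c
#include <stddef.h>

/* Only the name is declared; the fields are never shown to the caller. */
struct session;

/* The pointer can be passed around although the size and the fields
   of struct session are unknown at this point. */
int session_is_open(const struct session *s)
{
    return s != NULL;
}
```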

The evaluation of "sizeof" currently results in a constant value (-1), which may result in negative array dimensions. This value was chosen to produce compiler errors for such constructs. In some future version the expressions may be converted into valid Pascal expressions, so that the compiler can evaluate "sizeof" and other macros.
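
A typical affected construct is an array whose dimension is computed from "sizeof", as in this invented example. With "sizeof" evaluated to -1, the dimension below would presumably come out as -4, which the Pascal compiler then rejects.

```c
/* A small header struct and a buffer sized for four of them. */
struct header {
    short id;
    short len;
};

unsigned char buffer[sizeof(struct header) * 4];
```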

Also related to the evaluation of constant expressions is the evaluation of the initializers for complex data types. Currently only the values of simple (ordinal) constants are evaluated properly, whereas variables or constants of string or more complex data types are typically represented as "= 0", resulting in syntax errors.

A last known problem can arise from the ordering of type declarations. C compilers are somewhat lazy, and leave it up to the linker to resolve missing or misplaced type declarations. In proper Pascal code, by contrast, Forward declarations must be inserted, or the declarations must be reordered to avoid the need for such forward declarations.
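
The following sketch, with invented type names, shows an ordering that C accepts without complaint: struct list refers to struct node before the fields of struct node have been declared.

```c
#include <assert.h>
#include <stddef.h>

struct node;                    /* name only, no fields yet */

struct list {
    struct node *head;          /* pointer to a type not yet defined */
};

struct node {                   /* the full definition arrives later */
    int          value;
    struct node *next;
};

int list_first(const struct list *l)
{
    assert(l != NULL && l->head != NULL);
    return l->head->value;
}
```

A direct 1:1 translation of these declarations keeps this order, which is where the Forward declaration or the reordering becomes necessary.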

ToDo List

Some improvements still have to be implemented in the type converter. In most cases it's not yet clear how to cure some of the known shortcomings and omissions; these will hopefully be handled in some future version.

The handling of C macros (#defines) will be improved in one of the next versions. Currently all macros are expanded by the preprocessor, so that the names of all #defined constants are replaced by their defined values. This simplifies the evaluation of constant expressions, but in many cases it's desirable to retain the constant names or function calls in the converted code. A possible solution is described in a related document.
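
A tiny invented example shows what is lost. After preprocessing, the declaration below reads "char name[32];", so the converted Pascal code can only show the number 32 and no longer the name MAX_NAME.

```c
#define MAX_NAME 32

/* The preprocessor replaces MAX_NAME by 32 before the parser sees it. */
struct entry {
    char name[MAX_NAME];
};
```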

Some more problems with macros:

Macros with empty bodies can cause problems in conditional expressions, when a value is required. Currently the macro is expanded to "" (nothing) instead of a zero value "0". In some header files this situation is handled with constructs like:
  #if sym + 0 > 5
where the "+" is interpreted as a unary or binary operator, depending on whether "sym" expands to nothing or to some value.
A proper solution would require that the expression parser can determine when a macro expands to nothing, so that a synthetic value token (zero constant) can be substituted.
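
The trick can be tried out with any C compiler. In this sketch the macro names are invented; the two #if blocks show the unary and the binary reading of the "+".

```c
#define EMPTY_SYM               /* defined, but with an empty body */
#define VALUE_SYM 7

/* "EMPTY_SYM + 0" expands to "+ 0": a unary plus, value 0. */
#if EMPTY_SYM + 0 > 5
#define EMPTY_SYM_IS_LARGE 1
#else
#define EMPTY_SYM_IS_LARGE 0
#endif

/* "VALUE_SYM + 0" expands to "7 + 0": a binary plus, value 7. */
#if VALUE_SYM + 0 > 5
#define VALUE_SYM_IS_LARGE 1
#else
#define VALUE_SYM_IS_LARGE 0
#endif
```

Without the "+ 0", the first condition would expand to "> 5" and fail with a syntax error in the preprocessor.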

The names of macro arguments are not checked for language conflicts in uToPas. In fact currently the macro text is copied into the output file without tokenization, so that no name substitution will ever occur. This should be changed in some future version, when macros really can be treated as procedures or functions.

Cross Compiler

In the context of the C-To-Pascal translator, a Cross Compiler is a translator from one programming language into another, on source code level, not a compiler with binary output files.

One step beyond the header converter is already implemented: the code in procedures is now parsed and translated as well. Nonetheless the converter still does a 1:1 translation, with only minor modifications to the code, like collecting local variables into a single Var section.

An improved translator will work with the parse trees created from the C source files. The transformation of these trees requires different procedures, or even a different language, aimed at list processing. So the cross compiler will become an entirely new project, related to the current translator only by the use of the intermediate files that are produced by WinToPas.

The intended cross compiler shall be focused on GNU and other Open Source C code, so that the huge C source library can be converted into "native" Pascal code. Many facets of such a translator can be implemented in self-contained projects, and I'll do my best to provide all required information and an interface for third-party projects. Any assistance is welcome!