We are now at the point where declarations look about right and the type
information from a declaration is linked to the variable name via the symbol
tables. However, that information is not in an especially usable form and
currently needs far too much special-case code to extract when needed. The
next stage of development will therefore be to simplify the type information
structures in light of how they're used (or abused) in the first iteration of
a working transpiler.

Language variation notes:

See several 'EMAS2900 to EMAS3' notes in
http://www.ancientgeek.org.uk/EMAS/ERCC_User_Notes/ERCC_User_Note_058.pdf

   In EMAS 2900 IMP80, the %array %map ARRAY can take the name of an existing
   array as its second parameter, the format of that array then being used by
   ARRAY. This is not permitted in EMAS-3 IMP: the second parameter of ARRAY
   must be an explicitly declared %array %format.

TEMPORARY: This is pretty much the only 'dirty' hack in this code: Imp array
initialisation allows for " ()" but in order to calculate how many repeats are
needed to fill the array when the " (*)" form was used, we needed to know the
bounds of the array. This was not information that was easy to access at the
point where the array initialisation was output, so it was passed globally as
a hack. Once the translator is more or less finished, I'll revisit this and
find a way to pass that information to where it is needed legitimately.
(Which I suspect will involve a little bit of grammar restructuring, which is
something I'm trying to minimise.) This is also waiting for the
declarations/symbol table/type-structure rewrite.

Despite the apparently huge list of things below still to be worked on, there
are really only three major issues stopping the majority of programs from
being compiled...
   type matching for a procedure's parameters;
   subfields of records; and
   non-0 lower-bound array indexing

The solution to the first two is basic to the structure of the translator:
pass information downwards that helps the code find info on what it is about
to translate (eg symbol table info for record fields) and pass back *up*
bottom-up type info on leaves. Eg assigning an integer * integer to a real:
the expression is evaluated bottom up and creates an integer product, and
that product is then converted to a real. We do not know when evaluating the
leaves that they will eventually be assigned to a real, so we do not convert
the individual operands to a real. Instead we wait until we reach the
assignment of real = integer expression, after the expression has been
evaluated, and only then see the need to do the type conversion.

CURRENT JOB: I think records inside record declarations are being stored as
C_DECLARE_RECORDFORMAT declarations rather than as a C_RECORD_TYPE ... need
to find where this happens and correct it. Will also affect evaluation of
record fields in lvalues. DONE.

Then get back to rewrite of lvalue evaluation...

=======================================

later: offer option to use more verbose form of translation to achieve
compatibility with standard C:
   using PDS's case statement instead of jump table;
   make variables in begin/endofprogram blocks top-level statics;
   ex-bed nested procedures when they don't access any variables other than
   local or top-level.
   Use C strings and convert Imp I/O procedures to printf/scanf form.
   Assign to stdin/stdout when calling selectinput/selectoutput.
   Add dummy entries to arrays when LB is positive and below some threshold.
   Make array declarations pointers to a zero-based array plus/minus offset.
Only works for 1-D arrays *or* when n-D are implemented as arrays of pointers
(which is not good C style but an acceptable way to translate Imp as long as
the pointers are to a properly-laid-out contiguous 1-D array as would have
been the case in Imp, so that pointer arithmetic on pointers within the array
still works). Initial support for 1D arrays is probably acceptable.

=======================================

related: handle array accesses in at least 3 ways:

1) 1D arrays: [(idx)-base] vs (idx) using macros

2) 2+D arrays: as above, but issues when passed as parameters, vs iliffe
   vectors vs dope vectors -- ie a 2D array would be a C array of pointers to
   slices in a proper 2D array (choice of row vs column-major storage - Imp
   and C differ!)

3) full dope-vector where array access is via call to code that accesses the
   dope vector rather than the array by name in a readable C form

   eg %result = (int)access_array_4byte(fred_DV, 1, 2, 3)
   vs fred[1][2][3]
   vs fred[1-base1][2-base2][3-base3]
   vs fred(1,2,3)

======================================

The conversion of an if/then/else/elseif chain where all the conditions are
constant, to #if/#elif/#else/#endif form, for conditional compilation, is
broken - I think when there are if's without else's in the body of the above.
I came across this when translating p5impcompiler to C ... needs some serious
work to fix. Fortunately the use of standard if/then as a substitute for #if
in Imp is restricted to a few EMAS programs and not an issue we hit too
frequently. I haven't yet produced a minimal example for the test suite but
that shouldn't be too hard, and needs to be done before I can start fixing it.

====================================================

%ifc

c = '1' %start

The blank line after %c is not skipped. It should be, at least for imp80.
This is a line_reconstruction problem.
Similarly:

s = "OPTIONS" %then %start; monopt = 1; %finishelseifc
{GT: messed up an edit - check against original if line below is OK }
s = "SET" %then %start; setsigs flag = 1; %finishelseifc

And...

TCELL_PTYPE <- PTYPE
%if PTYPE&x'c000'=x'8000' %then USEBITS = 2; ! externals presumed%c
'used'
TCELL_UIOJ <- TCELL_UIOJ&X'3FF0'!USEBITS<<14

I'm not honoring %c in a ! comment

==========================

These are currently crashing, but were not before, caused by addition of
extra typefield slot in all tuples.
*/

ls -l tests/validation-suite/extern.imp
ls -l tests/imptoc/rt25a.imp
ls -l tests/imp68k/DJRBUG.imp

/*
- although might be OK now, after a bit of dicking around with the extra
field stuff.

===========================

Profiling: (excluding CHECKIDX which is only used in debugging version)

str_to_pool can recover a little bit of a win but clearly parse() (in
../parser.c) is the workhorse where all the time goes. I suspect the issue is
not primarily the efficiency of the code of parse() as it is the amount of
backtracking being done (which is grammar-dependent) and the expensive
char-at-a-time creation of names and numbers etc as opposed to using BIPs.

However I've attempted to add some explicit time-keeping for failed parse
attempts, and it doesn't show a high overhead; although I now see that the
way of adding up the time may not be kosher and double-counting the good time
is throwing everything off.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 92.41     77.42    77.42        1    77.42    80.68  parse
  6.83     83.14     5.72   219646     0.00     0.00  str_to_pool
  0.29     83.39     0.24 57224483     0.00     0.00  makespace_
  0.15     83.51     0.12 44317315     0.00     0.00  ONDEMAND
  0.11     83.60     0.09                             __x86.get_pc_thunk.bx
  0.04     83.63     0.03                             _dl_relocate_static_pie
  0.02     83.65     0.02   608415     0.00     0.00  mktuple
  0.02     83.67     0.02   204104     0.00     0.00  semantic
  0.02     83.69     0.02   108085     0.00     0.00  line_reconstruction
  0.02     83.71     0.02    11900     0.00     0.00  lookup_with_scope
  0.02     83.73     0.02    10601     0.00     0.00  compile_expression_inner
  0.01     83.74     0.01  2492419     0.00     0.00  match
  0.01     83.75     0.01   153184     0.00     0.00  ctuple_inner
  0.01     83.76     0.01   108479     0.00     0.00  storec
  0.01     83.77     0.01    22095     0.00     0.00  compile_inner
  0.01     83.78     0.01     3111     0.00     0.00  out_inner
  0.00     83.78     0.00   294976     0.00     0.00  xfgetc
  0.00     83.78     0.00   109337     0.00     0.00  stores
  0.00     83.78     0.00    77598     0.00     0.00  lit
  0.00     83.78     0.00    39201     0.00     0.00  C
  0.00     83.78     0.00    25018     0.00     0.00  detuple_inner
  0.00     83.78     0.00    14802     0.00     0.00  S_inner

------------------------------------

DC(10): VCURSOR (STATE, X, Y); %return
DC(11): HCURSOR (STATE, X, Y); %return
DC(*): %signal 14, 8
%end

signal causes assertion failure adding C_SIGNAL to a C_SEQ list.
presumably forgot to wrap a UI in C_SEQ somewhere...

------------------------------------

IDIOT! BFOTO! You don't need variant record info in the name table entry for
a record, *only* in the C_DECLARE_RECORD tuple for when the declaration is
generated! Once it is declared, the fields might as well all be separate as
far as later code generation is concerned.

------------------------------------

I thought I had fixed this? Apparently not...
*/

//%begin
%comment = /* THIS LINE IS BEING PASSED THROUGH VERBATIM. WHY? */
! some problem with comments containing an '=' ???
! Might be related to the grammar hack about initialisation
//%endofprogram

/*
------------------------------------

need option for variation of Imp where %C is obeyed within a comment.

!* THESE MAY BE 16 OR 32 BIT DEPENDING ON Q *
!***********************************************************************
! ABORT %UNLESS 0<=H<=1 %AND 0<=Q<=1 %AND 0<=N<=127 %C
%AND OPCODE&1=0
PLANT(OPCODE<<8!H<<8!Q<<7!N)
%IF Q#0 %THEN PLANT(MASK<<8!FILLER)

But also need to handle variation where it isn't. (I thought there was one
but maybe there isn't?)

------------------------------------

grammar: still doesn't recognise %RECORDNAME OP(RD) as same as
%RECORD(RD)%NAME OP - fix this, as it happens too often with old IMP9 sources.
Note: needs grammar change and table rebuilding, which was removed when
migrating files from imps/ directory to imptoc/ one.

------------------------------------

ptr = ptr+1 %and insert(M'==\', p_spacop, p_spacop) %and ->inc
                            ^ \\
Ptr++;Insert('==\',P.Spacop,P.Spacop);goto Inc;

----------------------------------

When compiling ../../imp9/imp918s-77.imp ...

? Declare(Ctype, C_DECLARE_SWITCH) at scope 3 <- 5124426 at line 322
Looking up switch label dest for jump - declaration was at 5124426
? C_TYPE_PARAMETERS not yet handled - retrying as a C_TYPE_INT
AstOP(head): C_IF
ctype: C_SEQ
imps: ./compile.c:785: append_to_type_inner: Assertion `AstOP(head) == ctype' failed.
../new-parser/imps/imps: line 131: 30844 Aborted (core dumped) ~/src/compilers101/new-parser/imps/bin/imps $other /tmp/$$.imp > /tmp/$$.c

void append_to_type_inner(int ctype, int *list, int item, int line) {
  // meant to be appending two lists, but seems to be appending a bare item
  // to the list in places
  int stopper = 0;
  if (*list == -1) {
    if (item == -1) return;
    *list = ctuple(ctype, item, -1);
    return;
  }
  if (item == -1) return; // end of list should already be -1.
  int head = *list;
  while (rightchild(head) != -1) {
    if (AstOP(head) != ctype) {
      fprintf(stderr, "AstOP(head): %s\n", CAstOPName(AstOP(head)));
      fprintf(stderr, "ctype: %s\n", CAstOPName(ctype));
      assert(AstOP(head) == ctype);
    }

----------------------------------

%if sym = '.'
%and '0' <= next symbol <= '9' %start

if (Sym == '.' && '0' <= Nextsymbol() && Nextsymbol() <= '9') ...
                         ^_______________^
                                 |
                          side effects!!!

=> if (Sym == '.' && ({tmp1 = Nextsymbol(); '0' <= tmp1 && tmp1 <= '9';}) ...
or if (({tmp1 = Nextsymbol(); (Sym == '.' && '0' <= tmp1 && tmp1 <= '9');})) ...

tests/test178.imp - sub sub record fields not yet done

assign address of object to pointer operand: if it is intrinsically an
address, apply C_DEREFERENCE_ to it
compare addresses of objects - use C_ADDRESS()
Hope they cancel out correctly.

array and record parameters, %name
result type of fn/map/pred parameters
type of record fields
record subfields

tests/test213.imp - simple %name parameters not handled properly yet

----------------------------------

TO DO: in several places where C requires a constant (eg switch labels, init
for static scalars) we want to but cannot use constant expressions. So we
have to substitute literal constants instead. It would be nice to output the
named constant or expr as a comment alongside the //literal, eg
*/
   Sw(10 /*begin*/):
/*
- at the moment, I have to manually exbed some static int initialisation (eg
in chimps77.c) because const expressions or named const ints are not allowed.
I could use the folded literal constant, as I already do in switch labels,
but so much context is lost when you do that, and the program becomes much
less maintainable with effectively hundreds of 'magic constants' rather than
symbolic constants.

----------------------------------

Old EMAS Imp included routine PPROFILE in its perms. Turns out it was for
(you'll never guess this) profiling! We might be able to simulate this with
the cyg_ code on entry and exit, if we add counters and an array for each
procedure... Note the pdp15 perm which includes "%routine pprofile" may well
be a typo and it should be "%routinespec pprofile"...?

*OR* use gcc option -pg and gprof command.
(I'm not entirely clear on the difference between prof and gprof)

----------------------------------

Note: I found one old file in the EMAS archives that was presumably
originally input from a card deck, where the source code is 72 chars wide and
characters 73-80 are ignored (used to line-number some of the cards for
sorting). Actually the section of the code with the data in the card number
field is a const table, so at first glance it does not look wrong, although
it actually is syntactically incorrect because a comma is absent. Anyway, if
a file is in this format, only the first 72 chars of every line should be
used.

----------------------------------

TO DO: pass in 'extra' info to compile - record subfield list, or type info
that is wanted from expr.

Declare C_TYPE_PARAMETERS I at scope 2 <- 84611 at line 5061
should be DECLARE_SCALAR of C_TYPE_INT with storage type set to
C_STORAGE_param

void Calgol(double C, _imp_string *S, void F(void), int *N);
int G(int *N);
N = 3;
Calgol(A, S, G(), N);

G not G(). Possibly even &G.

----------------------------------

I hit some problems translating old code with a lot of labels
('allimpc1.imp'): they were numerical labels and there were multiple '1:'
(i.e. L__1:) occurrences within main(). Looks like this may be solvable in C
using 'local' labels, declared as '__label__ L__1;', which allow a label to
be within the scope of a {} block. (In IMP, labels are unique within routines
and begin/end blocks, but not cycles or start/finish blocks, which also map
to {} blocks in C.)

(The problem with label clashes turns out to also apply to switch labels,
since ->SW(3) generates Sw_3: which is a regular label. So if the programmer
calls all his switches "SW" we get clashes.)

https://gtoal.com/history.dcs.ed.ac.uk/archive/staging-area/JHB/ercc,%20emap,%20emas/Edinburgh%20IMP%20Language%20Manual%20(with%20updates%20to%201982).pdf
Control 4.7 (pdf #43)

Confirmed: labels are in a separate namespace in old EMAS imp. Not in Imp77!
Don't know about Imp80. Numeric labels are in the range 1 to 16383 in old
EMAS imp.

I've now seen an actual example (in nrimp10s) where there is a label ST and
an integer ST. I don't think it hurts to err on the side of safety and have
both namespaces (though a compile time switch would be preferred).

----------------------------------

See also P44 for restrictions (cannot jump *into* start/finish or
cycle/repeat blocks). Out of or within seems OK. Labels have block scope.

----------------------------------

Bunch of stuff to be handled re single-quoted strings vs double-quoted
strings. Not worth mentioning here yet, as it is very obvious when you
translate one of those programs, so it will show up again eventually. Same
applies to any major missing features that we already have a test case for.

printstring('An old-style string') *should* be compiled as a "string" because
of the type of parameter to printstring.

----------------------------------

Imp parameters are evaluated left to right; C leaves the order of argument
evaluation unspecified (some compilers go right to left), so create a switch
to force either form. If Imp's order is needed we have to assign any
parameters with side effects sequentially to temporaries first and then pass
the evaluated values in the parameter list. It is almost a necessity to know
whether a function has side effects. Doing this properly involves computing
transitive closures.

----------------------------------
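
A rough sketch of the "assign to temporaries first" idea from the parameter
evaluation order note above. Everything here is made up for illustration:
next_value, combine and the tmp1/tmp2 naming are hypothetical and are not
what the translator currently emits. It only shows the shape of the output
that would force Imp's left-to-right order.

```c
// Hypothetical stand-ins for translated Imp procedures (assumed names).
int counter = 0;

// Has a side effect: bumps the global counter and returns the new value.
int next_value(void) { return ++counter; }

// Pure function, so the order of its own arguments' evaluation is the only
// thing that can change the result.
int combine(int a, int b) { return a * 10 + b; }

// Imp (hypothetical):  r = combine(next value, next value)
// A direct translation, combine(next_value(), next_value()), lets the C
// compiler pick the order of the two calls.
// Hoisting each argument into a temporary pins the order to left to right.
int combine_left_to_right(void) {
    int tmp1 = next_value();  // first parameter, evaluated first
    int tmp2 = next_value();  // second parameter, evaluated second
    return combine(tmp1, tmp2);
}
```

With counter starting at 0, combine_left_to_right() gives 12 whatever order
the compiler would have chosen for the direct call. Parameters known to be
free of side effects would not need a temporary, which is where the
side-effect analysis comes in.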