Julian Brown, 21 January 2001. Updated 4 February 2001.
The original intermediate representation used by ARMphetamine ("Phetacode"), while just about adequate, suffered from a couple of fairly serious flaws which limited its usefulness. These included:
Ideally, we want a language which is amenable to 'real' register allocation and flow control. We want to keep everything well-defined and mind-numbingly simple, so it's easy to see when things are going wrong. We want to be able to do condition-code abstraction (see later), without hurting our heads.
Some things remain the same: the "chunk-based" profiling algorithm, the rationale behind not jumping into the middle of blocks, dynamic register allocation, the existence of an intermediate representation at all, etc. ph2 will primarily support the 32-bit addressing mode of the ARM, but the 26-bit addressing mode will be kept in mind for later addition.
In contrast to the original phetacode, we're going to use a variable-length bytecode, mainly to keep things neat but extendable, but also because we're only building a decoder in software, and it won't hurt much. Note though that it may be necessary to scan this code backwards, so each instruction shall be postfixed with its length in bytes.
Basic operations shall include:
None of these instructions deals with 'real' ARM registers, nor does it reference condition-code flags directly. Instead, a special set of (currently 256) registers is used to hold values used in calculations. A pair of instructions:
Are used to move registers from the ARM to the ph2 registers when they're needed, and move them back when they're done with. This means a register-allocation function need concern itself only with the fetch and commit instructions.
Each time a ph2 instruction generates a result, a new register is allocated. This is important for register-colour based allocation (I believe?). With 256 registers available in one byte of storage, this places a worst-case theoretical limit of 1kb on each chunk of code, but I don't see this being a serious restriction since most chunks will probably be much shorter than this.
If a ph2 instruction must modify ARM flags, it must be wrapped in a pair of instructions:
Writing these explicitly around instructions which modify flags has the advantage that a function to remove redundant flag calculations may modify (only) fexpect and fcommit instructions in-place. There is also an instruction,
Which is used when the following instruction requires a particular set of flags to be valid (eg, ADC, RRX). These instructions should greatly facilitate condition code abstraction.
allows a flag's value to be written directly.
When "meta" control flow is required, such as in the fiddly translation of shift-by-register instructions, the variations of the above instructions:
can be used instead.
ph2 code offers NO internal branch instruction. Instead, the natural grouping of instructions is the basic block, and a graph of basic blocks is generated on-the-fly as ARM code is translated to intermediate code. Each ph2 basic block has an associated condition, a TRUE destination block and a FALSE destination block. Although this makes translation from ARM code to ph2 code more difficult, the gain in ease of manipulation should make it worthwhile.
The exact algorithm for translating ph2 code to native x86 code has not yet been fully designed. It will hopefully include:
In full, the current ph2 instruction set looks like this:
|CONST||Load constant word||Immediate word (4-byte, little-endian) value|
|CONSTB||Load constant byte||Immediate byte value|
|FETCH||Fetch ARM register||ph2 r# (one byte), ARM r# (one byte)|
|COMMIT||Commit ARM register||ARM r#, ph2 r#|
|FEXPECT||Prime following instruction for setting flags||condition-code set (one byte)|
|FCOMMIT||Indicate that flags should have been set, write into ARM flags||condition-code set (one byte)|
|FENSURE||Ensure that flags are resident in native flags register||condition-code set (one byte)||NFEXPECT||Same as fexpect, but don't involve ARM flags||condition-code set (one byte)|
|NFCOMMIT||Indicate that (native) flags should have been set||condition-code set (one byte)|
|NFENSURE||Roughly the same as fensure||condition-code set (one byte)|
|FWRITE||Write bit zero of register into a flag||flag number, ph2 r#|
|XJMP||Transfer control to start of another chunk||Address of destination chunk (4 bytes)|
|LSL||Logical shift-left||dest r#, src1 r#, src2 r#|
|LSR||Logical shift right||dest r#, src1 r#, src2 r#|
|ASR||Arithmetic shift right||dest r#, src1 r#, src2 r#|
|ROR||Rotate right||dest r#, src1 r#, src2 r#|
|ROL||Rotate left||dest r#, src1 r#, src2 r#|
|RRX||Rotate with extend through carry flag||dest r#, src r#|
|RLX||Rotate left with extend through carry flag||dest r#, src r#|
|MOV||Copy register||dest r#, src r#|
|NOT||Copy negated register||dest r#, src r#|
|AND||Bitwise AND||dest r#, src1 r#, src2 r#|
|OR||Bitwise OR||dest r#, src1 r#, src2 r#|
|EOR||Bitwise Exclusive OR||dest r#, src1 r#, src2 r#|
|TEQ||Bitwise Exclusive OR, discarding result||src1 r#, src2 r#|
|TST||Bitwise AND, discarding result||src1 r#, src2 r#|
|ADD||Addition||dest r#, src1 r#, src2 r#|
|ADC||Add with carry||dest r#, src1 r#, src2 r#|
|SUB||Subtract||dest r#, src1 r#, src2 r#|
|SBC||Subtract with carry (in the ARM sense, ie rd = rm + rn + NOT(c))||dest r#, src1 r#, src2 r#|
|CMP||Subtract discarding result||src1 r#, src2 r#|
|CMN||Add discarding result||src1 r#, src2 r#|
|MUL||Multiply||dest r#, src1 r#, src2 r#|
|LDW||Load word||dest r#, addr r#|
|LDB||Load byte||dest r#, addr r#|
|STW||Store word||addr r#, src r#|
|STB||Store byte||addr r#, src r#|
|SWI||Generate software interrupt||(none?)|
|UNDEF||Trigger undefined instruction exception||(none?)|
|SYNC||Synchronise register allocation (dump all registers to memory)||(none)|
|END||Finish a basic block||(none)|
It is the intention that cycles will be (approximately) counted (decremented), and checked for zero-crossings at the end of each basic block, for the purposes of interrupt handling, which should hopefully provide "good enough" latency for most applications. In the unlikely event that enough instructions accumulate inbetween checks that two or more interrupts should have been fired, both can (probably?) be done simultaneously until the 'debt' is repaid. This will undoubtedly cause havoc with timing-sensitive code.
Memory access should use the two-entry TLB (not) documented elsewhere. There is a big problem though, and that is restoring registers after an exception. In general, we won't be returning to the original recompiled chunk after an exception has occured (well, we never will be), but before it happened our register values could have been anywhere.
We're going to take the stance that page faults, access permission faults, etc. hardly ever happen, so try to streamline the most frequent case (that is, hopefully, we have the right TLB entry and the access has the correct permissions). In that case, we run inline code. The next-worst scenario, TLB miss, will be dealt with by a function call, in the same way that memory access works in the current system. This can harmlessly continue running the same recompiled chunk.
In the worst scenario, page fault/access denied, one way of coping with the situation would be to have a look-up table associated with each load/store instruction in each block, which stores the mapping between ARM and x86 registers at the corresponding point in the code, so the complete transfer to the memory-based register file need be done only when an exception occurs. There would probably be a good deal of evil which needed to be done to get that working properly though - whether it'd be worth the effort over just saving all registers before memory accesses and hence avoiding the problem at the expense of potentially less efficient code, I don't know.