Last modified: 2001-08-30
Dynamic Recompilation - Frequently Asked Questions
The name DRFAQ isn't quite correct since I deviated from the typical question-and-answer structure, but as long as no real FAQ is compiled this page shall answer most questions about Dynamic Recompilation.
Maybe I should rename it to "Everything you always wanted to know about Dynamic Recompilation but were afraid to ask" someday...
"DynaRec" and other Terms
More and more of the newer emulators replace their interpretive cores with a DynaRec. So what does this term mean? First of all, DynaRec is an abbreviation of Dynamic Recompilation or Dynamic Recompiler. The term recompilation is a bit confusing since it is ambiguous and doesn't exactly name the process performed. The problem is that you can take some source code and "recompile" it on a different platform. But dynamic recompilation means that you take the binary code (e.g. the executable file) and translate it into native machine code on another machine during runtime. Therefore I prefer the term Dynamic Binary Translation, but DynaRec is already well-established.
Some people even distinguish between dynamic compilation and recompilation, where the latter handles self-modifying code and the former does not. But most people use these terms synonymously and don't imply whether self-modifying code is handled or not.
Given the definition above it becomes obvious that a DynaRec is indeed similar to the just-in-time (JIT) compilers used in some Java runtime environments. The difference is that a DynaRec has to deal with low level machine code, probably full of dirty programming tricks, rather than more high-level bytecode produced by a compiler.
How does Dynamic Recompilation work in principle?
Let's take a closer look at how a DynaRec works, in a simplified form. First of all we need to define a function pointer to call the translated code. The function should return some state so that we know whether the code block terminated correctly, an error occurred, or the code sequence was interrupted. We also need to pass the current context (register values, status flags, and the like) to the translated code, thus the function pointer could look like this:
int (*dyncode)(Context *CTX);
The simplified main loop of the DynaRec (often called dispatcher) might be written as:
address = block_translated(CTX->PC);
if (address == NULL)
    address = translate_block(CTX->PC);
dyncode = (int (*)(Context *)) address;
status = (*dyncode)(CTX);
/* handle interrupts and other events here */
Only those parts of the application which are really executed will be translated, as can be seen in the following part of the loop where the called code is translated if it hasn't been done already:
if (address == NULL)
    address = translate_block(CTX->PC);
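To make the loop above concrete, here is a self-contained C sketch of the dispatcher. It is a toy under obvious assumptions: ordinary C functions (`block_0`, `block_4`) stand in for generated machine code, the "translation cache" is a small array indexed by word address, and the `Context` holds only a program counter and one register; the names `run`, `block_0`, and `block_4` are made up for this illustration.

```c
#include <stddef.h>

/* Minimal context for this sketch: a real one holds the full register
 * set, status flags, and so on. */
typedef struct { unsigned PC; int R0; } Context;
typedef int (*DynCode)(Context *);

/* Stand-ins for translated blocks; a real DynaRec emits machine code. */
static int block_0(Context *ctx) { ctx->R0 += 1; ctx->PC = 4; return 0; }
static int block_4(Context *ctx) { ctx->R0 *= 2; ctx->PC = 8; return 1; /* stop */ }

/* Tiny translation cache, directly indexed by word address. */
static DynCode cache[16];

static DynCode block_translated(unsigned pc) { return cache[pc / 4]; }

static DynCode translate_block(unsigned pc)
{
    /* Fake "translation": just pick the matching stand-in function. */
    DynCode code = (pc == 0) ? block_0 : block_4;
    cache[pc / 4] = code;
    return code;
}

/* The dispatcher loop from the text. */
int run(Context *ctx)
{
    int status = 0;
    while (status == 0) {
        DynCode dyncode = block_translated(ctx->PC);
        if (dyncode == NULL)
            dyncode = translate_block(ctx->PC);
        status = (*dyncode)(ctx);
        /* handle interrupts and other events here */
    }
    return status;
}
```

Starting with PC = 0 and R0 = 0, the loop translates and runs block_0 (R0 becomes 1), then block_4 (R0 becomes 2), whose non-zero status ends the loop; the second time these blocks are reached they would come straight from the cache.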
Basic Blocks and Translation Units
The question now is how the binary translator knows which parts to translate if it doesn't translate the whole code. A new translation task is started when the code sequence to be executed changes, i.e. when a branch/jump/call occurs, and the target hasn't been translated before. The branch target thus acts as the start of the new code sequence and the next branch as its terminator. In compiler theory such a code sequence is called a basic block. Not all experts agree on the same definition of a basic block, but most think that such a block should be terminated by any instruction which can interrupt sequential execution, i.e. a conditional branch also ends a basic block.
For dynamic recompilation it's better to work with larger translation units, which are only terminated by unconditional jumps. The greatest strength of a DynaRec lies in emulating loops, since translating code which is only executed once or twice provides no speed increase. This strength can be exploited even more when using translation units instead of standard basic blocks, because many loops can then be handled inside the generated code without the need to go through the dispatcher on every iteration. But this method has to be combined with decent timing code to stay in sync with the rest of the emulated hardware.
The small statement translate_block(CTX->PC) hides the most complicated part of the DynaRec, which translates a block of code starting at the current program counter to native machine code.
The translation itself depends so strongly on the source and target processors involved that only a few examples will be presented throughout this FAQ. Different methods of generating the translated code are presented further down.
It wouldn't provide any speed-up in comparison to an interpretive emulator if the same block had to be translated over and over again. Therefore translations are stored in a part of the memory which is called the translation cache.
Every time a different block of code is about to be called, the dispatcher checks whether this block has been translated already, which is performed by the following command:
address = block_translated(CTX->PC);
How such fast look-ups of blocks in the translation cache can be done is covered later.
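As a foretaste, one common scheme is a direct-mapped hash table keyed on the source PC. The sketch below is an assumption-laden illustration, not taken from any particular emulator: the table size, the hash, and the names `lookup_block`/`store_block` are arbitrary choices, and a hash collision simply evicts the previous entry.

```c
#include <stddef.h>
#include <stdint.h>

#define TBITS 12
#define TSIZE (1u << TBITS)

typedef struct {
    uint32_t pc;    /* source address this entry maps (the tag) */
    void    *code;  /* pointer to the translated block, NULL if empty */
} TransEntry;

static TransEntry tmap[TSIZE];

/* Hash: drop the low alignment bits, mask to the table size. */
static unsigned thash(uint32_t pc) { return (pc >> 2) & (TSIZE - 1); }

void *lookup_block(uint32_t pc)
{
    TransEntry *e = &tmap[thash(pc)];
    return (e->code != NULL && e->pc == pc) ? e->code : NULL;
}

void store_block(uint32_t pc, void *code)
{
    TransEntry *e = &tmap[thash(pc)];   /* a collision evicts the old entry */
    e->pc = pc;
    e->code = code;
}
```

The tag check is what makes collisions safe: a look-up only hits when the stored PC matches exactly, otherwise the block is (re)translated.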
What other translation methods are possible?
Apart from dynamic recompilation there are other possible translation methods, although the differences between these methods are not as strict as they might first seem.
Static Binary Translation
The difference between dynamic and static binary translation is that the latter is not performed during runtime. Sometimes this implies that the whole source binary is translated to an executable target binary, but that isn't necessary for a static translator since its main feature is that no translation is done during runtime.
The main advantage of a static binary translator is therefore that it has more time to gather profiling information and produce better-optimised code. Unlike a DynaRec, a static translator can exploit some of the classical compiler optimisation techniques.
The obvious disadvantage of static binary translation is its inability to cope with self-modifying code, often combined with the impossibility of translating indirect addressing. Because of this, fallback interpreters are sometimes included, which handle such cases with standard interpretive emulation.
Static binary translation can be useful when porting "clean" applications to a platform with the same or a very similar operating system, so that apart from translating the processor code only system calls have to be adjusted. In that case a good translator can provide performance similar to natively compiled code. But as soon as self-modifying code or proprietary hardware is involved, dynamic recompilation is the better choice.
Threaded Interpretation or Decode Caching
While not being real translation, decode caching shares the problem of self-modifying code with DynaRecs. The idea of this approach is to save decoding time by storing pre-decoded structures for each instruction, which can be interpreted slightly faster.
The method turns into a threaded interpreter when the decoded structures don't use synthetic opcodes but pointers to the simulation routines instead. This leads to a fast interpreter, but it's still slower than a DynaRec. The only advantage of this approach is that it remains portable when a high-level language is used to program the interpretive emulator.
Decoding the Binary Source
The first thing to do when a new block of code is encountered is to identify its instructions by decoding them. This task directly matches the first stages of a processor's pipeline.
First the instructions have to be fetched, where a distinction between two types of instruction encoding has to be made:
- Most RISC processors have fixed-length instructions, i.e. all instructions are encoded in the same number of bits, normally 32 or 16. In this simple case the appropriate number of bytes is fetched, the program counter is increased, and you can continue to decode the whole instruction.
- For processors with a variable instruction length (almost all CISCs and a few RISCs) it's a bit more complicated, because the fetch and decode steps have to be linked more closely. First the number of bytes of the smallest possible instruction (typically 1 or 2 bytes) is fetched. This data is partly decoded to identify the appropriate routine for that opcode, which then fetches any additional data. It gets even more complicated when instructions can have pre- or post-bytes, as is the case with x86 instructions.
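The interlocked fetch/decode for variable-length instructions can be sketched in C. The ISA below is invented for the example (the top two bits of the first byte encode the total length, 1 to 4 bytes); real CISCs like the x86 need prefix handling and opcode tables on top of this, so treat it as a shape, not a recipe.

```c
#include <stdint.h>

typedef struct {
    const uint8_t *mem;  /* emulated memory */
    uint32_t pc;         /* emulated program counter */
} Fetcher;

/* Fetch one instruction of the hypothetical ISA into buf, advance the
 * PC past it, and return its length in bytes. The partial decode of the
 * first byte is what tells us how much more to fetch. */
unsigned fetch_instruction(Fetcher *f, uint8_t buf[4])
{
    uint8_t first = f->mem[f->pc];
    unsigned len = (first >> 6) + 1;     /* top two bits encode the length */
    for (unsigned i = 0; i < len; i++)
        buf[i] = f->mem[f->pc + i];
    f->pc += len;                        /* PC now points past the instruction */
    return len;
}
```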
Instruction decoding in a DynaRec is done in the same way as in a disassembler or interpretive emulator, which is why I won't go into too much detail here.
The typical approach is to use switch-case statements to identify the opcodes and optional subclasses, as shown in the following excerpt from a MIPS disassembler. For processors with a large number of opcodes it's quite common to use a jump table instead.
switch (instruction >> 26) {          /* get opcode */
case 0:                               /* SPECIAL */
    switch (instruction & 63) {       /* get function code */
    case 0:                           /* shift left logical */
        ...
        break;
    }
    break;
case 3:                               /* jump and link */
    ...
    break;
}
One more hint on decoding, as this is a common error: since the program counter is increased after every fetch, it actually points to the instruction following the one currently being decoded, or even further ahead when the effect of the pipeline is not compensated. Therefore a branch with an offset of zero doesn't jump to itself but to the succeeding instruction.
- With a fixed instruction length this behaviour can be easily obtained by increasing the program counter right after the fetch. But I've seen many interpretive emulators which increased the PC after execution of the instruction, with the exception of PC-relative instructions, which not only leads to larger code but also to possible errors.
- With variable-length instructions it's fastest to adjust the PC right before the full decode of the instruction, since you cannot know in the beginning how many bytes you'll have, and you surely don't want to increase the PC several times.
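The rule above can be checked with a few lines of C for a MIPS-style branch, whose 16-bit offset is a word count relative to the already-incremented PC (a sketch; delay-slot handling is deliberately omitted):

```c
#include <stdint.h>

/* Compute a MIPS-style branch target. pc_next is the program counter
 * AFTER the branch has been fetched, i.e. address of the branch + 4. */
uint32_t branch_target(uint32_t pc_next, uint32_t instruction)
{
    int32_t offset = (int16_t)(instruction & 0xFFFF); /* sign-extend */
    return pc_next + ((uint32_t)offset << 2);         /* word offset */
}
```

A branch at 0x1000 with offset zero therefore targets 0x1004, the succeeding instruction; to branch to itself it would need an offset of -1.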
The main binary translation is so dependent on the two processor architectures involved that not much information can be provided here without getting too specific. The basic idea is to produce a code sequence for the target processor which performs about the same operations as the code block on the source processor. This goal can of course be reached in various ways.
Direct translation works on an instruction-by-instruction basis, i.e. one instruction is isolated and translated, then the next, and so on. With this method a DynaRec can be evolved out of an interpretive emulator relatively easily. But the disadvantage is that you end up with sub-optimal code, since the only possible optimisations are register allocation and peephole optimisation via a simple lookahead.
The better solution is to have some kind of intermediate representation which is passed from the decoder to the translator. In its easiest form this wouldn't be much more than the pre-decoded instructions stored in a linked list.
Such an intermediate representation helps to keep things modular and therefore makes the process more maintainable and also a bit more retargetable. But the big advantage is that you are working on pre-decoded data, which makes optimisation much easier since virtually unlimited lookahead is possible and you have all instructions in context, not just in isolation.
Virtual Intermediate Processor (VIP)
A very special case of an intermediate representation would be a virtual intermediate processor (VIP). In this case the intermediate format wouldn't just consist of the decoded source code, but would be the code of a specially designed instruction set just for this translation purpose.
Normally such a VIP is a simplification of the source processor, as is the case for the phetacode in ARMphetamine. But the ultimate goal would be a general VIP which can be used for any translation, which would enable portability of DynaRecs because only one translator would have to be implemented for each host system, namely the translator from VIP code to native code. After my research in processor architectures it seems almost impossible to define such a general VIP: either the VIP is too simplistic, which leads to awkward translations to this code, or the VIP is too complex, making it hard to port to a new platform. Only if the perfect compromise between these two extremes were found might programmers be willing to utilize a general VIP. But reaching that compromise is the main problem, as there are so many architectures with unique features that a compromise seems impossible.
As mentioned above, most translation examples would be too specific to be of much use, but there are some problems which occur very often, so these shall be covered here.
When translating a simple ADD instruction it's enough to use a simple addition on the target processor as well, right? Wrong! This might compute the correct main result, but the side effects on the two processors might be totally different. When you take a look at an interpretive emulation of the 6502, e.g., you will notice that more time is spent calculating the condition flags than the actual result of an operation. If you cannot decide during optimisation that these flags are not needed, then you have to compute them in the translated code as well.
If you're lucky, the flags and their use are almost identical between source and target processors, as is the case with the Z80 and x86, where only the overflow flag has a different position. In many other cases you can count on the standard flags being used: Zero, Negative, Carry, oVerflow. But some processors have totally different flags (PowerPC) or no condition register at all (MIPS, Alpha).
The PowerPC has a condition register with 8 condition fields, of which the first two are set by integer and floating-point operations respectively, while the rest can only be set by compare instructions. Each condition field contains the following flags: Less Than, Greater Than, Equal, Summary Overflow. Only Summary Overflow should need an explanation: it is set on the first overflow and has to be cleared explicitly. There is also a normal Overflow flag in the special Fixed-Point Exception Register, which also contains the Carry flag. This means that the Carry cannot be tested directly (although it can be used by "extended" calculation instructions) and that the Summary Overflow has to be cleared after each instruction which modifies it when it is to be tested for individual instructions.
Those architectures without a condition register normally have special branch instructions which include either a comparison of two registers or a comparison of one register with zero. There are also instructions which compare two registers and set a third register to 1 or 0 according to the result; this third register can then be compared with zero. Deciding whether the result is Zero or Negative is very easy; the Carry is a bit more complicated and can be computed by an unsigned comparison (see below). For overflow detection such processors usually raise an exception, which makes that case hard to handle in emulation.
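The unsigned-comparison trick for the carry, and the explicit checks for the other flags, can be written out in C. This is a sketch of what the generated code has to compute, not code from any real DynaRec; the `Flags` struct and function name are made up:

```c
#include <stdint.h>

typedef struct { int Z, N, C, V; } Flags;

/* Emulated ADD that derives the four standard flags the way a target
 * without a condition register has to compute them explicitly. */
uint32_t add_with_flags(uint32_t a, uint32_t b, Flags *f)
{
    uint32_t r = a + b;
    f->C = (r < a);                           /* carry: result wrapped around */
    f->Z = (r == 0);                          /* zero */
    f->N = (r >> 31) & 1;                     /* negative: sign bit set */
    f->V = ((~(a ^ b) & (a ^ r)) >> 31) & 1;  /* signed overflow */
    return r;
}
```

The carry line is exactly what the sltiu/sltu instructions in the MIPS columns below perform: after the add, the result is compared unsigned against one operand.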
As long as the condition flags are not referenced directly, the best solution is to translate test-and-branch sequences to corresponding ones, i.e. a sequence of a compare and a conditional branch should be translated to a compare and a conditional branch on the target machine, if possible.
The following shows how an extended addition programmed in x86 code could be translated for some RISC architectures. Please note that only the four standard flags are calculated, since the parity and half-carry flags would make this example even more complicated.
|IA-32        ||as ARM          ||as PowerPC        ||as MIPS         |
|add eax, 7   ||adds Ra, Ra, #7 ||li Rt, 7          ||addi Ra, Ra, 7  |
|             ||                ||addco. Ra, Ra, Rt ||sltiu RC, Ra, 7 |
|             ||                ||                  ||sltiu RZ, Ra, 1 |
|             ||                ||                  ||slt RN, Ra, R0  |
|adc ebx, edx ||adcs Rb, Rb, Rd ||addeo. Rb, Rb, Rd ||add Rb, Rb, RC  |
|             ||                ||                  ||add Rb, Rb, Rd  |
|             ||                ||                  ||sltu RC, Rb, Rd |
|             ||                ||                  ||sltiu RZ, Rb, 1 |
|             ||                ||                  ||slt RN, Rb, R0  |
- The main purpose of these two instructions is to use a probable carry from the first addition in the second one. Ideally only the carry is needed, which is reflected in the following comments. It is also assumed that the flags from the second addition aren't needed anymore.
- The attached 's' in the ARM instructions simply tells the processor to set the carry, overflow, zero, and negative flags. Apart from dropping the 's' in the second instruction no further reduction is possible.
- Not all PowerPC instructions can generate all flags, e.g. there is no immediate add which can compute overflows. Since no overflows or flags other than the carry are needed, the li can be dropped and the simpler addic and adde used instead.
- The MIPS has the hardest part because it has no flags at all and has to compute them explicitly if they are needed. Since no overflow information (which would only throw an exception anyway) is needed, addu can be used instead of add. Also, only the first carry is needed, so all other flag instructions can be dropped.
|IA-32        ||as ARM (opt)    ||as PowerPC (opt) ||as MIPS (opt)   |
|add eax, 7   ||adds Ra, Ra, #7 ||addic Ra, Ra, 7  ||addiu Ra, Ra, 7 |
|             ||                ||                 ||sltiu RC, Ra, 7 |
|adc ebx, edx ||adc Rb, Rb, Rd  ||adde Rb, Rb, Rd  ||addu Rb, Rb, RC |
|             ||                ||                 ||addu Rb, Rb, Rd |
The Zero Register
Many RISC processors have a so-called zero register which always yields the value zero. This can be a real nuisance to translate on processors which don't have a zero register or where it doesn't behave as orthogonally, e.g. R0 on the PowerPC is only zero in certain addressing modes but otherwise works like a normal GPR, and the Itanium's zero register always reads zero but throws an exception when it is used as a destination register. SPARC uses its zero register that way quite often because it doesn't have a compare instruction and "simulates" one with a subtraction which only sets the condition codes and throws away the result:
subcc Rs1, Rs2, R0
There are two different ways to solve that problem. The first is to have a register reserved for that purpose which is set to zero again after every instruction which modifies it. Interpretive emulators normally work that way, but I wouldn't want to waste a register for that purpose.
The other way is to have a special case for each combination of zero registers, of which there can be up to five in the worst case, as shown in this example:
|MIPS instruction   ||as x86      ||as ARM         |
|sub zero, Rs, Rt   ||-           ||-              |
|sub Rd, zero, zero ||xor Rd, Rd  ||mov Rd, #0     |
|sub Rd, zero, Rt   ||mov Rd, Rt  ||rsb Rd, Rt, #0 |
|                   ||neg Rd      ||               |
|sub Rd, Rs, zero   ||mov Rd, Rs  ||mov Rd, Rs     |
|sub Rd, Rs, Rt     ||mov Rd, Rs  ||sub Rd, Rs, Rt |
|                   ||sub Rd, Rt  ||               |
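A translator can switch on the operand pattern before emitting anything. The sketch below only classifies a MIPS sub by its use of register number 0 (the zero register); the enum names and the function are made up for this illustration, and a real translator would emit target code for each case instead of returning a tag.

```c
/* Classify 'sub rd, rs, rt' by its use of the zero register ($0) so the
 * translator can emit the simpler target sequence from the table above. */
typedef enum {
    EMIT_NOTHING,  /* result discarded          */
    EMIT_CLEAR,    /* rd = 0                    */
    EMIT_NEGATE,   /* rd = -rt                  */
    EMIT_MOVE,     /* rd = rs                   */
    EMIT_SUB       /* the general rd = rs - rt  */
} SubKind;

SubKind classify_sub(unsigned rd, unsigned rs, unsigned rt)
{
    if (rd == 0)            return EMIT_NOTHING;
    if (rs == 0 && rt == 0) return EMIT_CLEAR;
    if (rs == 0)            return EMIT_NEGATE;
    if (rt == 0)            return EMIT_MOVE;
    return EMIT_SUB;
}
```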
Target Code Generation
Preassembled Code Templates
Dynamic Code Generation
Static Register Allocation
Dynamic Register Allocation
Prologue and Epilogue
The Master Block
How to map translated code blocks?
Translation Lookaside-Buffer (TLB)
Translation Map Array (TransMap)
UltraHLE's dirty trick
The problem with self-modifying code
Why is self-modifying code a problem?
How to detect self-modifying code?
The Checksum Approach
The Dirty Pages Approach
How to solve the problem with self-modifying code?
What optimisations can be performed?
Redundant Flag Calculation Elimination
|MIPS code          ||literal IA-32      ||optimised IA-32    |
|lui at, 0x1234     ||mov eax, 12340000h ||mov eax, 12345678h |
|ori at, at, 0x5678 ||or eax, 00005678h  ||                   |
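A peephole pass over pre-decoded MIPS words can spot such lui/ori pairs. The sketch below uses the standard MIPS I encodings (lui opcode 0x0F, ori opcode 0x0D); the function name is made up, and a real pass would also check that no branch targets the second instruction:

```c
#include <stdint.h>

/* Fold 'lui rt, hi' followed by 'ori rt, rt, lo' into one 32-bit
 * constant; returns 1 and stores the constant if the pair matches. */
int fold_lui_ori(uint32_t i1, uint32_t i2, uint32_t *imm)
{
    unsigned op1 = i1 >> 26;            /* lui = 0x0F */
    unsigned op2 = i2 >> 26;            /* ori = 0x0D */
    unsigned rt1 = (i1 >> 16) & 31;     /* destination of the lui */
    unsigned rs2 = (i2 >> 21) & 31;     /* source of the ori      */
    unsigned rt2 = (i2 >> 16) & 31;     /* destination of the ori */
    if (op1 != 0x0F || op2 != 0x0D || rt1 != rs2 || rs2 != rt2)
        return 0;
    *imm = (i1 << 16) | (i2 & 0xFFFF);  /* hi halfword : lo halfword */
    return 1;
}
```

For the pair from the table (lui at, 0x1234; ori at, at, 0x5678, with at being register $1) this yields 0x12345678, which on IA-32 becomes the single flag-free mov.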
How to synchronize the code?
Sloppy Block Timing
Instruction Cycle Timing
What documents/whitepapers/articles about DynaRecs are available?
- DR Emulator
- This short article about the dynamically recompiling 68K emulator for PowerMacs is a good starting point, as it offers a good description of the difference between an interpreter and a DynaRec. It even covers topics like self-modifying code.
- The whitepaper about Syn68k, the 68K emulator used in ARDI's Executor, became something of a standard reference. It is quite good, no doubt, but hard for a beginner to understand, and the code examples presented without explanation are rather uninspiring.
- The document about ARMphetamine doesn't cover all the topics needed for a decent DynaRec, but the most important ones in great detail.
- Although Generator generates binary translations only for instructions, the documentation isn't bad and it shows how the interaction between a threaded interpreter and a DynaRec can be implemented.
- Shade is not an emulator but a simulator for tracing code, but many of its techniques and terms were adopted for dynamic recompilation, which makes it quite interesting. To understand the few code examples, slight knowledge of SPARC assembly comes in handy.
- Embra was influenced by Shade, but simulates a MIPS machine at system level, including MMU and cache simulation.
Documents on Related Topics
- FX!32 [PDF]
- FX!32 is likely one of the most successful static binary translators. This is due to the fact that new portions of an application are interpreted first, and the binary translator uses the gathered profiling information to generate highly optimised code for "visited" translation units. The whole task is slightly simplified by translating x86-NT binaries on an Alpha-NT platform.
- VCODE is a method created by Dawson Engler to quickly generate dynamic code.
Are there any Open Source DynaRecs?
Yes, all those known to me are listed in the following table:
||System           ||Source CPU  ||Target CPU   ||Host OS            ||Comments                      |
||Apple II         ||6502        ||MIPS, SPARC  ||Unix               ||                              |
||few RISC OS SWIs ||ARM         ||x86          ||Linux              ||                              |
||Playstation      ||MIPS R3000A ||x86          ||DOS                ||unfinished and discontinued   |
||Playstation      ||MIPS R3000A ||Alpha, x86   ||Linux              ||discontinued                  |
||Playstation      ||MIPS R3000A ||x86, PowerPC ||Win32/Linux, Amiga ||last open source version: 0.8 |
||Nintendo 64      ||MIPS R4300i ||x86          ||Win32              ||                              |
||Nintendo 64      ||MIPS R4300i ||PowerPC      ||MacOS              ||                              |
This document would have been different, if not totally impossible, without the help of the following people:
- Neil Bradley, who should be well-known for his contributions to the emulation scene (various CPU cores), and provides us with dynarec.com.
- Victor Moya del Barrio, who continually impresses with new ideas and catchy terms.
- Neil Griffiths, who especially enjoys discussions about the prefix "re" and is the webmaster of dynarec.com.
- Andrew Davidson, author of the "lazy" , and the first one who called for a document like this.
- Julian Brown, who wrote an interesting paper about ARMphetamine.
- David Sharp, my personal proof reader.
- Last but not least, all authors of open source DynaRecs or documentations of