Last modified: 2001-08-30


Dynamic Recompilation - Frequently Asked Questions

by Michael König (aka M.I.K.e)

The name DRFAQ isn't quite correct since I deviated from the typical question-answer structure, but as long as no real FAQ is compiled this page shall answer most of the questions about Dynamic Recompilation.
Maybe I should rename it to "Everything you have always wanted to know about Dynamic Recompilation but were afraid to ask" someday...

"DynaRec" and other Terms

More and more of the newer emulators replace their interpretive cores with a DynaRec. So what does this term mean? First of all, DynaRec is an abbreviation of Dynamic Recompilation or Dynamic Recompiler. The term recompilation is a bit confusing since it is somewhat ambiguous and doesn't exactly name the process performed. The problem is that you can take some source code and "recompile" it on a different platform. But dynamic recompilation means that you take the binary code (the executable file, e.g.) and translate it into native machine code on another machine during runtime. Therefore I prefer the term Dynamic Binary Translation, but DynaRec is already well-established.

Some people even distinguish between dynamic compilation and recompilation, where the latter handles self-modifying code and the former does not. But most people use these terms synonymously and don't imply whether self-modifying code is handled or not.

Given the definition above it becomes obvious that a DynaRec is indeed similar to the just-in-time (JIT) compilers used in some Java runtime environments. The difference is that a DynaRec has to deal with low level machine code, probably full of dirty programming tricks, rather than more high-level bytecode produced by a compiler.

How does Dynamic Recompilation work in principle?

Let's take a closer look at how a DynaRec works in a simplified form. First of all we need to define a function pointer to call the translated code. The function should return some state so that we know if the code block terminated correctly, if an error occured, or if the code sequence was interrupted. We also need to pass the current context (register values, status flags, and the like) to the translated code, thus the function pointer could look like this:
	int (*dyncode)(Context *CTX);
The simplified main loop of the DynaRec (often called dispatcher) might be written as:
	for (;;) {
	  address = block_translated(CTX->PC);
	  if (address == NULL)
	    address = translate_block(CTX->PC);
	  dyncode = (int (*)(Context*)) address;
	  status = (*dyncode)(CTX);
	  /* handle interrupts and other events here */
	}

On-Demand Translation

Only those parts of the application which are really executed will be translated, as can be seen in the following part of the loop where the called code is translated if it hasn't been done already:
	if (address == NULL)                                        
	  address = translate_block(CTX->PC);

Basic Blocks and Translation Units

The question now is how the binary translator knows which parts to translate if it doesn't translate the whole code. A new translation task is started whenever control is transferred to a code sequence that hasn't been translated before, i.e. when a branch/jump/call occurs. The branch target thus acts as the start of the new code sequence and the next branch as its terminator. In compiler theory such a code sequence is called a basic block. Not all experts agree on the same definition of a basic block, but most think that such a block should be terminated by any instruction which can interrupt sequential execution, i.e. a conditional branch also ends a basic block.

For dynamic recompilation it's better to work with larger translation units, which are only terminated by unconditional jumps. The greatest strength of a DynaRec lies in emulating loops, since translating code which is only executed once or twice provides no speed increase. This strength can be exploited even more when using translation units instead of the standard basic blocks, since many loops can then be handled inside the generated code without the need to go through the dispatcher again and again on every run. But this method has to be combined with decent timing code to stay in sync with the rest of the emulated hardware.

Code Generation

The small statement translate_block(CTX->PC); hides the most complicated part of the DynaRec, which translates a block of code starting at the current program counter to native machine code.

The translation itself depends so strongly on the source and target processors involved that only a few examples will be presented throughout this FAQ. Different methods of eventually generating the translated code are presented further down.

Translation Cache

It wouldn't provide any speed-up in comparison to an interpretive emulator if the same block had to be translated over and over again. Therefore translations are stored in a part of the memory which is called the translation cache.

Every time a different block of code is about to be called, the dispatcher checks whether this block has been translated already, which is performed by the following command:
	address = block_translated(CTX->PC);
How such fast look-ups of blocks in the translation cache can be done is covered later.

What other translation methods are possible?

Apart from dynamic recompilation there are also other possible translation methods, although the differences between these methods might not be as strict as it first seems.

Static Binary Translation

The difference between dynamic and static binary translation is that the latter is not performed during runtime. Sometimes this implies that the whole source binary is translated to an executable target binary, but that isn't necessary for a static translator since its main feature is that no translation is done during runtime.

The main advantage of a static binary translator therefore is that it has more time to gather profiling information and produce better optimised code. Unlike a DynaRec a static one can exploit some of the classical compiler optimisation techniques.

The obvious disadvantage of static binary translation is its inability to cope with self-modifying code, which is often combined with the impossibility of translating indirect addressing. Due to this, fallback interpreters are sometimes included, which handle such cases with standard interpretive emulation.

Static binary translation can be useful when porting "clean" applications to a platform with the same or a very similar operating system, so that apart from translating the processor code only system calls have to be adjusted. In that case a good translator can provide performance similar to natively compiled code. But as soon as self-modifying code or proprietary hardware is involved, dynamic recompilation is the better choice.

Threaded Interpretation or Decode Caching

While not being a real translation method, decode caching does share the problem of self-modifying code with DynaRecs. The idea of this approach is to save decode time by storing pre-decoded structures for each instruction, which can be interpreted slightly faster.

The method turns into a threaded interpreter when the decoded structures don't use synthetic opcodes but pointers to the simulation routines instead. This leads to a fast interpreter, but it's still slower than a DynaRec. The only advantage of this approach is that it is portable when a high-level language is used to program the interpretive emulator.

Decoding the Binary Source

The first thing to do when a new block of code is encountered is to identify its instructions by decoding them. This task directly matches the first stages of a processor's pipeline.

Instruction Fetch

At first the instructions have to be fetched, whereby a distinction between two types of instruction encoding has to be made:
  1. Most RISC processors have fixed-length instructions, i.e. all instructions are encoded in the same number of bits, normally 32 or 16.
    In this simple case the appropriate number of bytes is fetched, the program counter increased, and you can continue to decode the whole instruction.
  2. For processors with a variable instruction length (almost all CISCs and a few RISCs) it's a bit more complicated, because the fetch and decode steps have to be linked more closely.
    First the number of bytes of the smallest possible instruction (typically 1 or 2 bytes) is fetched. This data is partly decoded to identify the appropriate routine for that opcode, which then fetches any additional data.
    It gets even more complicated when instructions can have pre- or post-bytes, as is the case with x86 instructions.

Instruction Decode

Instruction decoding in a DynaRec is done in the same way as in a disassembler or interpretive emulator, which is why I won't go into too much detail here.

The typical approach is to use switch-case statements to identify the opcodes and optional subclasses, as shown in the following excerpt from a MIPS disassembler. For processors with a large number of opcodes it's quite common to use a jump table instead.

	switch (instruction >> 26) {      /* get opcode */
	case 0:                           /* special */
	  switch (instruction & 63) {     /* get function code */
	  case 0:                         /* shift left logical */
	    ...
	  }
	  break;
	case 3:                           /* jump and link */
	  ...
	}
One more hint on decoding, as this is a commonly made error: since the program counter is increased after every fetch, it actually points to the instruction following the one currently being decoded, or even further down when the effect of the pipeline is not compensated. Therefore a branch with an offset of zero doesn't jump to itself but to the succeeding instruction.

Binary Translation

The main binary translation is so dependent on the two processor architectures involved that not much information can be provided here without getting too specific. The basic idea is to produce a code sequence for the target processor which performs roughly the same operations as the code block on the source processor would. Surely this goal can be reached in various ways.

Direct Translation

Direct translation works on an instruction-by-instruction basis, i.e. one instruction is isolated and translated, then the next, and so on. With this method a DynaRec can be evolved out of an interpretive emulator relatively easily. But the disadvantage is that you end up with sub-optimal code, since the only possible optimisations are register allocation and peephole optimisation via a simple lookahead.

Intermediate Representation

The better solution is to have some kind of intermediate representation which is passed from the decoder to the translator. In its easiest form this wouldn't be much more than the pre-decoded instructions stored in a linked list.

Such an intermediate representation helps to keep things modular and therefore makes the process more maintainable and also a bit more retargetable. But the big advantage is that you are working on pre-decoded data, which makes optimisation much easier since virtually unlimited lookahead is possible and you have all instructions in context, not just in isolation.

Virtual Intermediate Processor (VIP)

A very special case of an intermediate representation would be a virtual intermediate processor (VIP). In this case the intermediate format wouldn't just consist of the decoded source code, but would be the code of a specially designed instruction set just for this translation purpose.

Normally such a VIP is a simplification of the source processor, as is the case for phetacode in ARMphetamine. But the ultimate goal would be to have a general VIP which could be used for any translation and would thereby enable portable DynaRecs, because only one translator would have to be implemented for each host system, namely the translator from VIP code to native code. After my research in processor architectures it seems almost impossible to define such a general VIP. Either the VIP will be too simplistic, which leads to awkward translations to this code; or the VIP will be too complex, making it hard to port to a new platform. Only if the perfect compromise between these two extremes were found might programmers be willing to utilise a general VIP. But reaching that compromise is the main problem, as there are so many architectures with unique features that a compromise seems impossible.

Exemplary Problems

As mentioned above, most translation examples would be too specific to be of much use, but there are some problems which occur very often, so these shall be covered here.

Condition Flags

When translating a simple ADD instruction it's enough to use a simple addition on the target processor as well, right? Wrong! This might compute the correct main result, but the side effects on the two processors might be totally different. When you take a look at an interpretive emulation of the 6502, e.g., you will notice that more time is spent calculating the condition flags than the actual result of an operation. If you cannot determine during optimisation that these flags are not needed, then you have to compute them in the translated code as well.

If you're lucky, the flags and their use are almost identical between source and target processors, as is the case with Z80 and x86, where only the overflow flag has a different position. In many other cases you can count on the standard flags being used: Zero, Negative, Carry, oVerflow. But some processors have totally different flags (PowerPC) or no condition register at all (MIPS, Alpha).

The PowerPC has a condition register with 8 condition fields, of which the first two are set by integer and floating-point operations respectively, while the rest can only be set by compare instructions. Each condition field contains the following flags: Less Than, Greater Than, Equal, Summary Overflow. I think only Summary Overflow has to be explained: it is set on the first overflow and has to be cleared explicitly. There is also a normal Overflow flag in the special Fixed-Point Exception Register, which also contains the Carry flag. This means that the Carry cannot be tested directly (although it can be used by "extended" calculation instructions), and that the Summary Overflow has to be cleared after each instruction which modifies it when it is to be tested for individual instructions.

Those architectures without a condition register normally have special branch instructions which include either a comparison of two registers or a comparison of one register with zero. There are also instructions which compare two registers and set a third register to 1 or 0 according to the result. This third register can then be compared with zero. Deciding whether the result is Zero or Negative is very easy; the Carry is a bit more complicated and can be computed by an unsigned comparison (see below). On overflow detection such processors tend to raise an exception, which makes it hard to handle that case in emulation.

As long as the condition flags are not referenced directly, the best solution is to translate test-effect sequences to corresponding ones. Therefore a sequence of a compare and a conditional branch should be translated to a compare and a conditional branch on the target machine if possible.

The following shows how an extended addition programmed in x86 code would have to be translated on some RISC architectures. Please note that only the four standard flags are calculated, since the parity and half-carry flags would make this example even more complicated.

	IA-32           ARM                PowerPC             MIPS
	add eax, 7      adds Ra, Ra, #7    li     Rt, 7        addi  Ra, Ra, 7
	                                   addco. Ra, Ra, Rt   sltiu RC, Ra, 7
	                                                       sltiu RZ, Ra, 1
	                                                       slt   RN, Ra, R0
	adc ebx, edx    adcs Rb, Rb, Rd    addeo. Rb, Rb, Rd   add   Rb, Rb, RC
	                                                       add   Rb, Rb, Rd
	                                                       sltu  RC, Rb, Rd
	                                                       sltiu RZ, Rb, 1
	                                                       slt   RN, Rb, R0

The main purpose of these two instructions is to use a probable carry from the first addition in the second one. Ideally only the carry is needed, which is reflected in the optimisations below. It is also assumed that the flags from the second addition aren't needed anymore.
The 's' attached to the ARM instructions simply tells the processor to set the carry, overflow, zero, and negative flags. Apart from dropping the 's' in the second instruction no further reduction is possible.
Not all PowerPC instructions can generate all flags; e.g. there is no immediate add which can compute overflows. Since no overflows or flags other than the carry are needed, the li can be dropped and the simpler addic and adde used instead.
The MIPS has the hardest part because it has no flags at all and has to compute them explicitly if they are needed. Since no overflow information (which would only throw an exception anyway) is needed, addu can be used instead of add. Also only the first carry is needed, so all other flag instructions can be dropped.

	IA-32           ARM (opt)          PowerPC (opt)       MIPS (opt)
	add eax, 7      adds Ra, Ra, #7    addic Ra, Ra, 7     addiu Ra, Ra, 7
	                                                       sltiu RC, Ra, 7
	adc ebx, edx    adc  Rb, Rb, Rd    adde  Rb, Rb, Rd    addu  Rb, Rb, RC
	                                                       addu  Rb, Rb, Rd
Carry vs Borrow

Zero Registers

Many RISC processors have a so-called zero register which always yields the value zero. This can be a real nuisance to translate on processors which don't have a zero register or where it doesn't behave as orthogonally; e.g. R0 on PowerPC is only zero in certain addressing modes but otherwise works like a normal GPR, and Itanium's r0 always reads as zero but raises an exception when used as a destination register. SPARC uses its zero register that way quite often because it doesn't have a compare instruction and "simulates" one with a subtraction which only sets the condition codes and throws away the result:
	subcc Rs1, Rs2, R0
There are two different ways to solve that problem. The first is to have a register reserved for that purpose which is set to zero again after every instruction which modifies it. Interpretive emulators normally work that way, but I wouldn't want to waste a register for that purpose.

The other way is to have a special case for each combination of zero registers, which can be up to five in the worst case, as shown in this example:

	MIPS instruction      x86             ARM
	sub zero, Rs, Rt      nop             nop
	sub Rd, zero, zero    xor Rd, Rd      mov Rd, #0
	sub Rd, zero, Rt      mov Rd, Rt      rsb Rd, Rt, #0
	                      neg Rd
	sub Rd, Rs, zero      mov Rd, Rs      mov Rd, Rs
	sub Rd, Rs, Rt        mov Rd, Rs      sub Rd, Rs, Rt
	                      sub Rd, Rt

Memory Access


Target Code Generation


Preassembled Code Templates

Dynamic Code Generation

Threaded Code

Register Allocation

Static Register Allocation

Dynamic Register Allocation

Calling Conventions

Prologue and Epilogue

The Master Block

Glue Code

How to map translated code blocks?

Translation Lookaside-Buffer (TLB)

Translation Map Array (TransMap)

Paged TransMap

UltraHLE's dirty trick

The problem with self-modifying code

Why is self-modifying code a problem?

How to detect self-modifying code?

The Checksum Approach

The Dirty Pages Approach

System Knowledge

How to solve the problem with self-modifying code?

What optimisations can be performed?

Redundant Flag Calculation Elimination

Register Allocation

Peephole Optimisation

	MIPS code             literal IA-32         optimised IA-32
	lui at, 0x1234        mov eax, 12340000h    mov eax, 12345678h
	ori at, at, 0x5678    or  eax, 00005678h

How to synchronize the code?

Sloppy Block Timing

Instruction Cycle Timing

What documents/whitepapers/articles about DynaRecs are available?

DR Emulator
This short article about the dynamically recompiling 68K emulator for PowerMacs is a good starting point as it offers a good description of the difference between an interpreter and a DynaRec. It even covers topics like self-modifying code.
The whitepaper about Syn68k, the 68K emulator used in ARDI's Executor, became something like a standard reference. It is quite good, no doubt, but hard to understand for a beginner, and the code examples presented without explanation are rather uninspiring.
The document about ARMphetamine doesn't cover all the topics needed for a decent DynaRec, but the most important ones in great detail.
Although Generator generates binary translations only for move instructions, the documentation isn't bad, and it shows how the interaction between a threaded interpreter and a DynaRec can be implemented.
Shade is not an emulator but a simulator used to trace code, yet many of its techniques and terms were adopted for dynamic recompilation, which makes it quite interesting. To understand the few code examples, slight knowledge of SPARC assembly comes in handy.
Embra was influenced by Shade, but simulates a MIPS machine on system-level, including MMU and cache simulation.

Documents on Related Topics

FX!32 [PDF]
FX!32 is likely one of the most successful static binary translators. This is due to the fact that new portions of an application are interpreted first, and the binary translator then uses the gathered profiling information to generate highly optimised code for "visited" translation units. The whole task is slightly simplified by translating x86-NT binaries on an Alpha-NT platform.
VCODE is a method created by Dawson Engler to quickly generate dynamic code.

Are there any Open Source DynaRecs?

Yes, all those known to me are listed in the following table:
	Emulator      System            Source CPU   Target CPU    Host OS             Comments
	NEStra        NES               6502         x86           Linux
	UAE           Amiga             MC68000      x86           Unix
	ARMphetamine  few RISC OS SWIs  ARM          x86           Linux
	Titan         Sega Saturn       SH2          x86           Win32
	Pex           Playstation      MIPS R3000A  x86           DOS                 unfinished and discontinued
	Sope          Playstation      MIPS R3000A  Alpha, x86    Linux               discontinued
	FPSE          Playstation      MIPS R3000A  x86, PowerPC  Win32/Linux, Amiga  last open source version: 0.8
	1964          Nintendo 64       MIPS R4300i  x86           Win32
	sixtyforce    Nintendo 64       MIPS R4300i  PowerPC       MacOS


This document would have been different, if not totally impossible, without the help of the following people: