
A brief description of what a self-decompressing image looks like.
-----------------------------------------------------------------

RCC 7pm 27-Aug-87

The first word is an unconditional branch to the start of the 
decompression code. [NB different for AIF].

Then the encoded data of the file itself.  The encoding scheme is 
described later.

Then the encoding of the entries for the "shorts" table, a table
of up to 2048 4-byte words which will be encoded as the leading
nibble + one extra byte (I call this the short encoding).
See later for description of encoding of the table entries.

Then the encoding of the entries for the "longs" table, a table
of up to 1536 (= 6*256) 3-byte values.

Pad with up to 3 zeros to word-align.

A number of words in order thus:

0  Byte size of decoded image.
1  Byte size of encoded image.
2  Byte size of the encoding of the two tables [including padding].
3  Number of entries in "shorts" table.
4  Number of entries in "longs" table.
5  Byte size of the encoded data which must be copied elsewhere.

Then the decompression code itself.

Notes.
------

The last 2 numbers make it easy to decode as much as possible of the file
without copying it.


The encoding schemes.
---------------------

1. The main encoding scheme is that used to encode the body of the original.
   For space-efficient decoding, bytes are read in descending address order
   (i.e. "next byte" means *p--), so that the code can be interpreted in
   descending address order.
   Each word (4 bytes) is encoded in one of four ways, the encoding being
   specified by a 4 bit value (nibble) thus.

   nibble = 0      The word is zero

   nibble = 1      The word is given in full as the "next" four bytes,
                   low byte first. 

   nibble = 2..8   "long": The top three bytes of the word are obtained by
                   taking ((nibble - 2) << 8) | next byte as an index
                   into a table of up to 1.7K common values ("longs" table).
                   Then the low byte is given explicitly as the next byte.

   nibble = 9..15  "short": ((nibble-9) << 8) | next byte gives index into a
                   table of up to 1.7K common word values (the "shorts" table).
                  
   The nibbles describing the encoding of a pair of words (in ascending
   address order, e.g. first word at a, second at a+4) are packed together
   into a single byte, as first | (second << 4), so the sequence is thus:

   while (1) { /* bytes in descending address order */
     1-byte [first | (second << 4)]
     [0, 1, 2, or 4 bytes giving additional specification of first word]      
     [0, 1, 2, or 4 bytes giving additional specification of second word]      
   }

   The image will be padded with an arbitrary word to make it a multiple of
   8 bytes.


2. The "shorts" table is a table of up to 2K word values, in ascending order
   (of unsigned comparison, i.e. 0..FFFFFFFF).  It is encoded thus:

[This encoding was used up to version 0.14 - version 0.15 and later use
a more complicated encoding scheme, which spots sequences of consecutive
tabe entries, e.g. 2,3,4,5,... and encodes the number of entries <= 9 in
such a sequence as one byte, or encodes the delta in 0,1,2,or 3 more bytes.
Read the code for details.]

   For each entry j, j > 0, one byte specifying the encoding thus:
   
   byte = 0..125    table[j] = table[j-1] + byte
   byte = 126..254  table[j] = table[j-1] + ((byte-126) << 8) | nextbyte
   byte = 255       table[j] = table[j-1] + next 4 bytes, low byte first

   [We take table[-1] = 0 as the starting point].

3. The "longs" table is encoded almost the same as the shorts table,
   except that we encode the value of (table[j] >> 8), since the low bytes
   of all the entries are zero, and in the case byte = 255, we only have
   3 succeeding bytes, not 4.

I apologise for the confusion between ascending and descending byte-order.
In general I prefer low-byte first, but when we have to build an index
from 0..5 and another byte, the 0..5 must give the high-order bits.
[This confusion is even worse now bytes go downwards.  Hmmm..  Just have
to be careful].


The interface between compression and decompression code.
---------------------------------------------------------

The compression program includes a module which contains the decompression
code, and a little procedure to tell the compression program where to
find it.  The compressor guarantees to copy this code into the appropriate
place in the compressed image, and put a B to the start of it in the
first word of the compressed image.  I think that's all it needs.

[In 0.21 and later, this is complicated by the fact that the decompression
code includes a 3-word preamble to be executed only if the image is
AIF.]

Possible future changes.
------------------------

We could cope with images with load != exec by having the compressor
generate the last couple of instructions to jump to the start of the
decoded image (this is easier than trying to pass the exec address
dynamically).  Can keep it position-independent, as we have base of
image in R8, so just ADD PC, R8, #(exec-load).  [If exec-load is too
big, will need another instruction or two, but that's OK.]
[I've done this in version 0.14]

When the new image format arrives and the zero-initialised area is
dynamically initialised, there will be no advantage to the special
encoding of zero.  So eliminate this and give the 256 extra index values
to one of the tables [this also saves some decompression code and time].

Could allow the encoder to choose the division of the index space between
the two tables.  If we want to be really wacky, it could also choose
the number of explicit bits in the long encoding.  To make these choices,
encoder should first build a big word frequency table with everything
in it (sorted by frequency for choosing the shorts table, and by value
for building the longs frequency table) and munge this rather than scanning
the whole image several time.  [Actually, experiment suggests that 
fiddling with the long/short split makes only marginal difference.]

Could speed the compression up a great deal by not using streams,
tuning various bits of code, removing statistics stuff etc.
If you want it to go really fast, re-writing in C might be a good idea.
There are various algorithmic tweaks to be applied as well, eg not
sorting the things that don't make it into the table, using insertion
sort with binary chop to find position, fast movewords to shuffle.

Could speed up decoding literals if we get the byte-sex of the encoding right,
using the example for load arbitrarily-aligned word from the ARM CPU
manual (.. hmm, may not have quite enough registers).

Roger says it could handle Arthur load/exec filetype info more usefully.
[done in 0.15]

The table compression could be improved somewhat, particularly by
spotting sequences of +1's.
[done in 0.15]

I think there might be a significant win to be had by encoding a few 
longer sequences of instructions: for instance the procedure entry
sequence.  Unfortunately this contains a BL, which complicates matters
somewhat, but the rest could be encoded easily, e.g.

Mod-2 entry sequence

MOV   R11, SP
STMFD SP!,{FP,R11,R14,PC} ; and maybe others
SUB   FP, R11, #4
SUB   SP, SP, #4
CMPS  SP, SL
BLCC  |SYSTEM.STKOVF|

C entry sequence

MOV   R11, SP
STMFD SP!,{various}
SUB   FP, R11, #4
CMPS  SP, SL
BLCC  |x$stack_overflow|

This is 5 or 6 words with only 2 variable fields: the list of registers to be
saved, and the BLCC destination.  So we could compress it down a lot.
How much space could we save like this ?  Tricky question.  What is the
average length of a procedure ?

Investigating an image of ARX (OS, including window sys etc), we have
areas (code+data) 720K, and about 5800 symbols which may be procedures.
This gives an average length of 31 instructions/symbol.  If 5000 of these
had the full 6-word entry sequence, that accounts for 30000 words, i.e.
about 15% of the total size.  [Actually, counting occurrences of
CMPS SP,SL we find 3043 in ARX, so entry sequences are only about 8%,
which is probably hardly worth a special code. Most BL's are caught by the
current encoding scheme, in fact.]

In general, we don't do very well at spotting BL.  In fact most B foo
instructions are short and relative, most BL foo instructions are
going to absolute places: it would certainly be nice to spot that all
occurrences of BLCC |SYSTEM.STKOVF| are really the same.  This could
be handled by changing all offsets in BL instructions into offsets from
the image base before encoding, and then reconverting them on the fly
[We always know position relative to image base, so this is not too hard].

Could try to work out common macros of 2,3,4 instructions.  This would
easily catch long runs of zeros, which would be nice.  But if we want to
catch non word-pair aligned macros, decoding is complicated.

How good is the encoding scheme really ?
----------------------------------------

To attempt to find out how good the encoding scheme really is, I have
analysed the word frequency statistics for a number of programs,
calculating how many bits a Huffman encoding would be (excluding any
frequency tables).  This gives us a good measure of the amount of
redundancy in the file [viewed as a sequence of words], and thus a
lower bound on the size achievable by any compression algorithm which
treats the image just as a sequence of words.  The figures are as
follows:

   Program  Size  Huffman  Squeezed (without tables and code)  
     
   ab        45K    17K     25K     huff% = 17/45 = 37%
   cc       239K    91K    125K                     38%
   objasm   264K    81K    103K                     31%
   twin      42K    16K     21K                     38%
   diff      49K    16K     19K                     33%
   initf    406K   150K    199K                     37%

From this we can see that there is not a vast amount to be gained by
playing with the details of the encoding: no encoding can hope to reduce
things below about 37%, and given that we need to keep table overheads
small and decoding very fast, it is probably hard to get close to this.

The redundancy appears remarkably consistent: objasm is an aberration,
as 24% of it is Mod-2 static data which appears as zeros.


Load/exec address munging
-------------------------

This is all completely orthogonal to the compression stuff, but I describe
what squeeze does briefly.

First, it reads the load and exec addresses of the unsqueezed file.

If the load address is FFFFF8xx, it is is assumed that this is an Arthur
absolute image, so we take load = exec = &8000.

When the squeezed file is written, we put load = exec = old load address:
if load = &8000 and squeeze is running on an Arthur machine, we put
FFFFF8xx and the time of squeezing (i.e. Arthur absolute).


Small-memory squeeze [0.20 and later]
-------------------------------------

squeeze 0.20 and later use a number of neat tricks to use little space and
go fast.  Here is a brief overview of the algorithm.  For more details
read the code (it's only 800 lines).

1. Load the code to be squeezed.

2. Allocate a hash table of <word, frequency> pairs and scan the code
   to count frequencies of all non-zero words.

3. Flatten the table into a big list, and determine the maximum frequency.

4. Find the most common words by distribution sort.
   Make list of the <word, freq> pairs not in this set of common ones.

5. Sort the common words by value using the ANSI-C library qsort.

6. Allocate another hash table, scan the list of infrequent words and
   use hash table to accumulate frequencies of 3-byte values, masking
   out the low byte.

7. Repeat steps 3,4,5 above to build table of common 3-byte values.

8. Build two hash tables to determine encoding of a 4-byte or 3-byte
   value, viz mapping word -> index in table, or -1 if not found.

9. Scan the image, encoding pairs of words.  As much as possible is
   encoded in place, to save space and time.

Hidden amongst this are various sneaky tricks which are documented in
the code, e.g. packing an address and frequency together to fit
3 values in a 2-word record, clever storage allocation etc.
