Tales of Bit-Twisting Terror

Episode 2: Fear Not a Constant Mod

A piece of hardware had five memory-mapped registers implementing an array of five words.  For obscure but important reasons, the words could not be ordered as one might expect: [0] first and [4] last.  No, instead, the memory-mapped registers in address order were actually going to contain the elements of the array in the order [1], [2], [3], [4], [0].

Odd, but it could be worse. I coded up some access macros that would map an index into this array.  Being once upon a time a mathematician, I knew that the simple mapping (i-1) mod 5 would do the trick to convert a "normal" index into the rotated form.  And so would (i+4) mod 5, and this variant might be somewhat safer when used with unsigned data.

The resulting macro elicited some groans.  "We have to do a division to access an MMR?  This is going to slow down a hot path through the interrupt handler!"

Not really.  You see, when we designed the instruction set for the Cray X1 processor, we made sure that it would support a variant form of the integer multiplication instruction.  Binary multiplication of two n-bit values yields a product with 2n bits in it.  A normal integer multiplication function truncates that product and returns its lower half.  The variant form, however, returns the upper half of the product.  (This instruction exists on the DEC Alpha processor as UMULH, and the upper half of a binary product is also available by other means on the MIPS and Intel x86 processors too.)

Besides its obvious value for multiprecision arithmetic, being able to recover the upper half of a binary product makes integer division by a constant value into a cheap operation, at least when compared to the general integer division instruction (assuming that a processor even has one).  And if division by a constant is inexpensive, then so is a modulus.

The technique is straightforward, though the resulting machine instructions seem somewhat magical.  The idea is to replace an integer division by a constant value with a multiplication by the value's fixed-point reciprocal value.

To implement a fast unsigned 64-bit division by five, as in this case, you just multiply by 14757395258967641292, take the upper half of the 128-bit full product, and shift it right two bits.

Whoa.

This makes much more sense if you consider the binary values involved as fixed-point numbers and watch where the binary point is placed in them.  Everybody knows what floating-point values look like (IEEE-754 nearly exclusively today) and more or less how they work, but non-integral fixed-point numbers are somewhat less commonly known and used.

The binary reciprocal of five is a repeating fraction.

1/5 == 1/8 + 1/16 + 1/128 + 1/256 + ... == 0.0011001100110011... (binary)

To get 64-bit precision, we'll effectively shift that value leftwards two bits, and use

4/5 == 0.11001100110011... (binary)
    == 0x0.cccccccccccccccc (hexadecimal)

That's where the magic constant came from.  When we multiply the numerator by 0xcccccccccccccccc and take the upper half of the 128-bit full product, there is an implicit binary point to the left of the reciprocal value, and another right between the upper and lower halves of the full product.  This means that the upper half of the product equals floor(4x/5), and thus by shifting that value right by two bits we have floor(floor(4x/5) / 4), which is floor(x/5).

One need not do this work by hand unless the compiler is pretty lame.  Instead, while keeping in mind a general fear and loathing of expensive operations like general integer division on most processors, it's important to recognize that the special case of a constant denominator can be cheap.