A piece of hardware had five
memory-mapped registers implementing an array of five words. For
obscure but important reasons, the words could not be ordered as one
might expect: [0] first and [4] last. No, instead, the
memory-mapped registers in address order were actually going to contain
the elements of the array in the order [1], [2], [3], [4], [0].
Odd, but it could be worse. I coded up some access macros that would
map an index into this array. Being once upon a time a
mathematician, I knew that the simple mapping (i-1) mod 5 would do the trick to
convert a "normal" index into the rotated form. And so would (i+4) mod 5, and this variant might
be somewhat safer when used with unsigned
data.
The resulting macro elicited some groans. "We have to do a division to access an MMR?
This is going to slow down a hot path through the interrupt handler!"
Not really. You see, when we designed the instruction set for the
Cray X1 processor, we made sure that it would support a variant form of
the integer multiplication instruction. Binary multiplication of
two n-bit values yields a
product with 2n bits in
it. A normal integer multiplication function truncates that
product and returns its lower half. The variant form, however,
returns the upper half of the product. (This instruction exists
on the DEC Alpha processor as UMULH,
and the upper half of a binary product is also available by other means
on the MIPS and Intel x86 processors too.)
Besides its obvious value for multiprecision arithmetic, being able to
recover the upper half of a binary product makes integer division by a
constant value into a cheap operation, at least when compared to the
general integer division instruction (assuming that a processor even
has one). And if division by a constant is inexpensive, then so
is a modulus.
The technique is straightforward, though the resulting machine
instructions seem somewhat magical. The idea is to replace an
integer division by a constant value with a multiplication by the value's
fixed-point reciprocal value.
To implement a fast unsigned 64-bit division by five, as in this case,
you just multiply by 14757395258967641292,
take the upper half of the 128-bit full product, and shift it right two
bits.
Whoa.
This makes much more sense if you consider the binary values involved
as fixed-point numbers and
watch where the binary point is placed in them. Everybody knows
what floating-point values look like (IEEE-754 nearly exclusively
today) and more or less how they work, but non-integral fixed-point
numbers are somewhat less commonly known and used.
The binary reciprocal of five is a repeating fraction.
That's where the magic constant came
from. When we multiply the numerator by 0xcccccccccccccccc
and take the upper half of the 128-bit full product, there is an
implicit binary point to the left of the reciprocal value, and another
right between the upper and lower halves of the full product.
This
means that the upper half of the product equals floor(4x/5), and thus by shifting
that value right by two bits we have floor(floor(4x/5)
/ 4), which is floor(x/5).
One need not do this work by
hand unless the compiler is pretty lame. Instead, while keeping
in mind a
general fear and loathing of expensive operations like general integer
division on most processors, it's important to recognize that the
special case of a constant denominator can be cheap.