Tales
of Bit-Twisting Terror
Episode
1: The XORed Shift Count Trick
Linux. We were finally
porting the Linux kernel to our machine!
Writes to the filesystem weren't working. Reads were okay, but
writes were corrupting the disk. Our I/O wizard tracked the
problem down to the filesystem's disk block allocator.
It turned out that the Ext2 filesystem maintains bit maps of allocated
disk blocks. Each bit in the map corresponds to a block in a
block group. The map occupies a single block on the disk.
When the filesystem is configured to use 1024 bytes per block (two
sectors), the map has 8192 bits, and the block group size is
necessarily 8MiB.
To write a new block to the filesystem, Ext2 scans the bit map.
Bits that are set correspond to blocks that have already been
allocated. So Ext2 scans the bit map for a bit that's zero.
A really obvious but slow way to perform that scan would be to loop
through the bytes of the allocation map until a byte is found that does
not have all of its bits set; in other words, a byte not equal to 255
(hex 0xff).
Such a byte has at least one bit that's clear, and one could identify
it by repeatedly shifting its value rightwise until its lowest-order
bit was zero. This would work because Ext2 uses the common
convention that bit 0 of a byte is its least-significant bit, and bit 7
is its most-significant bit.
But heck, we're on a Cray machine with 64-bit registers and special
bit-twisting instructions. Surely there's a faster way of
scanning the allocation map for a zero bit!
There's a complication here that needs to be taken into account.
The Cray is a big-endian machine, but Ext2 always uses a little-endian
byte order for data on the disk. This makes great sense for a
filesystem developed on IBM PC-compatible hardware, which uses Intel
microprocessors, which are little-endian.
What's the difference? Well, if you're only accessing data a byte
at a time, there isn't one. But processors that can reference
data that's larger than a single byte unit need to choose a byte
ordering for those multi-byte transfers. A little-endian
processor storing a 32-bit value to a given byte address in memory will
store the least-significant byte of the value to the first byte in
memory and the most-significant byte of the value to the fourth byte in
memory. A big-endian processor will store the data in the
opposite order.
(Both little-endian and big-endian orderings have their proponents, but
everybody agrees that any machine's data ordering should be just one or
the other. This has not always been the case!)
The Cray instruction set architectures have always supported a useful
operation called the Leading Zero Count instruction, and so does our
newest machine. It scans a 64-bit or 32-bit value and returns the
big-endian index of the highest-order bit that's set in the word.
If a word's sign bit is set, Leading Zero Count returns zero; if
the entire word is zero, Leading Zero Count returns the number of bits
in the word.
Now if we had ourselves a little-endian processor in our machine, what
we would really want to be using for this algorithm would be a Trailing Zero Count
instruction. We would just scan the allocation bit map for a
64-bit word that was not equal to -1, take its one's-complement, and
perform a Trailing Zero Count to find the little-endian index of the
rightmost set bit. When we were arguing years back about the
instruction set that this machine should implement, a Trailing Zero
Count instruction was briefly considered, but we realized that we don't
need it because a Trailing Zero Count function can be cheaply
synthesized from our Bit Population Count operation. This
instruction just counts the number of bits that are set in its
argument. And you can use it to implement a Trailing Zero Count
function for a nonzero argument x
by simply computing pop(xor(x, x-1))
- 1. If that idiom is unfamiliar, try it with some sample
values to see how it works. It depends on the fact that
subtracting 1 from a nonzero value will complement its lower-order
bits, up to and including the lowest-order bit that's set.
Pardon the digression. We don't have a little-endian
processor. And my initial attempt at using big-endian Leading
Zero Count on Ext2 block allocation bit maps was failing. After I
fixed it, our I/O wizard stopped by and actually congratulated me for
the trick used in the new code, which motivated me to share it with
you. This is the first code that I can remember writing that
contains both an AND and an XOR by the magic value of 56.
Each 64-bit word contains, of course, eight bytes. From the
perspective of Ext2's block allocator, these 64 bits on a little-endian machine would
represent (say) blocks 0 through 63 from the least-significant bit up
to the sign bit. On our big-endian machine, the bytes appear to
have been reversed, but not the bits therein, so these 64 bits
represent blocks 56-63 in the low-order byte, 48-55 in the next byte,
and 0-7 in the high-order byte, with the sign bit corresponding to
block 7.
Assuming that the 64-bit word contains at least one clear bit, we can
take its one's-complement and have a nonzero result. The next
step is to use Leading Zero Count to find the byte containing that bit
position. This is simply a matter of discarding the low-order
three bits of the Leading Zero Count value to align it to a big-endian byte boundary.
Knowing then the index of the byte containing the first free block's
bit position, we simply have to find that bit's index within the byte.
This is where the Trailing Zero Count idiom that I mentioned earlier
comes into play. All we really have to do is compute the Trailing
Zero Count value of the byte that we've just found. But the
Trailing Zero Count idiom only works with right-justified data.
So we have to shift the byte rightwards into the low-order byte
position. What shift count should we use, given that we have the
big-endian index of the uppermost bit in the byte? (Equivalently,
we could use the Trailing Zero Count idiom and subtract a left-shifted
value of 1.)
The trick is to shift the data rightwards with a shift count equal to
the big-endian bit position XORed with 56. Does it work?
Let's check. Clearly, this will move the uppermost byte into the
lowermost position (0 ^ 56 == 56 bits); the second byte also seems to
work (8 ^ 56 == 48 bits), and yes, so do the other six bytes.
But why does this work?
It's a specific application of a useful but little-known general rule
for endianness conversions. Suppose that in a block of m memory units (bytes, bits, or
whatever), we need to reverse the apparent order of the n-unit partitions -- and assume
that m and n are powers of two. The
mapping is this: map(i) = xor(i,
m-1, n-1).
In this case, we're swapping 8-bit partitions in groups of 64 bits, so
we xor the index by xor(64-1, 8-1) == xor(63,7) == 56.
Okay, fine, but why does that work? It's because of a
perhaps more general trick: subtraction from a value that's all one
bits is equivalent to complementing the field.
The final version of the fixed algorithm, less the comments that
summarize this story, is as follows:
for (i = 0; i <
words; i++)
if ((x = ~addr [i])) {
j = _lz (x) & 56;
x >>= j ^ 56;
j += _pop (x ^ x-1) - 1;
j += i << 6;
return j >= size ? size : j;
}
return size;
And that's how we got Ext2 fixed
so we could move on to the next problem.