Tales of Bit-Twisting Terror

Episode 1: The XORed Shift Count Trick

Linux.  We were finally porting the Linux kernel to our machine!

Writes to the filesystem weren't working.  Reads were okay, but writes were corrupting the disk.  Our I/O wizard tracked the problem down to the filesystem's disk block allocator.

It turned out that the Ext2 filesystem maintains bit maps of allocated disk blocks.  Each bit in the map corresponds to a block in a block group.  The map occupies a single block on the disk.  When the filesystem is configured to use 1024 bytes per block (two sectors), the map has 8192 bits, and the block group size is necessarily 8MiB.

To write a new block to the filesystem, Ext2 scans the bit map.  Bits that are set correspond to blocks that have already been allocated.  So Ext2 scans the bit map for a bit that's zero.

A really obvious but slow way to perform that scan would be to loop through the bytes of the allocation map until a byte is found that does not have all of its bits set; in other words, a byte not equal to 255 (hex 0xff).  Such a byte has at least one bit that's clear, and one could identify it by repeatedly shifting its value rightwise until its lowest-order bit was zero.  This would work because Ext2 uses the common convention that bit 0 of a byte is its least-significant bit, and bit 7 is its most-significant bit.

But heck, we're on a Cray machine with 64-bit registers and special bit-twisting instructions.  Surely there's a faster way of scanning the allocation map for a zero bit!

There's a complication here that needs to be taken into account.  The Cray is a big-endian machine, but Ext2 always uses a little-endian byte order for data on the disk.  This makes great sense for a filesystem developed on IBM PC-compatible hardware, which uses Intel microprocessors, which are little-endian.

What's the difference?  Well, if you're only accessing data a byte at a time, there isn't one.  But processors that can reference data that's larger than a single byte unit need to choose a byte ordering for those multi-byte transfers.  A little-endian processor storing a 32-bit value to a given byte address in memory will store the least-significant byte of the value to the first byte in memory and the most-significant byte of the value to the fourth byte in memory.  A big-endian processor will store the data in the opposite order.

(Both little-endian and big-endian orderings have their proponents, but everybody agrees that any machine's data ordering should be just one or the other.  This has not always been the case!)

The Cray instruction set architectures have always supported a useful operation called the Leading Zero Count instruction, and so does our newest machine.  It scans a 64-bit or 32-bit value and returns the big-endian index of the highest-order bit that's set in the word.  If a word's sign bit is set, Leading Zero Count returns zero;  if the entire word is zero, Leading Zero Count returns the number of bits in the word.

Now if we had ourselves a little-endian processor in our machine, what we would really want to be using for this algorithm would be a Trailing Zero Count instruction.  We would just scan the allocation bit map for a 64-bit word that was not equal to -1, take its one's-complement, and perform a Trailing Zero Count to find the little-endian index of the rightmost set bit.  When we were arguing years back about the instruction set that this machine should implement, a Trailing Zero Count instruction was briefly considered, but we realized that we don't need it because a Trailing Zero Count function can be cheaply synthesized from our Bit Population Count operation.  This instruction just counts the number of bits that are set in its argument.  And you can use it to implement a Trailing Zero Count function for a nonzero argument x by simply computing pop(xor(x, x-1)) - 1.  If that idiom is unfamiliar, try it with some sample values to see how it works.  It depends on the fact that subtracting 1 from a nonzero value will complement its lower-order bits, up to and including the lowest-order bit that's set.

Pardon the digression.  We don't have a little-endian processor.  And my initial attempt at using big-endian Leading Zero Count on Ext2 block allocation bit maps was failing.  After I fixed it, our I/O wizard stopped by and actually congratulated me for the trick used in the new code, which motivated me to share it with you.  This is the first code that I can remember writing that contains both an AND and an XOR by the magic value of 56.

Each 64-bit word contains, of course, eight bytes.  From the perspective of Ext2's block allocator, these 64 bits on a little-endian machine would represent (say) blocks 0 through 63 from the least-significant bit up to the sign bit.  On our big-endian machine, the bytes appear to have been reversed, but not the bits therein, so these 64 bits represent blocks 56-63 in the low-order byte, 48-55 in the next byte, and 0-7 in the high-order byte, with the sign bit corresponding to block 7.

Assuming that the 64-bit word contains at least one clear bit, we can take its one's-complement and have a nonzero result.  The next step is to use Leading Zero Count to find the byte containing that bit position.  This is simply a matter of discarding the low-order three bits of the Leading Zero Count value to align it to a big-endian byte boundary.  Knowing then the index of the byte containing the first free block's bit position, we simply have to find that bit's index within the byte.

This is where the Trailing Zero Count idiom that I mentioned earlier comes into play.  All we really have to do is compute the Trailing Zero Count value of the byte that we've just found.  But the Trailing Zero Count idiom only works with right-justified data.  So we have to shift the byte rightwards into the low-order byte position.  What shift count should we use, given that we have the big-endian index of the uppermost bit in the byte?  (Equivalently, we could use the Trailing Zero Count idiom and subtract a left-shifted value of 1.)

The trick is to shift the data rightwards with a shift count equal to the big-endian bit position XORed with 56.  Does it work?  Let's check.  Clearly, this will move the uppermost byte into the lowermost position (0 ^ 56 == 56 bits); the second byte also seems to work (8 ^ 56 == 48 bits), and yes, so do the other six bytes.

But why does this work?  It's a specific application of a useful but little-known general rule for endianness conversions.  Suppose that in a block of m memory units (bytes, bits, or whatever), we need to reverse the apparent order of the n-unit partitions -- and assume that m and n are powers of two.  The mapping is this:  map(i) = xor(i, m-1, n-1).  In this case, we're swapping 8-bit partitions in groups of 64 bits, so we xor the index by xor(64-1, 8-1) == xor(63,7) == 56.

Okay, fine, but why does that
work?  It's because of a perhaps more general trick: subtraction from a value that's all one bits is equivalent to complementing the field.


The final version of the fixed algorithm, less the comments that summarize this story, is as follows:

for (i = 0; i < words; i++)
    if ((x = ~addr [i])) {
       j = _lz (x) & 56;
       x >>= j ^ 56;
       j += _pop (x ^ x-1) - 1;
       j += i << 6;
       return j >= size ? size : j;
    }
return size;

And that's how we got Ext2 fixed so we could move on to the next problem.