Peter's Glossary of Vectorization Terminology

Compiler writers, architects, performance analysts, and application programmers have built up a fair number of confusing vectorization buzzwords over the years at Cray Research. This document attempts to define some of those terms for readers who may be new to vectorization, especially SGI/MTI engineers.

Bidirectional Memory (BDM)

The ability of most Cray vector processors to execute a vector load instruction in parallel with a vector store instruction, with the consequent headache for the compiler (and assembly language programmer) to ensure that the address streams are disjoint.

On the X-MP and its successors, bidirectional memory was a mode. That made it possible to disable it when a compiler or hardware bug was suspected. It also made it possible to use the "compatibility modes" of the Y-MP, C-90, and Triton, which ran old codes on new hardware but unfortunately used the "new" memory ordering rules in doing so.

On the Cray-3 there were special bidirectional forms of the strided memory reference instructions.

Bit Matrix Multiplication

An operation of a special functional unit that was available in a limited 32-bit form on some Y-MP processors and in a full 64-bit form on the C-90 and Triton processors.

Although bit matrix multiplication has a fearsome reputation for complexity, in principle it is a very simple operation. Let f(X,Y) be a function on two binary integers, defined as 1 if AND(X,Y) has an odd number of set bits (even parity) and as 0 otherwise.

f(X,Y) produces a single bit from two 64-bit operands. Now define g(S,V[]) as a function between a 64-bit scalar value and a 64-element vector, producing a 64-bit binary value. Let the most significant bit of g(S,V[]) be f(S,V[0]), the next bit be f(S,V[1]), and so forth down to the last bit f(S,V[63]). This function g(S,V[]) is the form of bit-matrix multiplication available as a scalar/vector operation.

Bit-matrix multiplication between two vectors, h(Va[],Vb[]), is defined simply in terms of g(S,V[]). The first element of h(Va[],Vb[]) is g(Va[0],Vb[]), the second element is g(Va[1],Vb[]), and so on.
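
The three definitions above can be modeled directly in a few lines of Python. This is only a sketch of the arithmetic, not of the hardware: the names f, g, and h follow the text, while MASK64, identity, and bit_reverse are illustrative names that do not come from any Cray documentation. The last two lines show the bit-permutation use mentioned below, under the assumption that bit 0 is the most significant bit.

```python
MASK64 = (1 << 64) - 1

def f(x, y):
    """Parity of AND(x, y): 1 if an odd number of bits are set, else 0."""
    return bin(x & y & MASK64).count("1") & 1

def g(s, v):
    """Scalar/vector form: bit i of the result, counting from the most
    significant bit, is f(s, v[i])."""
    assert len(v) == 64
    result = 0
    for element in v:
        result = (result << 1) | f(s, element)
    return result

def h(va, vb):
    """Vector/vector form: element i of the result is g(va[i], vb)."""
    return [g(s, vb) for s in va]

# Identity matrix: row i has only bit i set (bit 0 is the most significant).
identity = [1 << (63 - i) for i in range(64)]
# Reversing the rows gives a matrix that reverses the bits of each operand.
bit_reverse = identity[::-1]
```

With these definitions, g(s, identity) returns s unchanged and g(1, bit_reverse) returns a word with only the most significant bit set.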

But what is bit-matrix multiplication good for? Well, it's an easy way to permute the bits of its arguments (analogous to "centrifuging" on the Cray MPP systems) and to look for matching elements in two vectors. One presumes that special applications can also exploit the bit-matrix multiplication unit effectively.

Cray compilers don't use bit-matrix multiplication operations for vectorization automatically. Special intrinsic functions must be used in application source code instead.

CAL

Cray Assembly Language, which was based on expressions rather than mnemonics. Each line could be read as an assignment statement, with a result register on the left-hand side and a little expression on the right-hand side.

Chaining

The ability to begin reading the new elements written to a vector register before the entire register write is complete. The X-MP, its descendants, and the Cray-4 support chaining; the Cray-2 and Cray-3 did not; the Cray-1 would chain if the later vector operation was issued in a single, specific clock period (the "chain slot") after the issue of the earlier one.

Chaining permitted instruction issue to get way ahead of execution. If the functional unit and result register of a vector instruction were available, it could issue, even if the operand registers would not receive data for many clocks. Chaining is the extension of pipelining to multiple vector instructions.

Chain Slot

A delay time measured in clock periods and specific to each functional unit, the "chain slot" of a given vector instruction is the relative clock period in which the first element of the result will arrive from the unit at the result register. The term arose on the Cray-1, whose processor had just a single running pointer for each vector register to serve both reads and writes. Chaining would occur if an instruction that read a vector register issued in the chain slot clock period of the instruction that wrote a vector register, since the single pointer could be used to write and read the register simultaneously.

Chaining became more flexible on the later machines that still supported chaining at all, but it was still important to be aware of the chain slot times when scheduling code.

Chime

A dynamic sequence of vector instructions that will execute in parallel. They might be independent, or they may chain, or both. Chained vector operations are necessarily in the same chime.

Chimes are a useful abstraction for performance analysis. If the floating-point addition and floating-point multiplication functional units are used in each chime, the loop in question will run as close to peak megaflops as you can get. Or, put another way: if you're optimizing a loop with 7 multiplications in it, and you've gotten it down to 7 chimes, stop; you can't do any better.

Old-timers at Cray are pretty good at glancing at Fortran code and estimating performance in megaflops. There's no magic involved; they're just counting chimes.

Chime Break

The vector instruction that holds issue (usually for a long time) before beginning a new chime. If you use a hardware performance monitor to analyze instruction issue on highly vectorized code, you'll see that most of the time is spent holding issue at chime breaks.

Chime Time

The number of vector elements in a machine's vector register divided by the number of pipes and augmented by an overhead clock or two. The chime time on all past and present Cray hardware is about 64 clocks. This is the number of clocks in which the compiled code has to issue all the vector operations of a chime and do all the bookkeeping operations like incrementing the base pointers of memory references.

CIGS

Short for "Compressed Index / Gather / Scatter." The name of a feature of most X-MPs and all X-MP descendants that provides a simplified "compressed iota" operation and indirect vector memory addressing. See Iota, compressed and Gather/Scatter.

CMR

An assembly language mnemonic (yes, CAL had some) for "Complete Memory References." The CMR instruction holds issue until all previously issued memory instructions are irrevocably committed to memory. CMR is used to enforce memory ordering for multiprocessor synchronization as well as single-processor vector codes. It appeared on the X-MP along with multiprocessing and bidirectional memory. The C-90 and Triton support some lighter-weight forms of CMR for intraprocessor ordering.

See also pseudo-CMR.

Compressed Iota

See Iota, Compressed.

Compression Loop

A loop with a conditionally-incremented induction variable of the form

J = 0
DO I=1,N
   IF (...) THEN
      J = J + 1
      A(J) = B(I)
   END IF
END DO
Easily vectorized by computing a mask from the IF condition, using it to generate a compressed iota vector, doing a gather from B(), and then a regular strided store into A().
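
The sequence of steps just described can be sketched in Python for a single strip. The function name compress and the argument keep (which stands in for the IF condition evaluated on each iteration) are illustrative assumptions, and Python lists stand in for vector registers.

```python
def compress(b, keep):
    """Model of the vectorized compression loop for one strip."""
    mask = [1 if k else 0 for k in keep]               # vector compare -> mask
    index = [i for i, bit in enumerate(mask) if bit]   # compressed iota, stride 1
    gathered = [b[i] for i in index]                   # gather from B
    return gathered                                    # unit-stride store into A(1:J)
```

The length of the returned list is the final value of J, which is also the population count of the mask.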

Data Dependence Analysis

The analysis of the address computation expressions in a pair of memory references to determine, as precisely as possible, the conditions in which they refer to the same word. Usually defined in terms of relations between loop index variables. Unless data dependence analysis can prove two references are independent, there will be an ordering relationship that the compiler has to observe.

Density Time Operation

In a proposed Cray architecture in which every vector operation was controlled by a mask, only the operations corresponding to set mask bits would consume any time. If one bit was set in the mask, the operation would take one clock.

Early exit loop

An inductive loop that may not complete all of its iterations due to a loop-terminating conditional jump, as in

DO I=1,N
   D(I) = E(I)
   IF (A(I) .LT. 0) GO TO 666
   B(I) = C(I) * X
END DO
In general, compilers can vectorize early exit loops when it is safe to over-evaluate the IF condition and the dependence graph permits the IF condition(s) to be moved to the top of the loop. Multiple IFs can be handled by ORing masks together and applying a leading zero count to find the first set bit in any mask.

Element

A 64-bit part of a vector register, analogous to an array element. The Cray-1 and its descendants allow values to be moved between scalar registers and indirectly indexed elements of vector registers.

Expansion Loop

The simple expansion loop is an analogue to the compression loop above. A more challenging case is the "expand with drag" loop:

X = ...
J = 0
DO I=1,N
   IF (...) THEN
      J = J + 1
      X = B(J)
   END IF
   A(I) = X
END DO
The Cray-3 compiler was able to vectorize this loop using a scatter instruction with a carefully constructed index vector.

The continuous iota function proposed for a new vector instruction set architecture has a natural application to the expansion-with-drag loop.

Functional Unit

See Unit, Functional.

Gather

A vector load instruction in which the addresses of the elements are computed as the sums of an invariant scalar base address and the elements of a vector register. Gathers are used to vectorize nested array references:

DO I=1,N
   A(I) = B(IX(I))
END DO
By using a zero scalar base address, gathers can be used to vectorize C language pointer dereferences. They can also be used in IF statement vectorization (with index vectors computed from compressed iota operations).

Apart from the Cray-1 models and some early X-MPs, all Cray vector machines have supported gather and scatter instructions, though often with performance limitations on multiple simultaneous gathers or scatters.

Gather, Double

A Triton feature that exploits the common need to do multiple gathers with the same index vector, as in this loop:

DO I=1,N
   A(I) = B(1,IX(I)) + B(2,IX(I))
END DO
The Triton's double gather instruction uses one index vector and two base address values to gather two separate vectors at once. Due to severe instruction set encoding restrictions, the double gather instruction has to use one register designator field to denote both the index vector and one of the base address registers!

Hold Issue

Cray's proprietary processors fetch, issue, and execute instructions in order (though they complete out of order). An instruction holds issue if its result register cannot yet be written, an operand register is not yet available, or if the instruction's functional unit is busy.

IF conversion

Replacing IF statements in loops through the use of masked vector operations, gather/scatter, or the merge-store technique.

Induction Variable

An object with loop-invariant addressing (usually just a scalar variable) whose values during loop execution constitute a well-behaved sequence. A DO-loop control variable is an induction variable, of course.

Iota, Compressed

A function of a mask and a fixed stride that creates a vector value from these scalar quantities. The result has as many elements as the mask has bits set. Each element's value is the index of a mask bit (counting from the left, with the most-significant bit being index zero) times the stride.

Examples of compressed iota:
Mask bits      Stride   Compressed iota vector
11111111...    1        0 1 2 3 4 5 6 7 ...
01010101...    1        1 3 5 7 ...
01010101...    2        2 6 10 14 ...
Most X-MPs and all of the X-MP descendants can generate a unit-stride compressed iota vector as a side effect of mask generation (which is via comparison of a vector to zero). The Cray-2 and Cray-3 had a separate compressed iota instruction that corresponds to the definition here. The Triton also has this general form now, but requires the mask to be in the VM register rather than a general-purpose scalar register.
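
As a minimal sketch of the definition (not of any particular machine's instruction), compressed iota can be written in Python as follows. The function name compressed_iota and the use of a string of '0' and '1' characters for the mask, leftmost character first, are illustrative choices.

```python
def compressed_iota(mask_bits, stride):
    """One element per set mask bit: the bit's index from the left,
    with the leftmost bit at index zero, times the stride."""
    return [i * stride for i, bit in enumerate(mask_bits) if bit == "1"]
```

For example, compressed_iota("01010101", 2) yields the elements 2, 6, 10, 14.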

Iota, Continuous

A recent invention, not yet implemented in hardware, that would provide a way of doing compression and expansion without the register-to-register operations of Triton that are an implementation problem in multiple-pipe processors. The continuous iota function creates a vector from a mask, a length, and a fixed stride. Each element's value is the running count of set mask bits, multiplied by the stride.

Examples of continuous iota:
Mask bits      Stride   Continuous iota vector
11111111...    1        0 1 2 3 4 5 6 7 ...
01010101...    1        0 0 1 1 2 2 3 3 ...
00001000...    1        0 0 0 0 0 1 1 1 ...
00001000...    2        0 0 0 0 0 2 2 2 ...
The elements of a continuous iota vector model the values taken by a conditionally incremented induction variable.
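
A small Python model of continuous iota follows. It assumes, as the examples above suggest, that each element counts the set mask bits strictly before its own position, and it takes the length from the mask string itself. The function name continuous_iota and the string representation of the mask are illustrative.

```python
def continuous_iota(mask_bits, stride):
    """Element i is the number of set mask bits to the left of
    position i, multiplied by the stride."""
    result = []
    count = 0
    for bit in mask_bits:
        result.append(count * stride)
        if bit == "1":
            count += 1
    return result
```

Unlike compressed iota, the result always has one element per mask position, whether or not the bit is set.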

Iota Substitution

A step in vectorization. After determining that a loop can be safely vectorized, the uses of induction variables in the loop are replaced with calls to iota functions. Expressions and addressing that use these iota functions are then progressively simplified via algebraic identities. Array references with vector-valued subscripts are then replaced with gathers and scatters, unless their entire linearized addressing expressions have become iotas, in which case they become strided memory references.

Example (not including strip-mining or other transformations):


DO I=1,N
   A(I) = B(I*2)
END DO
After linearization of addressing on a word-addressable machine:

DO I=1,N
   store (base(A)+I-1, load (base(B)+I*2-1))
END DO
After iota substitution and throwing the loop control away:

   scatter (0,
            base(A)+iota(-1,1)+1-1,
            gather (0, base(B)+(iota(-1,1)+1)*2-1))
After applying algebraic identities and collecting scalar terms:

   scatter (base(A),
            iota(-1,1),
            gather (base(B)+1, iota(-1,2)))
After replacing gather/scatter from full-mask iota indices with strided load/store:

   vstore (base(A), 1, vload (base(B)+1, 2))

IVDEP

A compiler directive by which the programmer asserts that a loop has no "vector dependences", essentially telling the compiler that any unresolved data dependence questions are to be assumed safe. The specific semantics of an IVDEP directive have become messier as compilers have gotten smarter.

Kernel Scheduling

A combined instruction scheduling and vector register assignment technique that implemented polycyclic scheduling on the Cray-2 and Cray-3, providing good performance on long loops despite the absence of chaining.

Latency

Transit time through a functional unit, or response time on a load operation from memory.

Leading Zero Count

An integer operation provided in scalar form on all Cray Research proprietary processors and in vector form on all but the earliest Cray-1 systems. Returns the number of contiguous zero bits in the most significant positions of the 64-bit word. If the most significant bit of the argument is set, the result is zero; if the argument is zero, the result is 64.

The leading zero count function is useful when scanning through bit masks. It can also be used to find the rightmost set bit in a mask efficiently. We even use it for the logical NOT operator of C ("!").
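
The uses mentioned above can be sketched in Python. The function names are illustrative, and the logical_not trick is only one plausible way to build C's "!" from a leading zero count; the text does not say which sequence the compilers actually emit. rightmost_set_bit returns a bit position counted from the right, with the least significant bit at position 0.

```python
MASK64 = (1 << 64) - 1

def lzc(x):
    """Leading zero count of a 64-bit word: 0 if the top bit is set, 64 if x is 0."""
    return 64 - (x & MASK64).bit_length()

def rightmost_set_bit(x):
    """Position (from the right) of the lowest set bit; x must be nonzero.
    x & -x isolates that bit, and lzc then locates it."""
    return 63 - lzc(x & -x)

def logical_not(x):
    """1 if x is zero, else 0: lzc(x) reaches 64 (bit 6 set) only for x == 0."""
    return lzc(x) >> 6
```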

Local Memory

A small (128KB) fast memory within each processor of a Cray-2 or Cray-3 system, accessible only via special instructions executed by that processor (hence the name). Vectors could reference local memory only via unit stride loads and stores. Local memory was useful for spilling vector registers in compiled code. It was also used in hand-tuned libraries as a program-managed cache for codes with temporal locality, effectively providing a second memory port. Local memory could not be managed as a stack by software due to restricted scalar addressing modes (no indirect+offset mode). Generally frustrating to compiler writers; not carried forward into the Cray-4.

Mask

In general, a vector of single-bit values. Masks fit in one 64-bit scalar register on most Cray machines, but need 2 registers on a C-90 or Triton, where the maximum VL is 128. Masks are generated as the results of integer comparisons between vectors and zero, and also from IEEE floating-point comparisons on the most recent Tritons. Masks are presently used as input to merge operations and to the Triton's iota, compression, and expansion instructions.

Masked Vector Operation

In some proposed vector instruction set architectures, a vector mask would control nearly every vector operation. Zero bits in the mask would cause no-ops. Masked vector operation seems to provide a cleaner way to vectorize IF statements than present techniques.

Merge, Vector

An operation between a vector, a mask, and either a scalar or another vector. The elements of the vector result are selected from the operands by the mask bits.

Merge-Store technique

A way to vectorize IF statements containing only stores to arrays. The new values to be stored are merged under mask with the old array elements, so that the elements corresponding to false IF conditions are just loaded, merged, and stored back unchanged. This technique requires some redundant computation and care must be taken to avoid spurious interrupts.
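
As a rough sketch, the technique amounts to the following for one strip. The function name merge_store and its arguments are illustrative; Python lists stand in for vector registers and memory.

```python
def merge_store(a, new_values, mask):
    """Store new_values into a where mask is true; elsewhere the old
    elements are loaded, merged, and stored back unchanged."""
    old = list(a)                                   # vector load of the old A
    merged = [n if m else o                         # vector merge under mask
              for n, o, m in zip(new_values, old, mask)]
    a[:] = merged                                   # unconditional vector store
    return a
```

Note that new_values is computed for every element, including those whose IF condition is false. That is the redundant computation mentioned above, and it is where spurious interrupts can arise.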

Outer Loop Vectorization

Vectorizing a level of a loop nest other than the innermost.

Pack Loop

See Compression Loop.

Partial Vectorization

Vectorizing a loop in the presence of a hard data dependence cycle, leaving some operations in a scalar form. For example:

DO I=2,N
   A(I) = A(I-1) + B(I)*C(I)*D(I)*E(I)
END DO
would be partially vectorized by the Cray-3 compiler by using vector instructions for this expression:

   B(I)*C(I)*D(I)*E(I)
and then using the elements of the result vector as operands to the scalar operations of the inner part of the strip-mined loop nest.

Pipe

A collection of independent functional units and a slice of each vector register. The C-90 and Triton are 2-pipe machines, meaning that each vector register is interleaved across the two pipes (even elements on one, odd elements on the other). Vector instructions are dispatched to all pipes simultaneously, and, for most instructions, the pipes can execute independently. 4-pipe and 8-pipe machines have been proposed at Cray Research. The NEC SX-4 is an 8-pipe machine. Performance studies show that the payoff from multiple pipes drops off pretty quickly between 4 and 8 pipes.

Multiple pipes are generally invisible to the programmer, except for the acceleration they provide to highly vectorized codes with longer loop lengths.

Pipelining

Designing a functional unit so that a new operation can begin in each cycle, even though any particular operation will require multiple cycles to complete. If a unit has latency L, a pipelined implementation will require L + N - 1 cycles for N independent operations, as opposed to L * N cycles for a non-pipelined implementation. Also called "segmentation" of a functional unit (as in the CDC 7600).

Polycyclic Scheduling

In general, any form of instruction scheduling that causes multiple iterations of a loop to be active in parallel. In scalar compilers, polycyclic scheduling is often achieved by software pipelining or unrolling. Vectorizing compilers for Cray machines with chaining need to use only simple unrolling to get good performance from vector strip-mining loops, but compilers for machines without chaining have required ambitious scheduling techniques.

Population Count

A bitwise operation available in scalar form on all proprietary Cray processors and in vector form on all but the earliest Cray-1 systems. The population count of a 64-bit value is simply the number of bits set, an integer in the range 0 to 64.

When vectorizing an IF statement via the compressed index method, the population count of the mask becomes the vector length of the conditional code.

The population count of the exclusive OR of two values is called the "Hamming distance" and is useful in pattern matching. Population count can also be used to implement an efficient "trailing zero count" function.

The rightmost bit of the population count is the (even) parity function. Parity is so important to some customers that all Cray machines have also provided a form of the population count function that returns only the low-order bit.
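
The derived functions mentioned in this entry can be sketched in Python as follows. The function names are illustrative, and the trailing zero count shown is one well-known formulation, not necessarily the one used in Cray libraries.

```python
MASK64 = (1 << 64) - 1

def popcount(x):
    """Number of set bits in a 64-bit word, from 0 to 64."""
    return bin(x & MASK64).count("1")

def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return popcount(a ^ b)

def trailing_zero_count(x):
    """~x & (x - 1) sets exactly the bits below the lowest set bit of x."""
    return popcount(~x & (x - 1))

def parity(x):
    """Low-order bit of the population count."""
    return popcount(x) & 1
```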

Port

An abstraction of an independent path between a processor and the memory system crossbar; like a "functional unit" for memory. The Cray-1 and Cray-2 processors had single memory ports that had to carry both loads and stores; only one vector memory reference instruction could be active at a time. The X-MP and its descendants are three-port machines in which two loads and one store can all be simultaneously active.

The combination of multiple ports and multiple pipes provides the enormous per-processor bandwidths of the C-90 and Triton machines.

Pseudo-CMR

Using the side-effects of a vector instruction to force some ordering of operations without resorting to the relatively expensive CMR instruction. For example, to hold issue until a specific vector load has completed, perform a dummy vector element transfer from the loaded vector into a scratch scalar register.

PVP

A marketing term coined to distinguish Cray's symmetric parallel systems from its then-new MPP machine. Stands for Parallel Vector Processor, even though it pertains to a class of system architecture.

Recursion

A bug on the Cray-1 that turned out to be useful. As noted elsewhere, the Cray-1 had a single element pointer per vector register. If the same vector register was specified as both an input and the output register of a computational instruction (say, a floating-point addition), the elements of the result stream would be used as the operand elements in all but the first few element positions! Depending on the vector length and the latency of the operation, a value might cycle through the unit seven or eight times.

This behavior turned out to be a good way of performing the final step of a reduction loop, if you didn't mind having the correctness of compiled code depend on a fixed latency assumption.

Reduction

A data dependence cycle in which a scalar variable is computed by a commutative operation involving itself. Sum reductions, like:

X = 0.0
DO I=1,N
   X = X + A(I)
END DO
are common and important. Reductions can be vectorized by doing a vector's worth of partial sums in each strip, then adding up the vector's elements after the loop is done. This technique can change the results of reductions of floating-point addition and multiplication, unfortunately, so Cray compilers have switches that specifically disable vectorization of floating-point reductions.
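
The partial-sum technique, and the reason it can change floating-point results, can be illustrated with a short Python model. The function names and the maxvl parameter are illustrative; a real compiler would reduce the final vector with whatever sequence suits the hardware, whereas this sketch simply adds the partial sums in order.

```python
def sequential_sum(a):
    """The original scalar loop: one running sum, in iteration order."""
    x = 0.0
    for value in a:
        x = x + value
    return x

def strip_mined_sum(a, maxvl=64):
    """One partial sum per vector element position, combined after the loop."""
    partial = [0.0] * maxvl
    for start in range(0, len(a), maxvl):
        strip = a[start:start + maxvl]
        for lane, value in enumerate(strip):   # models one vector add per strip
            partial[lane] += value
    total = 0.0
    for p in partial:                          # final reduction of the vector
        total += p
    return total
```

With a = [1e16, 1.0, -1e16, 1.0] and maxvl=2, the sequential loop returns 1.0 but the strip-mined version returns 2.0, because the additions are associated differently.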

Remainder

When strip-mining, usually by a maximum hardware vector length, there's a short strip that has to be run before or after the full-length strips. Also comes up in unrolling.

Restructuring

A generic term for optimizing transformations that alter the execution of loop nests, including interchange, reversal, distribution, fusion, unrolling, peeling, strip-mining, and vectorization itself. Restructuring transformations must preserve the semantics of the program, and use dependence analysis to discover the ordering relationships that have to be observed.

Runtime Dependence

See symbolic analysis.

Safe VL

Using vector strips of possibly limited length to permit vectorization in the face of a loop-carried data dependence. The loop:

DO I=11,N
   A(I) = A(I-10) * X
END DO
can be safely vectorized so long as the strips are at most 10 elements long and proper bidirectional memory protection is in place between the strips. In cases like:

DO I=J,N
   A(I) = A(I+K) * X
END DO
the safe VL must be computed at runtime. When there's more than one potential dependence in the loop, runtime code computes the safe VL as a MIN function. The CFT77 compiler, which had difficulty duplicating code internally, even used the runtime safe VL computation technique to handle potential data dependences like:

DO I=2,N
   A(I,J) = A(I-1,K) * X
END DO
by computing a safe VL of 1 at runtime if J equals K, otherwise using the maximum hardware vector length. Note that any loop can be "vectorized" safely (with respect to data dependences) with a strip length of one, though performance may suffer when compared to scalar code.
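
The effect of the strip length can be checked with a small Python model of the first loop above, using 0-based indexing. The function names are illustrative, and the model assumes a positive dependence distance and that each strip performs its whole vector load before its vector store.

```python
def scalar_loop(a, distance, x):
    """Reference semantics: A(I) = A(I-distance) * X, one element at a time."""
    for i in range(distance, len(a)):
        a[i] = a[i - distance] * x
    return a

def strip_loop(a, distance, x, vl):
    """The same loop run in strips of at most vl elements."""
    for start in range(distance, len(a), vl):
        stop = min(start + vl, len(a))
        loaded = [a[i - distance] for i in range(start, stop)]   # vector load
        a[start:stop] = [value * x for value in loaded]          # vector store
    return a

def safe_vl(distance, maxvl=64):
    """Longest strip that never loads an element the same strip stores."""
    return min(distance, maxvl)
```

With a distance of 10, strips of length safe_vl(10) reproduce the scalar results, while strips of length 16 load elements that an earlier iteration of the same strip should already have updated.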

Scan Loop

Like a reduction, except that the stream of intermediate values is significant.

DO I=1,N
   X = X + A(I)
   B(I) = X
END DO
Scans are generally not vectorizable, but partial vectorization often can come into play. The term "scan" is from Iverson's APL language (as are also "iota", "compress", "expand", and "reduction", at least).

Scatter

The analogue of a gather for indirect vector store operations. Cray hardware guarantees that scatters with duplicate index vector elements are well-behaved.

Search Loop

A case of an early exit loop in which the value of the loop control variable is used after the loop. Historically difficult for Cray compilers to handle, surprisingly.

Strided Reference

A memory reference to multiple objects, addressed via a base address and a sequence of integer multiples of a fixed stride value. A strided vector load is conceptually equivalent to a gather from the same base address with an index vector generated from a compressed iota from a saturated mask.

Some folks use the word "strided" only if the stride is discontiguous.

Strip-mining

A loop restructuring transformation in which a loop becomes a two-level nest whose inner loop is limited to the stride of the outer loop. Thus:

DO I=1,N
   ...
END DO
becomes:

DO I1=1,N,MAXSTRIP
   I2 = MIN (N,I1+MAXSTRIP-1)
   DO I=I1,I2
      ...
   END DO
END DO
Vectorization for Cray hardware requires strip-mining because the iteration space has to be partitioned into chunks that are no longer than the length of a vector register. After strip-mining, the operations of the inner loop are rewritten in vector form and the inner loop itself, so recently created, is discarded.
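
The partitioning of the iteration space can be made concrete with a short Python sketch that lists the (I1, I2) bounds the outer loop above would produce. The function name strips is illustrative, and the default of 64 is just the register length of most of the machines discussed here.

```python
def strips(n, maxstrip=64):
    """Bounds (I1, I2) of each strip for DO I=1,N, using 1-based
    inclusive indices as in the Fortran above."""
    return [(i1, min(n, i1 + maxstrip - 1))
            for i1 in range(1, n + 1, maxstrip)]
```

For N = 150 this gives two full strips and a remainder of 22 elements: (1, 64), (65, 128), (129, 150).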

Strip-mining has become useful on scalar machines with caches as part of a loop restructuring transformation called "tiling", in which strip-mining is coupled with loop interchange to optimize spatial and temporal locality.

Symbolic analysis

A means of resolving potential data dependence questions that are determined by the values of invariant scalars. For example,

DO I=1,N
   A(I) = A(I+J) * X
END DO
can be vectorized safely if J is nonnegative, or run with shorter strip lengths if J is negative. Some compilers can answer these questions from earlier IF statement conditions, such as:

IF (J .LT. 0) STOP
What cannot be deduced at compilation time can often be cheaply tested during execution to select a version of code optimized under specific assumptions.

More generally, symbolic analysis is an approach to data dependence analysis in which the compiler constructs a system of linear integer equations and attempts to solve it without violating constraints. If no solution is possible, no dependence exists. If a solution is possible, the reduced system of equations can be translated into a runtime test. The Cray-3's vectorizing compiler used symbolic analysis exclusively for data dependence analysis after an efficient algorithm was discovered.

Tailgating

A capability of the Cray-3 and some very late model Cray-2 machines. Tailgating allowed a vector instruction to begin writing the elements of a vector register without waiting for a prior instruction that was reading the vector to complete. This behavior is the opposite of chaining.

Tailgating effectively increases the number of vector registers, since any vector register whose last use is in a given chime can be immediately used for the result of a later instruction in the same chime. This feature relieved register pressure, which was especially bad on the Cray-2 and Cray-3 machines due to their need for ambitious polycyclic scheduling for performance in the absence of chaining.

Triton

Internal code-name for the Cray T-90 product line. There are two kinds of Triton processors. The first has a C-90 compatibility mode (which unfortunately uses Triton memory ordering rules) and a Triton native mode with Cray-1 floating-point arithmetic. There is also the "IEEE Triton" processor that is incompatible with everything, since it uses IEEE 754 64-bit floating-point data formats. Unfortunately, though the IEEE Triton required applications to be recompiled, the instruction set wasn't cleaned up; it has the same small register sets of the Cray-1 architecture.

Unpack Loop

See Expansion Loop.

Unrolling

A loop transformation in which multiple loop iterations are executed together, possibly with operations intermixed by scheduling. Single-chime vector loops benefit from modest unrolling because register conflicts can be avoided.

Unit Busy Time

The number of clock periods for which a vector instruction will reserve a functional unit. On Cray machines, unit busy time is the vector length divided by the number of pipes, plus a small overhead delay.

Unit, Functional

The part of the processor that computes a basic function. Vector functional units are heavily pipelined. All Cray machines have had separate floating-point addition and floating-point multiplication functional units, with various other units dedicated to integer operations, shifts, logicals, and so forth (often in combination). A functional unit can be used at most once in a chime.

Update Loop

An important gather/scatter loop of the form

DO I=1,N
   A(IX(I)) = A(IX(I)) + B(I)
END DO
The B(I) term can be any vectorizable expression. The problem with vectorizing the update loop stems from the possibility of having duplicate entries in the IX(I) array. Cray compilers vectorize the update loop via a patented technique in which a scatter of an iota vector is used to check for duplicates, which are fixed up in out-of-line scalar code that relies on the commutativity of addition.
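
The text does not spell out the details of the patented technique, so the following Python sketch shows only one plausible form of the duplicate check. It scatters an iota vector through IX into scratch storage, gathers it back through the same indices, and flags every element position that does not read back its own number. The gather-back step, the assumption that the last write to a location wins, and all of the names are assumptions made for illustration; indices are 0-based.

```python
def find_duplicates(ix, table_size):
    """Return the element positions whose index value is repeated later in ix."""
    scratch = [None] * table_size
    lanes = list(range(len(ix)))                  # iota vector 0, 1, 2, ...
    for lane, index in zip(lanes, ix):            # scatter iota through IX
        scratch[index] = lane                     # assumed: last writer wins
    readback = [scratch[index] for index in ix]   # gather back through IX
    return [lane for lane in lanes if readback[lane] != lane]
```

An empty result means the strip can be updated with a plain gather, add, and scatter. Otherwise the flagged positions are the ones that need the scalar fix-up.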

Vector Register

A register file with many entries but only 1 read and 1 write port. The elements are read in succession, and generally written that way (exception: Cray-2 and Cray-3 memory loads).

Vectorization

The conversion of scalar loops into strip-mining loops with vector instructions, in particular, and the necessary analysis and transformations making the conversion legal, in general.

VFUNCTION

A feature of Cray compilers that indirectly allows the vectorization of loops containing user function calls. The VFUNCTION F directive informs the compiler that an external function F exists in three versions. Further, VFUNCTION permits the compiler to assume that the external function is free of side effects.

The last release of the CFT77 compiler can generate multiple versions of a function with the right naming and argument passing conventions to permit the user to write VFUNCTIONs in Fortran.

VL

The name of the Vector Length control register. The value of VL determines the length of vector operations. The maximum value of VL is the same as the number of elements in the vector registers.

The interpretation of values stored into the VL register has changed from one generation to the next.
Value Stored   Cray-1   X-MP, Y-MP   Cray-2   Cray-3   C-90, Triton
0              64 *     64           64 (0)   0        128
1              1        1            1        1        1
64             64       64           64 (0)   64       64
65             1        1            1        64       65
128            64       64           64 (0)   64       128
129            1        1            1        64       1

NOTES: The Cray-1 had no instruction for reading the VL register. Although the Cray-2 did have a read VL instruction, it returned 0 when the effective VL value was 64! The Cray-3 allowed VL to be set to zero (turning vector instructions into no-ops) and interpreted values greater than 64 as 64. The C-90 and Triton processors have vector registers with 128 elements.
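
The table rows are consistent with two simple rules, sketched below in Python: a wrap-around rule for every machine except the Cray-3, and saturation at 64 for the Cray-3. That these rules hold for stored values not listed in the table is an assumption, and the function name effective_vl and the machine name strings are illustrative. The sketch models only the effective vector length, not the value returned by the Cray-2's read VL instruction.

```python
def effective_vl(machine, stored):
    """Effective vector length for a nonnegative value stored into VL."""
    if machine == "Cray-3":
        return min(stored, 64)            # 0 stays 0; larger values saturate
    maxvl = 128 if machine == "C-90/Triton" else 64
    return (stored - 1) % maxvl + 1       # 0 and multiples of maxvl give maxvl
```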

VM

The name of the Vector Mask control register. VM receives the results of comparison instructions and is an operand to merge instructions, as well as Triton's iota, compression, and expansion instructions. It is possible to transfer between VM and the general purpose scalar registers.

VM has as many bits as vector registers have elements, since it is conceptually a vector of 1-bit values. The most significant bit of VM (as interpreted as an integer) corresponds to element 0 of the vectors. On the C-90 and Triton, VM occupies two 64-bit words.