Compiler writers, architects, performance analysts, and application
programmers have built up a fair number of confusing vectorization
buzzwords over the years at Cray Research.
This document attempts to define some of those terms for
readers who may be new to vectorization, especially SGI/MTI
engineers.
Bidirectional Memory (BDM)
The ability of most Cray vector processors to execute a vector
load instruction in parallel with a vector store instruction, with the
consequent headache for the compiler (and
assembly language programmer) to ensure that the address streams are disjoint.
On the X-MP and its successors, bidirectional memory was a mode.
That made it possible to disable it when a compiler or hardware bug was suspected.
It also made it possible to use the "compatibility modes" of the Y-MP, C-90,
and Triton, which ran old codes on new hardware but unfortunately used the
"new" memory ordering rules in doing so.
On the Cray-3 there were special bidirectional forms of the strided memory
reference instructions.
Bit Matrix Multiplication
An operation of a special functional unit that was available in a
limited 32-bit form on some Y-MP processors and in a full 64-bit form
on the C-90 and Triton processors.
Although bit matrix multiplication has a fearsome reputation for
complexity, in principle it is a very simple operation.
Let f(X,Y) be a function on two binary integers, defined
as 1 if AND(X,Y) has an odd number of set bits
(even parity) and as 0 otherwise.
f(X,Y) produces a single bit from two 64-bit operands.
Now define g(S,V[]) as a function between a 64-bit scalar
value and a 64-element vector, producing a 64-bit binary value.
Let the most significant bit of g(S,V[]) be f(S,V[0]),
the next bit be f(S,V[1]), and so forth down to the
last bit f(S,V[63]). This function g(S,V[]) is
the form of bit-matrix multiplication available as a scalar/vector
operation.
Bit-matrix multiplication between two vectors, h(Va[],Vb[]),
is defined simply in terms of g(S,V[]).
The first element of h(Va[],Vb[]) is g(Va[0],Vb[]),
the second element is g(Va[1],Vb[]), and so on.
But what is bit-matrix multiplication good for?
Well, it's an easy way to permute the bits of its arguments (analogous
to "centrifuging" on the Cray MPP systems) and to look for matching
elements in two vectors. One presumes that special applications
can also exploit the bit-matrix multiplication unit effectively.
Cray compilers don't use bit-matrix multiplication operations for
vectorization automatically.
Special intrinsic functions must be used in application source
code instead.
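The definitions of f, g, and h in this entry translate almost directly into code. The Python sketch below models only the arithmetic, not the hardware unit or any Cray intrinsic; the names MASK64, identity, and reverse are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def f(x, y):
    """1 if AND(x, y) has an odd number of set bits, else 0."""
    return bin(x & y & MASK64).count("1") & 1


def g(s, v):
    """Scalar/vector form: result bit i, counted from the most
    significant bit, is f(s, v[i])."""
    assert len(v) == 64
    result = 0
    for elem in v:
        result = (result << 1) | f(s, elem)
    return result


def h(va, vb):
    """Vector/vector form: element i of the result is g(va[i], vb)."""
    return [g(a, vb) for a in va]


# Illustrative bit matrices (hypothetical names).
# identity: element i has only bit i set, counting from the MSB.
identity = [1 << (63 - i) for i in range(64)]
# reverse: element i has only bit i set, counting from the LSB.
reverse = [1 << i for i in range(64)]
```

With these definitions, g(x, identity) returns x unchanged and g(x, reverse) returns x with its 64 bits in reverse order, which is a small instance of the bit-permutation use mentioned above.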
CAL
Cray Assembly Language, which was based on expressions rather than mnemonics.
Each line could be read as an assignment statement, with a result register
on the left-hand side and a little expression on the right-hand side.
Chaining
The ability to begin reading the new elements written to a vector register
before the entire register write is complete. The X-MP, its descendants,
and the Cray-4 support chaining; the Cray-2 and Cray-3 did not; the Cray-1
would chain if the later vector operation was issued in a single, specific
clock period (the "chain slot") after the issue of the earlier one.
Chaining permitted instruction issue to get way ahead of execution. If the
functional unit and result register of a vector instruction were available,
it could issue, even if the operand registers would not receive data for many
clocks. Chaining is the extension of pipelining
to multiple vector instructions.
Chain Slot
A delay time measured in clock periods and specific to each functional unit.
The "chain slot" of a given vector instruction is the relative clock period in
which the first element of the result will arrive from the unit at the result
register. The term arose on the Cray-1,
whose processor had just a single running pointer for each vector register to serve both
reads and writes. Chaining would occur if an instruction that read a vector register
issued in the chain slot clock period of the instruction that wrote a vector register, since
the single pointer could be used to write and read the register simultaneously.
Chaining became more flexible on the later machines that still supported chaining at all,
but it was still important to be aware of the chain slot times when scheduling code.
Chime
A dynamic sequence of vector instructions that will execute in parallel.
They might be independent, or they may chain, or both.
Chained vector operations are necessarily in the same chime.
Chimes are a useful abstraction for performance analysis.
If the floating-point addition
and floating-point multiplication functional units are used in each chime,
the loop in question will run as close to peak megaflops as you can get.
Or, put another way: if you're optimizing a loop with 7 multiplications in it,
and you've gotten it down to 7 chimes, stop; you can't do any better.
Old-timers at Cray are pretty good at glancing at Fortran code and estimating
performance in megaflops.
There's no magic involved; they're just counting chimes.
Chime Break
The vector instruction that holds issue (usually for a long time) before beginning a new
chime. If you use a hardware performance monitor to analyze instruction issue on
highly vectorized code, you'll see that most of the time is spent
holding issue at chime breaks.
Chime Time
The number of vector elements in a machine's vector register divided by the number of
pipes and augmented by an overhead clock or two. The chime time on all past and
present Cray hardware is about 64 clocks. This is the number of clocks in which the
compiled code has to issue all the vector operations of a chime and do all the
bookkeeping operations like incrementing the base pointers of memory references.
CIGS
Short for "Compressed Index / Gather / Scatter." The name of a feature of most X-MPs
and all X-MP descendants that provides a simplified "compressed iota" operation and
indirect vector memory addressing. See Iota, Compressed and Gather/Scatter.
CMR
An assembly language mnemonic (yes, CAL had some)
for "Complete Memory References." The CMR instruction holds issue until all
previously issued memory instructions
are irrevocably committed to memory.
CMR is used to enforce memory ordering for multiprocessor synchronization as
well as single-processor vector codes. It appeared on the X-MP along
with multiprocessing and bidirectional memory. The C-90 and Triton support
some lighter-weight forms of CMR for intraprocessor ordering.
See also pseudo-CMR.
Compressed Iota
See Iota, Compressed.
Compression Loop
A loop with a conditionally-incremented induction variable of the form
    J = 0
    DO I=1,N
      IF (...) THEN
        J = J + 1
        A(J) = B(I)
      END IF
    END DO
Easily vectorized by computing a mask from the IF condition,
using it to generate a compressed iota vector, doing a gather from B(),
and then a regular strided store into A().
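As a rough model of that recipe, the Python sketch below spells out the steps on ordinary lists. The IF condition is elided in the loop above, so the predicate cond (applied here to the elements of B) is a stand-in, and both function names are hypothetical.

```python
def compress_loop(b, cond):
    """Model of the vectorized compression loop for one strip.

    Returns (a, j): the packed values and the final value of J.
    """
    # Step 1: compute a mask from the IF condition, one bit per iteration.
    mask = [1 if cond(x) else 0 for x in b]
    # Step 2: compressed iota with stride 1, i.e. the indices of set bits.
    index = [i for i, bit in enumerate(mask) if bit]
    # Step 3: gather from B through the index vector.
    gathered = [b[i] for i in index]
    # Step 4: ordinary unit-stride store into A(1..J).
    a = list(gathered)
    return a, len(index)


def compress_loop_scalar(b, cond):
    """The original scalar loop, for comparison."""
    a, j = [], 0
    for x in b:
        if cond(x):
            j += 1
            a.append(x)
    return a, j
```

Both versions return the same packed array and the same final J; the vector form simply never executes a conditional branch per element.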
Data Dependence Analysis
The analysis of the address computation expressions in a pair of memory
references to determine, as precisely as possible, the conditions in which
they refer to the same word.
Usually defined in terms of relations between loop index variables.
Unless data dependence analysis can prove two references are
independent, there will be an
ordering relationship that the compiler has to observe.
Density Time Operation
In a proposed Cray architecture in which every vector operation was
controlled by a mask, only the operations corresponding to set mask
bits would consume any time. If one bit
was set in the mask, the operation would take one clock.
Early Exit Loop
An inductive loop that may not complete all of its iterations due to a
loop-terminating conditional jump, as in
    DO I=1,N
      D(I) = E(I)
      IF (A(I) .LT. 0) GO TO 666
      B(I) = C(I) * X
    END DO
In general, compilers can vectorize early exit loops when it is safe to
over-evaluate the IF condition and the dependence graph permits the IF
condition(s) to be moved to the top of the loop. Multiple IFs can be
handled by ORing masks together and applying a leading zero count to
find the first set bit in any mask.
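Here is a minimal Python sketch of how one strip of the example loop might be handled, assuming it is safe to over-evaluate the condition on A. The function names and the 64-bit mask layout (element 0 in the most significant bit) are illustrative choices, not a description of any particular compiler's output.

```python
def leading_zero_count(word, width=64):
    """Number of zero bits above the most significant set bit."""
    return width - word.bit_length()


def early_exit_strip(a, c, e, x, b, d, width=64):
    """Vector-style model of one strip (len(a) <= width) of the loop.

    Updates b and d in place; returns True if the exit was taken.
    """
    n = len(a)
    assert n <= width
    # Over-evaluate the IF condition for the whole strip; element 0
    # maps to the most significant mask bit.
    mask = 0
    for i in range(n):
        if a[i] < 0:
            mask |= 1 << (width - 1 - i)
    first = leading_zero_count(mask, width)  # equals width if no bit set
    exited = first < n
    d_len = first + 1 if exited else n  # D(I) is stored before the test
    b_len = first if exited else n      # B(I) is stored after the test
    d[:d_len] = e[:d_len]
    b[:b_len] = [ci * x for ci in c[:b_len]]
    return exited
```

With several exit tests in the loop, the masks would be ORed together before the leading zero count, as described above.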
Element
A 64-bit part of a vector register, analogous to an array element.
The Cray-1 and its descendants allow values to be moved between scalar
registers and indirectly indexed elements of vector registers.
Expansion Loop
The simple expansion loop is an analogue to the compression loop above.
A more challenging case is the "expand with drag" loop:
    X = ...
    J = 0
    DO I=1,N
      IF (...) THEN
        J = J + 1
        X = B(J)
      END IF
      A(I) = X
    END DO
The Cray-3 compiler was able to vectorize this loop using a
scatter instruction with a carefully constructed index vector.
The continuous iota function
proposed for a new vector instruction set
architecture has a natural application to the expansion-with-drag loop.
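One way to see why a running count of mask bits fits this loop is the Python sketch below. It is not the Cray-3 compiler's scatter-based method, whose details are not given here; it just models the loop's result with a running count, a gather, and a merge. The function names and the list-of-bits mask are assumptions for illustration.

```python
from itertools import accumulate


def expand_with_drag(b, mask, x0):
    """Vector-style model: mask[i] is the IF condition for iteration i,
    x0 is the value of X on entry to the loop."""
    # Running count of set mask bits, including the current one:
    # the value of J at the point where A(I) = X executes.
    j = list(accumulate(mask))
    # Gather B(J) where J >= 1; merge in the incoming X where J is 0.
    return [b[k - 1] if k > 0 else x0 for k in j]


def expand_with_drag_scalar(b, mask, x0):
    """The original scalar loop, for comparison."""
    a, x, j = [], x0, 0
    for bit in mask:
        if bit:
            j += 1
            x = b[j - 1]
        a.append(x)
    return a
```

For example, with mask bits 0 1 0 0 1 1 0 and B = 10, 20, 30, each value of B is "dragged" across the iterations until the next set mask bit.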
Functional Unit
See Unit, Functional.
Gather
A vector load instruction in which the addresses of the elements are computed
as the sums of an invariant scalar base address and the elements of a vector
register. Gathers are used to vectorize nested array references:
    DO I=1,N
      A(I) = B(IX(I))
    END DO
By using a zero scalar base address, gathers can be used to vectorize C
language pointer dereferences. They can also be used in IF statement
vectorization (with index vectors computed from compressed iota operations).
Apart from the Cray-1 models and some early X-MPs, all Cray vector machines
have supported gather and scatter instructions, though often with performance
limitations on multiple simultaneous gathers or scatters.
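The addressing rule can be modeled in a few lines of Python. This is only a sketch over a flat list standing in for word-addressed memory; gather, vstore, and indexed_copy are hypothetical names, and IX is assumed to hold 1-based Fortran subscripts.

```python
def gather(memory, base, index):
    """Vector load: element k comes from address base + index[k]."""
    return [memory[base + i] for i in index]


def vstore(memory, base, stride, values):
    """Strided vector store: element k goes to address base + k*stride."""
    for k, v in enumerate(values):
        memory[base + k * stride] = v


def indexed_copy(memory, base_a, base_b, ix):
    """Model of A(I) = B(IX(I)) for one strip."""
    # IX holds 1-based subscripts, so the scalar base is base(B) - 1.
    values = gather(memory, base_b - 1, ix)
    vstore(memory, base_a, 1, values)
```

Passing a base of zero to gather, with full addresses in the index vector, corresponds to the pointer-dereference use mentioned above.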
Gather, Double
A Triton feature that exploits the common need to do multiple gathers
with the same index vector, as in this loop:
    DO I=1,N
      A(I) = B(1,IX(I)) + B(2,IX(I))
    END DO
The Triton's double gather instruction uses one index vector and two
base address values to gather two separate vectors at once. Due to severe
instruction set encoding restrictions, the double gather instruction has to use one register designator field to
denote both the index vector and one of the base address registers!
Hold Issue
Cray's proprietary processors fetch, issue, and execute instructions in order
(though they complete out of order). An instruction holds issue if its
result register cannot yet be written, an operand register is not yet available,
or if the instruction's functional unit is busy.
IF conversion
Replacing IF statements in loops through the use of
masked vector operations,
gather/scatter, or the
merge-store technique.
Induction Variable
An object with loop-invariant addressing (usually just a scalar variable) whose values
during loop execution constitute a well-behaved sequence.
A DO-loop control variable
is an induction variable, of course.
Iota, Compressed
A function of a mask and a fixed stride that creates a vector value from these scalar
quantities. The result has as many elements as the mask has bits set. Each element's
value is the index of a mask bit (counting from the left, with the most-significant bit
being index zero) times the stride.
Examples of compressed iota:
| Mask bits | Stride | Compressed iota vector |
| 11111111... | 1 | 0 1 2 3 4 5 6 7 ... |
| 01010101... | 1 | 1 3 5 7 ... |
| 01010101... | 2 | 2 6 10 14 ... |
Most X-MPs and all of the X-MP descendants can generate a unit-stride
compressed iota vector as a side effect of mask generation (which is via
comparison of a vector to zero). The Cray-2 and Cray-3 had a separate
compressed iota instruction that corresponds to the definition here.
The Triton also has this general form now, but requires the
mask to be in the VM register rather than a
general-purpose scalar register.
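The definition is short enough to write down directly. In this Python sketch the mask is an integer whose most significant bit is index zero; the width parameter is an assumption added so the 8-bit examples from the table can be reproduced without writing out 64-bit masks.

```python
def compressed_iota(mask, stride=1, width=64):
    """One element per set mask bit: (index of the bit) * stride,
    with the most significant bit of the mask being index zero."""
    return [i * stride
            for i in range(width)
            if (mask >> (width - 1 - i)) & 1]


# The three table rows, truncated to 8 mask bits.
row1 = compressed_iota(0b11111111, 1, width=8)
row2 = compressed_iota(0b01010101, 1, width=8)
row3 = compressed_iota(0b01010101, 2, width=8)
```

The number of elements produced is the population count of the mask, which is why that count becomes the vector length of the conditional code.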
Iota, Continuous
A recent invention, not yet implemented in hardware, that would provide a
way of doing compression and expansion without the register-to-register
operations of Triton that are an implementation problem in multiple-pipe
processors. The continuous iota function creates a vector from a mask,
a length, and a fixed stride. Each element's value is the
running count of set mask bits, multiplied by the stride.
Examples of continuous iota:
| Mask bits | Stride | Continuous iota vector |
| 11111111... | 1 | 0 1 2 3 4 5 6 7 ... |
| 01010101... | 1 | 0 0 1 1 2 2 3 3 ... |
| 00001000... | 1 | 0 0 0 0 0 1 1 1 ... |
| 00001000... | 2 | 0 0 0 0 0 2 2 2 ... |
The elements of a continuous iota vector model the values taken by a
conditionally incremented induction variable.
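Judging from the examples in the table, the running count at each element covers the mask bits before that element, not the element's own bit. The Python sketch below follows that reading; the list-of-bits mask (whose length plays the role of the length operand) and the function name are illustrative assumptions.

```python
def continuous_iota(mask_bits, stride=1):
    """One element per mask bit: (count of set bits seen so far) * stride.

    The count is taken before the current bit is examined, which matches
    the value of a conditionally incremented induction variable at the
    top of each iteration.
    """
    out = []
    count = 0
    for bit in mask_bits:
        out.append(count * stride)
        count += bit
    return out
```

Unlike compressed iota, the result always has as many elements as the mask has bits, so no compression or expansion of the vector itself is needed.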
Iota Substitution
A step in vectorization. After determining that a loop can be safely
vectorized, the uses of induction variables in the loop are replaced with
calls to iota functions. Expressions and addressing that use these iota
functions are then progressively simplified via algebraic identities.
Array references with vector-valued subscripts are then replaced with
gathers and scatters, unless their entire linearized addressing expressions
have become iotas, in which case they become strided memory references.
Example (not including strip-mining or other transformations):
    DO I=1,N
      A(I) = B(I*2)
    END DO
After linearization of addressing on a word-addressable machine:
    DO I=1,N
      store (base(A)+I-1, load (base(B)+I*2-1))
    END DO
After iota substitution and throwing the loop control away:
    scatter (0,
             base(A)+iota(-1,1)+1-1,
             gather (0, base(B)+(iota(-1,1)+1)*2-1))
After applying algebraic identities and collecting scalar terms:
    scatter (base(A),
             iota(-1,1),
             gather (base(B)+1, iota(-1,2)))
After replacing gather/scatter from full-mask iota indices with strided
load/store:
    vstore (base(A), 1, vload (base(B)+1, 2))
IVDEP
A compiler directive by which the programmer asserts that a loop has no
"vector dependences", essentially telling the compiler that any unresolved
data dependence questions are to be assumed safe.
The specific semantics of
an IVDEP directive have become messier as compilers have gotten
smarter.
Kernel Scheduling
A combined instruction scheduling and vector register assignment technique that
implemented polycyclic scheduling on the Cray-2 and Cray-3,
providing good performance on long loops despite the absence of chaining.
Latency
Transit time through a functional unit, or response time on a load operation
from memory.
Leading Zero Count
An integer operation provided in scalar form on all Cray Research proprietary
processors and in vector form on all but the earliest Cray-1 systems.
Returns the number of contiguous zero bits in the most significant positions
of the 64-bit word. If the most significant bit
of the argument is set, the result is zero; if the argument is zero,
the result is 64.
The leading zero count function is useful when scanning through bit masks.
It can also be used to find the rightmost set bit in a mask efficiently.
We even use it for the logical NOT operator of C ("!").
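The three uses just mentioned can be sketched in Python as follows. These are plausible formulations, not necessarily the instruction sequences Cray compilers emit; the function names are made up, and words are assumed to be unsigned 64-bit values.

```python
def lzc(x):
    """Leading zero count of a 64-bit word (64 when x is zero)."""
    assert 0 <= x < (1 << 64)
    return 64 - x.bit_length()


def rightmost_set_element(mask):
    """Index of the rightmost set bit, with the most significant bit
    being index 0; returns 64 for an empty mask."""
    isolated = mask & -mask  # keeps only the lowest set bit
    return lzc(isolated)


def logical_not(x):
    """C's "!" operator: 1 if x is zero, else 0.

    lzc(x) is 64 only for x == 0, and 64 is the only possible result
    with bit 6 set, so a shift by 6 yields the answer without a branch.
    """
    return lzc(x) >> 6
```

The branch-free logical NOT is the sort of trick that matters in vector code, where a conditional jump per element is not an option.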
Local Memory
A small (128KB) fast memory within each processor of a Cray-2 or Cray-3 system,
accessible only via special instructions executed by that processor
(hence the name).
Vectors could reference local memory only via unit stride loads and stores.
Local memory was useful for spilling vector registers in compiled code.
It was also used in hand-tuned libraries as a program-managed cache for codes
with temporal locality, effectively providing
a second memory port. Local memory could not be managed as a stack by
software due to restricted scalar addressing modes (no indirect+offset mode).
Generally frustrating to compiler writers; not carried forward into the Cray-4.
Mask
In general, a vector of single-bit values. Masks fit in one 64-bit scalar
register on most Cray machines, but need 2 registers on a C-90 or Triton,
where the maximum VL is 128.
Masks are generated as the results of integer comparisons between vectors and
zero, and also from IEEE floating-point comparisons on the most recent Tritons.
Masks are presently used as input to merge operations and to
the Triton's iota, compression, and expansion instructions.
In some proposed vector instruction set architectures, a vector mask would
control nearly every vector operation. Zero bits in the mask would cause
no-ops. Masked vector operation seems to provide a cleaner way to vectorize
IF statements than present techniques.
Merge, Vector
An operation between a vector, a mask, and either a scalar or another vector.
The elements of the vector result are selected from the operands by the mask
bits.
Merge-Store
A way to vectorize IF statements containing only stores to
arrays.
The new values to be stored are merged under mask with the old array
elements, so that the elements corresponding to false IF
conditions are just loaded, merged, and stored back unchanged.
This technique requires some redundant computation and care must
be taken to avoid spurious interrupts.
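A small Python model of the technique, using a hypothetical loop of the form IF (B(I) .GT. 0) A(I) = 2*B(I). The function names are illustrative, and the mask is modeled as a list of booleans.

```python
def vector_merge(mask, if_set, if_clear):
    """Select element k from if_set where mask[k] is true,
    otherwise from if_clear."""
    return [s if m else c for m, s, c in zip(mask, if_set, if_clear)]


def merge_store(a, mask, new_values):
    """Store new_values into a under mask by load, merge, store."""
    old = list(a)                               # load the old elements
    a[:] = vector_merge(mask, new_values, old)  # merge, store everything


def conditional_double(a, b):
    """Model of: IF (B(I) .GT. 0) A(I) = 2*B(I)."""
    mask = [x > 0 for x in b]
    new_values = [2 * x for x in b]  # computed for every element
    merge_store(a, mask, new_values)
    return a
```

Note that 2*B(I) is evaluated even where the mask is false. That is the redundant computation mentioned above, and it is also where a spurious interrupt could come from if the discarded operation were one that can fault.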
Outer Loop Vectorization
Vectorizing a level of a loop nest other than the innermost.
Pack Loop
See Compression Loop.
Partial Vectorization
Vectorizing a loop in the presence of a hard data dependence cycle,
leaving some operations in a scalar form. For example:
    DO I=2,N
      A(I) = A(I-1) + B(I)*C(I)*D(I)*E(I)
    END DO
would be partially vectorized by the Cray-3 compiler by using vector
instructions for this expression:
    B(I)*C(I)*D(I)*E(I)
and then using the elements of the result vector as operands to the scalar
operations of the inner part of the strip-mined loop nest.
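The shape of the resulting code can be sketched in Python as below. Arrays are 0-based here, so the Fortran loop from 2 to N becomes indices 1 to n-1; the function names and the strip parameter are assumptions for illustration.

```python
def partially_vectorized(a, b, c, d, e, strip=64):
    """Strip-mined model: the products are formed a strip at a time
    (the vector part); the recurrence on A stays scalar."""
    n = len(a)
    for start in range(1, n, strip):
        stop = min(n, start + strip)
        # Vector part: no dependence among these products.
        t = [b[i] * c[i] * d[i] * e[i] for i in range(start, stop)]
        # Scalar part: each A(I) needs the A(I-1) just computed.
        for k, i in enumerate(range(start, stop)):
            a[i] = a[i - 1] + t[k]
    return a


def fully_scalar(a, b, c, d, e):
    """The original loop, for comparison."""
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i] * c[i] * d[i] * e[i]
    return a
```

Three of the four operations per iteration move into vector instructions; only the addition in the dependence cycle remains scalar.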
Pipe
A collection of independent functional units and a slice of each vector
register. The C-90 and Triton are 2-pipe machines, meaning that each vector
register is interleaved across the two pipes (even elements on one,
odd elements on the other). Vector instructions are dispatched to all pipes
simultaneously, and, for most instructions, the pipes can execute independently.
4-pipe and 8-pipe machines have been proposed at Cray Research.
The NEC SX-4 is an 8-pipe machine. Performance studies show that the
payoff from multiple pipes drops off pretty quickly between 4 and 8 pipes.
Multiple pipes are generally invisible to the programmer, except for
the acceleration they provide to highly vectorized codes with longer
loop lengths.
Pipelining
Designing a functional unit so that a new operation can begin in each cycle,
even though any particular operation will require multiple cycles to complete.
If a unit has latency L,
a pipelined implementation will require L + N - 1 cycles for
N independent operations, as opposed to L * N
cycles for a non-pipelined implementation. Also called "segmentation"
of a functional unit (as in the CDC 7600).
Polycyclic Scheduling
In general, any form of instruction scheduling that causes multiple
iterations of a loop to be active in parallel. In scalar compilers,
polycyclic scheduling is often achieved by software pipelining or unrolling.
Vectorizing compilers for Cray machines with chaining need to use only
simple unrolling to get good performance from vector strip-mining loops,
but compilers for machines without chaining have required ambitious scheduling
techniques.
Population Count
A bitwise operation available in scalar form on all proprietary Cray
processors and in vector form on all but the earliest Cray-1 systems.
The population count of a 64-bit value is simply the number of bits set,
an integer in the range 0 to 64.
When vectorizing an IF statement via the compressed index method,
the population count of the mask becomes the vector length of the conditional
code.
The population count of the exclusive OR of two values is called the
"Hamming distance" and is useful in pattern matching. Population count can
also be used to implement an efficient "trailing zero count" function.
The rightmost bit of the population count is the (even) parity function.
Parity is so important to some customers that all Cray machines have also
provided a form of the population count function that returns only the
low-order bit.
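The derived functions mentioned in this entry can be sketched in Python as follows. The trailing zero count shown is one common formulation and is not claimed to be the sequence Cray software uses; all names are illustrative, and words are assumed to be unsigned 64-bit values.

```python
MASK64 = (1 << 64) - 1


def popcount(x):
    """Number of set bits in a 64-bit word."""
    return bin(x & MASK64).count("1")


def hamming_distance(x, y):
    """Number of bit positions in which x and y differ."""
    return popcount(x ^ y)


def trailing_zero_count(x):
    """Number of zero bits below the lowest set bit (64 when x is zero).

    ~x & (x - 1) has ones exactly in the positions below the lowest
    set bit of x, so counting them gives the answer.
    """
    return popcount(~x & (x - 1))


def parity(x):
    """Low-order bit of the population count."""
    return popcount(x) & 1
```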
Port
An abstraction of an independent path between a processor and the memory
system crossbar; like a "functional unit" for memory. The Cray-1 and Cray-2
processors had single memory ports that had to carry both loads and stores;
only one vector memory reference instruction could be active at a time.
The X-MP and its descendants are three-port machines in which two loads
and one store can all be simultaneously active.
The combination of multiple ports and multiple pipes provides the enormous
per-processor bandwidths of the C-90 and Triton machines.
Pseudo-CMR
Using the side-effects of a vector instruction to force some ordering of
operations without resorting to the relatively expensive CMR
instruction.
For example, to hold issue until a specific vector load has completed,
perform a dummy vector element transfer from the loaded vector into a
scratch scalar register.
PVP
A marketing term coined to distinguish Cray's symmetric parallel
systems from its then-new MPP machine. Stands for Parallel Vector Processor,
even though it pertains to a class of system architecture.
Recursion
A bug on the Cray-1 that turned out to be useful. As noted elsewhere,
the Cray-1 had a single element pointer per vector register.
If the same vector register was specified as
both an input and the output register of a computational instruction (say,
a floating-point addition), the elements of the result stream would be
used as the operand elements in all but the first few element positions!
Depending on the vector length and the latency of
the operation, a value might cycle through the unit seven or eight times.
This behavior turned out to be a good way of performing the final step of a
reduction loop, if you didn't mind having the correctness of
compiled code depend on a fixed latency assumption.
Reduction
A data dependence cycle in which a scalar variable is computed by a
commutative operation involving itself. Sum reductions, like:
    X = 0.0
    DO I=1,N
      X = X + A(I)
    END DO
are common and important. Reductions can be vectorized by doing a vector's
worth of partial sums in each strip, then adding up the vector's
elements after the loop is done.
This technique can change the results of reductions of floating-point addition
and multiplication, unfortunately, so Cray compilers have switches that
specifically disable vectorization of floating-point reductions.
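The Python sketch below models the partial-sum technique and shows a case where the reordering changes a floating-point result. The vector length of 2 in the example and the function names are chosen only to keep the illustration small.

```python
def sequential_sum(a):
    """The original scalar reduction."""
    x = 0.0
    for v in a:
        x = x + v
    return x


def strip_sum(a, vl=64):
    """Vector-style reduction: keep vl partial sums, one per element
    position, then add them up after the loop."""
    partial = [0.0] * vl
    for start in range(0, len(a), vl):
        for k, v in enumerate(a[start:start + vl]):
            partial[k] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# 1.0 is lost when added to 1e16 in 64-bit IEEE arithmetic, so the
# order of the additions matters for this data.
data = [1e16, 1.0, -1e16, 1.0]
in_order = sequential_sum(data)    # 1.0
reordered = strip_sum(data, vl=2)  # 2.0
```

Neither answer is "wrong" in any absolute sense, but they differ, and a user who expects bit-for-bit agreement with scalar code will notice.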
Remainder
When strip-mining, usually by a maximum hardware vector length,
there's a short strip that has to be run before or after the full-length strips.
Also comes up in unrolling.
Restructuring
A generic term for optimizing transformations that alter the execution of
loop nests, including interchange, reversal, distribution, fusion,
unrolling, peeling, strip-mining, and vectorization itself.
Restructuring transformations must preserve the semantics of the program,
and use dependence analysis to discover the ordering relationships that
have to be observed.
Runtime Dependence
See symbolic analysis.
Safe VL
Using vector strips of possibly limited length to permit vectorization
in the face of a loop-carried data dependence. The loop:
    DO I=11,N
      A(I) = A(I-10) * X
    END DO
can be safely vectorized so long as the strips are at most 10
elements long and proper bidirectional memory protection is in place between
the strips. In cases like:
    DO I=J,N
      A(I) = A(I+K) * X
    END DO
the safe VL must be computed at runtime. When there's more than one
potential dependence in the loop, runtime code computes the safe VL as a
MIN function.
The CFT77
compiler, which had difficulty duplicating code internally, even used
the runtime safe VL computation technique to handle potential data
dependences like:
    DO I=2,N
      A(I,J) = A(I-1,K) * X
    END DO
by computing a safe VL of 1 at runtime if J equals K,
otherwise using the maximum hardware vector length. Note that any loop
can be "vectorized" safely (with respect to data dependences) with a
strip length of one, though performance may suffer when compared to
scalar code.
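To see why the strip length matters, here is a Python model of a loop of the form A(I) = A(I-distance) * X run as vector strips, where each strip's load completes before its store begins. Arrays are 0-based, and the function names and parameters are illustrative assumptions.

```python
def run_in_strips(a, x, distance, vl):
    """A(I) = A(I-distance) * X executed as vector strips of length vl."""
    n = len(a)
    for start in range(distance, n, vl):
        stop = min(n, start + vl)
        # The whole strip is loaded before any element of it is stored.
        loaded = [a[i - distance] for i in range(start, stop)]
        a[start:stop] = [v * x for v in loaded]
    return a


def run_scalar(a, x, distance):
    """The original loop, one element at a time."""
    for i in range(distance, len(a)):
        a[i] = a[i - distance] * x
    return a


data = list(range(1, 21))
safe = run_in_strips(list(data), 2, 3, 3)      # strip length == distance
unsafe = run_in_strips(list(data), 2, 3, 64)   # strip too long
reference = run_scalar(list(data), 2, 3)
```

With a strip length no greater than the dependence distance, every load sees values that earlier strips have already stored, and the result matches the scalar loop. With a longer strip, some loads pick up stale values and the result differs.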
Scan Loop
Like a reduction, except that the stream of intermediate values is significant.
    DO I=1,N
      X = X + A(I)
      B(I) = X
    END DO
Scans are generally not vectorizable, but
partial vectorization
often can come into play. The term "scan" is from Iverson's APL
language (as are also "iota", "compress", "expand", and "reduction",
at least).
Scatter
The analogue of a gather for indirect
vector store operations. Cray hardware guarantees that scatters with
duplicate index vector elements are well-behaved.
Search Loop
A case of an early exit loop in which the
value of the loop
control variable is used after the loop. Historically difficult for Cray
compilers to handle, surprisingly.
Strided Reference
A memory reference to multiple objects, addressed via a base address
and a sequence of integer multiples of a fixed stride value.
A strided vector load is conceptually equivalent
to a gather from the same base address with an index vector generated
from a compressed iota from a saturated mask.
Some folks use the word "strided" only if the stride is discontiguous.
Strip-mining
A loop restructuring transformation in which a loop becomes a two-level
nest whose inner loop is limited to the stride of the outer loop. Thus:
    DO I=1,N
      ...
    END DO
becomes:
    DO I1=1,N,MAXSTRIP
      I2 = MIN (N,I1+MAXSTRIP-1)
      DO I=I1,I2
        ...
      END DO
    END DO
Vectorization for Cray hardware requires strip-mining because the
iteration space has to be partitioned into chunks that are no longer than the
length of a vector register.
After strip-mining, the operations of the inner loop are rewritten in
vector form and the inner loop itself, so recently created, is discarded.
Strip-mining has become useful on scalar machines with caches as part of a
loop restructuring transformation called "tiling", in which strip-mining
is coupled with loop interchange to optimize spatial and temporal locality.
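The bounds arithmetic of the strip-mined nest is easy to check with a few lines of Python. This sketch just enumerates the (I1, I2) pairs the outer loop would produce, using the same 1-based inclusive bounds as the Fortran; the generator name is made up.

```python
def strips(n, maxstrip=64):
    """Yield the inclusive (I1, I2) bounds of each strip for DO I=1,N."""
    for i1 in range(1, n + 1, maxstrip):
        i2 = min(n, i1 + maxstrip - 1)
        yield i1, i2


# For N = 150 and MAXSTRIP = 64 there are two full strips and a
# remainder strip of 22 iterations.
example = list(strips(150))
```

In this formulation the short remainder strip comes last; a compiler is free to run it first instead.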
Symbolic Analysis
A means of resolving potential data dependence questions that are
determined by the values of invariant scalars. For example,
    DO I=1,N
      A(I) = A(I+J) * X
    END DO
can be vectorized safely if J is nonnegative, or run with shorter
strip lengths if J is negative. Some compilers can answer these
questions from earlier IF statement conditions,
such as:
    IF (J .LT. 0) STOP
What cannot be deduced at compilation time can often be cheaply tested during
execution to select a version of code optimized under specific assumptions.
More generally, symbolic analysis is an approach to data dependence analysis
in which the compiler constructs a system of linear integer equations and
attempts to solve it without violating constraints.
If no solution is possible, no dependence exists. If a solution is possible,
the reduced system of equations can be translated into a runtime test.
The Cray-3's vectorizing compiler used symbolic analysis exclusively
for data dependence analysis after an efficient algorithm was discovered.
Tailgating
A capability of the Cray-3 and some very late model Cray-2 machines.
Tailgating allowed a vector instruction to begin writing the elements of a
vector register without waiting for a prior instruction that was reading the
vector to complete. This behavior is the opposite of chaining.
Tailgating effectively increases the number of vector registers,
since any vector register whose last use is in a given chime can be
immediately used for the result of a later instruction in the same chime.
This feature relieved register pressure, which was especially bad on the
Cray-2 and Cray-3 machines due to their need for ambitious polycyclic
scheduling for performance in the absence of chaining.
Triton
Internal code-name for the Cray T-90 product line. There are two kinds of
Triton processors. The first has a C-90 compatibility mode (which
unfortunately uses Triton memory ordering rules) and a Triton native mode
with Cray-1 floating-point arithmetic. There is also the "IEEE Triton"
processor that is incompatible with everything, since it uses IEEE 754
64-bit floating-point data formats. Unfortunately, though the IEEE Triton
required applications to be recompiled, the instruction set wasn't cleaned up;
it has the same small register sets of the Cray-1 architecture.
Unpack Loop
See Expansion Loop.
Unrolling
A loop transformation in which multiple loop iterations are executed together,
possibly with operations intermixed by scheduling.
Single-chime vector loops benefit from modest unrolling because register
conflicts can be avoided.
Unit Busy Time
The number of clock periods for which a vector instruction will reserve a
functional unit. On Cray machines, unit busy time is the vector length
divided by the number of pipes, plus a small overhead delay.
Unit, Functional
The part of the processor that computes a basic function.
Vector functional units are heavily pipelined. All Cray machines have had
separate floating-point addition and floating-point multiplication
functional units, with various other units dedicated to integer operations,
shifts, logicals, and so forth (often in combination).
A functional unit can be used at most once in a chime.
Update Loop
An important gather/scatter loop of the form
    DO I=1,N
      A(IX(I)) = A(IX(I)) + B(I)
    END DO
The B(I) term can be any vectorizable expression.
The problem with vectorizing the
update loop stems from the possibility of having duplicate entries in the
IX(I) array.
Cray compilers vectorize the update loop via a patented technique in which a
scatter of an iota vector is used to check for duplicates,
which are fixed up in out-of-line scalar code that relies on the
commutativity of addition.
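The general idea of using a scattered iota vector to detect duplicates can be illustrated in Python. This is only one plausible reading of the description above, not the patented technique itself, whose details are not given here. It assumes that when a scatter has duplicate indices the last element written wins, uses a dictionary as stand-in scratch storage, and takes IX as 0-based indices; the function name is hypothetical.

```python
def update_strip(a, ix, b):
    """Model of A(IX(I)) = A(IX(I)) + B(I) for one strip."""
    n = len(ix)
    iota = list(range(n))
    # Scatter the element numbers through IX into scratch storage.
    # Assumption: on a collision, the last element written wins.
    scratch = {}
    for k in iota:
        scratch[ix[k]] = k
    # Gather them back; an element that does not read back its own
    # number lost a collision with a later duplicate.
    survived = [scratch[ix[k]] == k for k in iota]
    # Vector part: gather, add, scatter (same ordering assumption).
    summed = [a[ix[k]] + b[k] for k in iota]
    for k in iota:
        a[ix[k]] = summed[k]
    # Scalar fix-up for the losers; the order in which they are applied
    # does not matter because addition commutes.
    for k in iota:
        if not survived[k]:
            a[ix[k]] += b[k]
    return a
```

When IX has no duplicates, every element survives and the fix-up loop does nothing, so the common case runs at full vector speed.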
Vector Register
A register file with many entries but only 1 read and 1 write port.
The elements are read in succession, and generally written that way
(exception: Cray-2 and Cray-3 memory loads).
Vectorization
The conversion of scalar loops into strip-mining loops with vector
instructions, in particular, and the necessary analysis and transformations
making the conversion legal, in general.
VFUNCTION
A feature of Cray compilers that indirectly allows the vectorization
of loops containing user function calls.
The VFUNCTION F directive informs the compiler that an
external function F exists in three versions:
- F, which uses the usual argument passing conventions
- F%, which expects argument values in the scalar
registers, and
- %F%, which expects vectors of corresponding argument
values in the vector registers.
Further, VFUNCTION permits the compiler to assume that the
external function is free of side effects.
The last release of the CFT77 compiler can generate
multiple versions of
a function with the right naming and argument passing conventions
to permit the user to write VFUNCTIONs in Fortran.
VL
The name of the Vector Length control register.
The value of VL determines the length of vector operations.
The maximum value of VL is the same as the number of elements
in the vector registers.
The interpretation of values stored into the VL register has changed from
one generation to the next.
| Value Stored | Cray-1 | X-MP, Y-MP | Cray-2 | Cray-3 | C-90, Triton |
| 0 | 64 * | 64 | 64 (0) | 0 | 128 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 64 | 64 | 64 | 64 (0) | 64 | 64 |
| 65 | 1 | 1 | 1 | 64 | 65 |
| 128 | 64 | 64 | 64 (0) | 64 | 128 |
| 129 | 1 | 1 | 1 | 64 | 1 |
NOTES: The Cray-1 had no instruction for reading the VL register.
Although the Cray-2 did have a read VL instruction, it returned 0
when the effective VL value was 64!
The Cray-3 allowed VL to be set to zero (turning vector instructions
into no-ops) and interpreted values greater than 64 as 64.
The C-90 and Triton processors have vector registers with 128 elements.
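The rows of the table are consistent with three simple rules, sketched in Python below. The rules (low six bits with zero meaning 64, saturation at 64 on the Cray-3, low seven bits with zero meaning 128) are inferred from the six rows shown and should be treated as an assumption for any other stored value. The function name and the machine strings are made up for illustration.

```python
def effective_vl(machine, stored):
    """Effective vector length after storing a value into VL,
    according to rules inferred from the table rows above."""
    if machine in ("Cray-1", "X-MP", "Y-MP", "Cray-2"):
        low = stored & 0o77        # only the low six bits count
        return low if low else 64  # zero means a full 64 elements
    if machine == "Cray-3":
        return min(stored, 64)     # zero is allowed; values above 64 act as 64
    if machine in ("C-90", "Triton"):
        low = stored & 0o177       # only the low seven bits count
        return low if low else 128
    raise ValueError("unknown machine: " + machine)
```

This models only the effective length. What a read of VL returns is a separate matter, as the note about the Cray-2 above shows.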
VM
The name of the Vector Mask control register.
VM receives the results of comparison instructions and is an operand
to merge instructions, as well as Triton's iota, compression,
and expansion instructions.
It is possible to transfer between VM and the general
purpose scalar registers.
VM has as many bits as vector registers have elements, since it is
conceptually a vector of 1-bit values.
The most significant bit of VM (as interpreted as an integer)
corresponds to element 0 of the vectors.
On the C-90 and Triton, VM occupies two 64-bit words.