Group 4 -- Address units
The machine itself is divided into six major subsystems
- Instruction component
- Address component
- Scalar component
- Vector component
- I/O component
- Instruction Component
Cray 1 instructions are 32 or 16 bits, so from 2 to 4 instructions can be packed into a word. Instructions are thus addressed on 16-bit boundaries while data is addressed on 64-bit boundaries.
The instruction unit has four 16-word instruction buffers, three instruction registers, and one instruction counter. Each 16-bit field in a word is called an instruction parcel.
The three instruction registers are
- Next Instruction Parcel -- holds first parcel of the next instruction, prefetched from buffer
- Current Instruction Parcel -- holds the high-order portion of the instruction to be issued
- Lower Instruction Parcel -- holds low-order portion of instruction to be issued
For a 32-bit instruction, the low-order portion is fetched to the NIP and then moved to the LIP. There is no mechanism for discarding instructions in the pipe -- once in the CIP/LIP, they will be issued. At most they will be delayed for some time.
The instruction buffers are tied to the memory via the 16-way interleaving, so it is possible to fill a buffer in 4 clock cycles (recall that the clock is 12.5 ns and memory is 50 ns). Buffers are filled on a demand basis in a round-robin pattern. They thus act as an instruction cache of 256 instructions, organized into four lines of 64 instructions. Each buffer has its own address comparator, so we would call this a fully associative cache (easy to implement when there are only 4 lines). The buffers cannot be written to -- a write bypasses the instruction cache and only goes to main memory.
Scalar instruction issue requires that all of the instruction's required resources be free -- otherwise the instruction waits. Vector instruction issue in the Cray involves reserving functional units, including memory, operand registers and result registers, and then releasing an instruction once all of its resources are available. In addition, some data paths are shared between the vector and scalar components, and these must be available.
The control unit is able to detect when a result register for one vector operation is an operand for another vector operation and, if the two vector instructions do not conflict in any other resource requirements, it sets up a vector chaining operation between the two instructions.
There are 8 24-bit address registers, 64 24-bit spill registers, an adder, and a multiplier in this component. Its purpose is to perform index arithmetic and send the results to the scalar and vector components so that they can fetch the appropriate operands.
Arithmetic is performed on the address registers directly. The spill registers are used to hold address values that do not fit into the address registers. A set of 8 addresses can be transferred between the address registers and their spill registers in a single cycle. Thus, they bear a certain similarity to the register windows of the SPARC (or vice versa). The spill registers can be thought of as an explicitly managed data cache with 8 lines. Their value is that they reduce the traffic to main memory, freeing that resource for vector operations.
Similar to the address component, the scalar component has 8 64-bit registers and 64 64-bit spill registers. It has sole access to four functional units: Integer Add, Logical, Shift, and Population Count. The Scalar Component also has access to three functional units that are shared with the Vector Component: Floating Add, Multiply, and Reciprocal Approximation.
Because the scalar component has its own integer units, it can always execute integer operations in parallel with a vector operation. However, for floating point, the vector unit takes priority.
The are 8 64-word vector registers in the vector component. It takes four memory loads to fill a vector register. Normally, this would require 16 instruction cycles. However, careful pipelining in the memory unit reduces the time to just 11 cycles.
A vector mask register contains a bit-map of the elements in a register operand that will participate in an instruction. A vector length register determines whether fewer than 64 operands are contained in a set of vector operands. Manipulating these values is the primary reason for the population and leading zeros counter.
Vector loads and stores specify the first location, the length, and the stride.
The I/O component has 24 programmable I/O channel units. I/O has the lowest priority for memory access.
- Extended the Cray-1 architecture to 4-way multiprocessing.
- Cycle reduced to 8.5 ns (117 MHz)
- Increased instruction buffers to 32 words
- Added a multiport memory system.
- Redesigned the vector unit to support arbitrary chaining.
- Added Gather/Scatter to support sparse arrays.
- Increased memory to 16 M words, 32-way interleave
- Provides a set of shared registers to support fine-grained (loop-level) multiprocessing. There are N+1 sets of these registers for an N-processor system. They include eight address registers, 8 scalar registers, and 32 binary semaphores.
- The I/O system was improved and a solid state disk cache was added.
- Extends the X-MP architecture to 8 processors.
- Cycle reduced to 6 ns (166 MHz)
- Extends memory to 128 M words
- One foreground and four background processors.
- 4.1 ns cycle (244 MHz)
- Up to 256 M words of memory
- 64 or 128 way interleave depending on configuration
- Eliminates the spill registers in favor of a 16K word cache
- Cache feeds all three computational components with 4-cycle access time
- Has 8 16-word instruction buffers
- Foreground processor controls the I/O subsystem, which has up to 4 high speed ] communication channels (4 Gb/s).
Practical Considerations in Supercomputer Design
To achieve such high speeds, high-power (i.e. hot) drivers are employed, signals are detected with specialized analog circuits, conductors are all shielded and precisely tuned in both impedance and length, and data is encoded with error-correcting so that losses can be recovered.
In addition, the circuits are usually designed to operate in balanced mode so that there is no change in power drawn as drivers switch. As one driver switches from low to high, another switches from high to low, so that the power supply sees a DC load and there is no coupling of switching noise back into the logic via the power supply. In addition, using balanced signal lines can increase the signal to noise ratio by 6dB, although these are not often used. In a design such as the Cray-1, roughly 40% of the transistors supposedly do nothing but balance the power loading.
Even so, these machines dissipate large amounts of heat. The IBM 3090 uses special thermal conduction modules in which a multichip substrate is mounted in a carrier with built-in plumbing for a chilled water jacket. CDC used a similar system in its designs, and on one instance a maintenance crew pumped live steam through the building air conditioning system, which crossed over to the processor, with predictable results. This raises the issue that these machines usually need thermal shut-down systems, and possibly even fire suppression gear.
The Cray-1 series uses piped freon, and each board has a copper sheet to conduct heat to the edges of the cage, where freon lines draw it away. The first Cray-1 was in fact delayed six months due to problems in the cooling system: lubricant that is normally mixed with the freon to keep the compressor running would leak through the seals as a mist and eventually coat the boards with oil until they shorted out.
The Cray-2 is unique in that it uses a liquid bath to cool the processor boards. A special nonconductive liquid (flourinert) is pumped through the system and the chips are immersed in this.
Special fountains aerate the liquid, and reservoirs are provided for storing the liquid when it is pumped out for service. This is somewhat remeniscent of the oil cooling bath that was sometimes used in magnetic core memory units.
As a final note, Lawrence Livermore National Labs has announced that it will henceforth buy no more vector supercomputers. The handwriting is clearly on the wall for this breed of system, and all of the major manufacturers are moving, finally, to parallel processing.
|5 star5 star (0%)||0%|
|4 star4 star (0%)||0%|
|3 star3 star (0%)||0%|
|2 star2 star (0%)||0%|
|1 star1 star (0%)||0%|