AVIDD est omnis divisus in partes quartes quarum una in Bloomington, alia in Indianapolis, tertia in Gary locatur. Pars quarta quoque in Indianapolis locatur, sed restricta est.
This is how Julius Caesar might have described the AVIDD cluster before vanquishing it, burning it to cinder and enslaving its administrators: it is made of four parts, of which the first resides in Bloomington, another in Indianapolis and the third one in Gary. The fourth part lives in Indianapolis too, but is accessible to IU Computer Science researchers only.
The IUB component of the cluster has 97 computational nodes, 4 file serving nodes (too few for a cluster of this size and, especially, for a cluster dedicated to IO), and 3 head nodes. The IUPUI component is similarly configured, i.e., it has 97 computational nodes, 4 file serving nodes and 3 head nodes.
The IUN component has only 8 nodes. It is supposed to be used for teaching, e.g., for a course like I590, but I cannot reach it from Bloomington (it may be firewalled), so I haven't had a chance to look at it yet.
All AVIDD nodes we are going to work with are 2-way IA32 (Pentium 4) SMPs and run Linux 2.4. So, in effect, we have 194 CPUs dedicated to computations on each of the two major IA32 components, and you can submit jobs that span both components, utilizing all 388 CPUs. These CPUs run at 2.4 GHz, and their floating-point unit registers are 128 bits long. They support streaming SIMD (Single Instruction Multiple Data) instructions. This means that you can perform, in theory, up to 931.2 billion quadruple precision floating point operations per second (388 CPUs times 2.4 GHz) on the whole 2-component cluster, or double that in double precision, if you pack your instructions and data movement within the floating point processor very cleverly.
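The arithmetic behind these numbers can be reproduced in a few lines. What follows is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes that each CPU completes one operation on a full 128-bit register per clock cycle, which is the assumption the 931.2 figure appears to rest on.

```python
# Back-of-the-envelope peak rate for the two major IA32 components.
# Assumption (hypothetical, for illustration): each CPU completes one
# operation on a full 128-bit register per clock cycle.

NODES_PER_COMPONENT = 97   # computational nodes at IUB, and again at IUPUI
CPUS_PER_NODE = 2          # 2-way SMP nodes
COMPONENTS = 2             # IUB + IUPUI
CLOCK_GHZ = 2.4            # CPU clock rate


def total_cpus(nodes=NODES_PER_COMPONENT, cpus_per_node=CPUS_PER_NODE,
               components=COMPONENTS):
    """Number of CPUs available to a job spanning the given components."""
    return nodes * cpus_per_node * components


def peak_gops(cpus, clock_ghz=CLOCK_GHZ, ops_per_cycle=1):
    """Theoretical peak in billions of operations per second."""
    return cpus * clock_ghz * ops_per_cycle


cpus = total_cpus()
print(cpus)                                        # 388
print(round(peak_gops(cpus), 1))                   # 931.2
print(round(peak_gops(cpus, ops_per_cycle=2), 1))  # 1862.4
```

Setting `ops_per_cycle=2` models the "double that in double precision" case, where two 64-bit operands are packed into one 128-bit register.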
This performance will collapse dramatically if you have to feed your data into the floating-point unit from the memory of the computer, because, first, the memory bus is only 32 bits wide, and second, it takes a great many cycles to move a 32-bit word from memory to the CPU. Nevertheless, our very clever computer scientists managed to wrench more than a TFLOPS from the combined 4-component cluster on a LINPACK benchmark. On a normal off-the-shelf parallel application, running on the whole cluster, you should expect about 5% of this, i.e., about 50 GFLOPS. With a lot of optimization you may be able to get it up to about 80 GFLOPS. The trick is to load as much data as possible into these 128-bit long registers of the floating point unit, and then keep the data there as you compute on it, while moving data in the background between memory and higher level caches.
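To put the sustained figures in perspective, here is a rough sketch of the same arithmetic. It assumes a round 1000 GFLOPS for the LINPACK result (the text only says "more than a TFLOPS"), and the helper names are hypothetical.

```python
# Rough efficiency arithmetic for the sustained-performance estimates.
# Assumption: LINPACK result rounded down to 1000 GFLOPS for illustration.

LINPACK_GFLOPS = 1000.0


def sustained_gflops(reference_gflops, efficiency):
    """Expected sustained rate given a fraction of the reference rate."""
    return reference_gflops * efficiency


def efficiency(achieved_gflops, reference_gflops):
    """Fraction of the reference rate that an application achieves."""
    return achieved_gflops / reference_gflops


# An off-the-shelf application at about 5% of the LINPACK figure.
print(round(sustained_gflops(LINPACK_GFLOPS, 0.05), 1))  # 50.0

# A heavily optimized application at about 80 GFLOPS.
print(round(efficiency(80.0, LINPACK_GFLOPS), 3))        # 0.08
```

In other words, under this assumption even a lot of optimization moves an application from roughly 5% to roughly 8% of the LINPACK figure.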
We will not go into any of this, because it is a complete waste of time unless you are a well-paid professional programmer optimizing a commercial application for a commercial customer.
Always read the small print at the bottom of the page!
This is where you may find some really interesting stuff, like the fact that I am an alien in a human shape, with a body made of organic stuff that is quite worn out by now, so that it is discolored, wrinkled and falls off in places. We arrived here in the Solar System shortly after what you call World War II and have been here ever since. My real body looks a little like a large octopus and was engineered to be resistant to vacuum, high radiation doses, temperature extremes and weightlessness, and right at this moment it is floating in front of a servo-robot manipulator in a spaceship hidden at the Lagrange point behind your moon. The manipulator is tele-linked to the body you see in front of you in the class. There is a slight delay between my instructions and the body's responses, which may make it seem to you that your professor has slow reactions and is absent-minded. Since this is typical of your faculty in general, we never had to do much about it.