Using clusters for anything other than running lots of small jobs on the clusters' individual nodes - this type of activity is nowadays called ``capacity computing'' - is highly non-trivial and calls for considerable computing skills. This non-trivial activity is often referred to as ``supercomputing'', but it differs markedly from supercomputing sensu stricto and is more broadly applicable too. Whereas real supercomputing is concerned first and foremost with number crunching and is best done on highly specialized machines that are not based on common business CPUs, cluster ``supercomputing'' is often focused on manipulating very large amounts of data, as in data mining, for example, or in managing and manipulating very large multi-terabyte databases.
A more general term, which is perhaps more appropriate, is ``High Performance Computing'' or HPC for short. HPC is not ``super''. It is a notch below, but it is broader, both in terms of applicability and availability. Whereas supercomputers are rare, expensive (though not expensive in terms of $/GFLOPS - this is the main point about these highly specialized machines) and available to a handful of institutions only, HPC systems such as clusters of various sizes, including clusters of SMPs, are very common. You will find them in laboratories, in corporate computing rooms, at Internet providers' shops, and sometimes even in hobbyists' attics. Some vendors, e.g., IBM and Gateway, offer access to their own clusters to customers who need to use such facilities occasionally, but don't want the bother of having to buy, configure and maintain them in their own machine rooms.
So, who is this course for? This course is for people who need to learn how to use clusters to manipulate very large data sets in the High Performance Computing style, not ``chunk by chunk'' but the whole lot, all at once. Note the phrase ``need to''. This is important, because such HPC data manipulation is not easy. There is a lot of difficult stuff you will have to learn. High Performance Computing on clusters is hard, tedious and wasteful of researchers' time. If you can do your job in other ways, without having to use HPC techniques, do so. Resort to ``capacity computing'' if you can, that is, if your particular problem type, data structures and data processing methodologies let you do this.
But if you're interested in parallel databases, data mining, or any other activity that is concerned with analyzing, processing and modifying a single gigantic data set, a data set that does not fit in the memory or on the disk of a single computer, then this is the course for you.
Now, if you are interested primarily in ``capacity computing'', I can hear you ask the question ``Where am I going to learn how to do this?''. The point is that if you have the level of skills required for this course (see section 1.2, which discusses it), you already know how to do this. All you need here is to know how to edit, compile, link and run a UNIX program, and how to write simple UNIX shell scripts. You can then write a trivial shell script that distributes your program with arbitrary initial data to an arbitrary number of cluster nodes, or you can use the AVIDD batch queue system, PBS, to do the same in a way that is more considerate of other users (and less vulnerable to skulker). If you would like to learn how to do the latter, you can attend the course up to chapter 4 and then simply drop out.
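To give a flavor of what such a ``trivial shell script'' might look like, here is a minimal sketch. It is not taken from the course. The function name distribute, the node names, the program path and the input.N/output.N file naming are all hypothetical, and the sketch assumes that you can ssh to the nodes without a password and that your home directory is shared across them.

```shell
#!/bin/sh
# Sketch only: node names, program path and file naming are hypothetical.

# distribute PROGRAM NODE...
# Starts one copy of PROGRAM on every listed node, giving the copy on the
# i-th node its own input file (input.1, input.2, ...), then waits for
# all copies to finish.
# LAUNCH defaults to ssh; set LAUNCH=echo to print what would be run.
distribute() {
    program=$1
    shift
    i=0
    for node in "$@"; do
        i=$((i + 1))
        # Run in the background so that all nodes work at the same time.
        ${LAUNCH:-ssh} "$node" "$program input.$i > output.$i 2>&1" < /dev/null &
    done
    wait    # block until every background job has finished
}
```

You would call it as, for example, distribute ./my_program node01 node02 node03, after a dry run with LAUNCH=echo to check the generated commands. Note that a script like this grabs the nodes without asking anybody, which is exactly why going through the batch queue system is the more considerate option.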