We won't have enough time left this semester to cover this very important and interesting topic in depth. An introductory lecture, for which the notes are provided in this chapter, will be given on the 18th of November, because the AVIDD cluster may not be available then on account of the SC2003 conference.
A data base is a suite of software utilities for maintenance of and operations on tables of data. If the data is stored as a table, a great many optimizations for data searches, updates, insertions, deletions, and other operations, including storing the data itself, can be implemented. A single data base will normally contain numerous tables, but a single table cannot usually belong to more than one data base.
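To make the four basic table operations concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL, only because it needs no server and runs anywhere; the table name `measurements` and its columns are made up for illustration.

```python
import sqlite3

# In-memory data base: nothing is written to disk, no server is needed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A hypothetical table; every row describes one measurement.
cur.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, sample TEXT, value REAL)"
)

# Insertion of three rows.
cur.executemany(
    "INSERT INTO measurements (sample, value) VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("c", 3.5)],
)

# Update of one row.
cur.execute("UPDATE measurements SET value = ? WHERE sample = ?", (4.0, "b"))

# Deletion of one row.
cur.execute("DELETE FROM measurements WHERE sample = ?", ("a",))

# Search: all rows with a value above 3.0, sorted by sample name.
cur.execute(
    "SELECT sample, value FROM measurements WHERE value > ? ORDER BY sample",
    (3.0,),
)
rows = cur.fetchall()
print(rows)

conn.close()
```

The same SQL statements should carry over to other data bases with little change, although the way you connect and the placeholder syntax differ from one client library to another.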
Data bases can store data on disk drives formatted specially for their use - this is normally the most efficient way and this is how corporate computer centers usually go about it - or on the native file system. The latter is almost certainly going to be the case on a research system such as AVIDD, where administrators would be loath to sacrifice general purpose disk space and to format specialized partitions for the data base.
A data base is a typical example of a client-server system. Every data base has a data base server that is always up, waiting for connections, and that takes care of the data base itself. It is quite OK to think of the server and the data it guards as the data base. In order to access the data, or even to insert the data into the data base, you have to use a data base client. The client normally runs on a machine other than the machine on which the server runs, although a data base administrator may use a client running on the same machine in order to manage the data base.
Data bases are the darlings of corporate computing. You'll find them in just about every company beginning with a small real estate agent operation, where a single data base would probably run under Windows on a single PC server, all the way up to major corporations that often maintain hundreds, even thousands, of data bases running on mainframes, parallel computer systems such as IBM SP, and large SMPs like HP Superdome or Sun Starfire.
On the other hand, data bases have not made great strides in scientific computing, because of their cost, the fuss associated with setting them up and maintaining them, their inflexibility (once you have defined a table, you are stuck with it, and, besides, not every type of data fits into a table), and because they are client-server systems. So, almost exactly what endears data bases to businesses has been a huge turn-off for scientists and hackers.
But this is changing.
For many years there were no freeware data bases of quality. Once upon a time there was a data base called Ingres, which was distributed almost for free with BSD UNIX - ``almost'', because you had to pay some token money for BSD UNIX (about $500 per campus). But this was a long time ago (about 20 years or so - some of you may not have even been born yet). Ingres programmers went commercial later and even though the earlier version of Ingres remained free, it was very antiquated and unsupported.
There was a good reason for this state of affairs of course. The reason was that writing data bases was such a lucrative, but also difficult, undertaking that no programmer would miss the opportunity to make millions of dollars in the process.
But today we have some freeware data bases, and at least one of them, that I know of, is very good indeed. The data base is called PostgreSQL and it is developed and maintained by an international community of dedicated (usually academic) programmers, much like Linux. PostgreSQL is distributed with Linux and with Cygwin. Even though it is not installed on the AVIDD cluster at the time of this writing, you can always request it and the AVIDD administrators should have no problems adding it to the pool of available software on all AVIDD nodes - although most likely you will just need PostgreSQL clients and client libraries on the computational nodes and the data base itself may run either on a selected AVIDD node or even outside of AVIDD. PostgreSQL has its WWW page at http://www.postgresql.org/.
The free availability and ease of use of PostgreSQL made many scientists change their minds about data bases. ``Well, if it's free and I can have it on my PC or on my Linux cluster, then I may just as well have a look at it and think what I could do with it.''
The other reason is that the two great data bases, namely Oracle and DB2, are free to academics too, at least in this country. For example, if you enroll in the DB2 Scholars' Program with IBM, you'll get the whole DB2 shipped to you in a box and you don't have to pay a cent for it. There is a similar Oracle program too. Both DB2 and Oracle are insanely great products and you should not believe people who will tell you that you should stay away from one or the other. But both can be pretty pricey too, the moment you leave academia. Microsoft's SQL-Server is not free to academic institutions though (at least at the time I'm writing these words and as far as I know) and it is not quite in the same league as DB2 and Oracle. The latter two run on everything: PCs, Windows, Linux, Suns, HP systems, IBM systems, you name it. But Microsoft's SQL-Server runs under Windows only (although you can connect to it from non-Windows clients) and it has had quite a number of very serious security problems recently.
Next, it ought to be said too that science has evolved recently in the direction where problems acquire great complexity and structural richness. Take, for example, epidemiology - a typical area of medical science where data bases are quite essential. In fact anything to do with medicine involves data bases, because you need to keep patients' records somewhere and you need to carry out searches on the records and other queries against such an information system. The Regenstrief Institute in Indianapolis has built and maintains the largest medical information system in the world.
Another example of data bases being used in science is the data bases at SLAC, which are used to store data pertaining to high energy physics events observed in SLAC experiments. Every event is described by means of a table row. The data bases they have are amongst the largest in the world, because even a very short experiment can easily generate trillions of events worth keeping in the data base. The manipulation of and searches on such gigantic data bases, which can very quickly grow to tens, even hundreds of TBs, are an extremely difficult undertaking. Even the largest business data bases seldom grow above some hundreds of GBs. The reason for the latter is that business data normally has to be typed in by human operators, whereas data flows into the SLAC data bases directly from measuring instruments.
Library science is an ``in-between'' category. Normally it should be more like business, because library catalogues, which are data bases too, have to be constructed by human operators. But more recently methods have been developed to extract abstracts, keywords and other cataloguing information from electronic texts automatically. This makes the task of feeding bibliographic data into the catalogues a little more like the SLAC systems. On the other hand, the texts themselves still have to be written by humans, which limits the size of the task dramatically, because humans are slow typists. If you took the whole National Library collection and made an electronic copy of every book they have there (not in terms of images, but rather by converting every letter to the corresponding ASCII character) - you would end up with at most a few TBs. On the other hand, the moment you add images, sounds and videos to the collection, the amount of data skyrockets very quickly. But cataloguing images and videos is a tricky business, because it's very difficult to tell a computer ``what to look for''.
Between physics and library science just about every other research discipline can benefit from the use of data bases. In most cases the data bases are probably going to be rather small - they should easily fit onto a single PC. Just today I have seen a Dell advertisement in the Sunday newspaper about a small deskside PC with a 140GB drive, 512MB memory and a very fast 2.3GHz CPU for some $700 or so. A system like this should be more than sufficient for 99% of data bases that small research groups would like to concoct - especially if the data is to be entered manually or semi-automatically.
A very typical personal research data base is a list of citations, which can be accessed from your TeX document. Such data bases are constructed laboriously over years of the researcher's life, with much pain, devotion, and lots of typos. They're treasured like the Holy Grail, but they are trivial by data base standards. They seldom contain more than a few thousand citations, which translates into a mere 5MB or so.
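The arithmetic behind such an estimate is simple enough to sketch. The figures below are assumptions chosen for illustration only (5,000 citations at roughly 1,000 bytes each), not measurements of any real citation list, and the helper function is hypothetical.

```python
def estimated_size_mb(num_records: int, bytes_per_record: int) -> float:
    """Return the raw size, in megabytes (10**6 bytes), of num_records records."""
    return num_records * bytes_per_record / 1_000_000


# Assumed figures: 5,000 citations, roughly 1,000 bytes per citation.
citations_mb = estimated_size_mb(5_000, 1_000)
print(f"citation list: about {citations_mb:.0f} MB")
```

Plugging in your own record counts and record sizes is a quick way to see whether your data base belongs on a single PC or on something larger.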
Consequently, when you think about using a data base in your research, the first question you should ask yourself is how large this data base is going to be. You'll be surprised how difficult it will be to make it truly large. But if you arrive at some moderately impressive numbers here, say, tens of GBs, or some quite impressive numbers like hundreds of GBs, then you should definitely consider using parallel DB2 on the AVIDD cluster - especially if you expect to run a lot of searches and data mining procedures against it.
In this chapter I'm going to show you first how you can set up and operate on a PostgreSQL data base under Cygwin - and the same should apply to the AVIDD cluster, because it is the same data base.
Then I am going to discuss, rather briefly, what you are going to get if you switch to parallel DB2.