In this section we repeat what we did in section 4.3.4, but this time we use the PBS facility for defining job dependencies. We will have four scripts as before, but the scripts will be simpler and they will not submit other scripts. Instead we tell PBS how our jobs depend on one another, so that PBS waits for the first job to finish before it releases the second job. PBS then waits for the second job to finish before it releases the third, and so on. All jobs are submitted at the same time from a single shell script.
The four jobs, first_1.sh, second_1.sh, third_1.sh and fourth_1.sh, look the same as the jobs in section 4.3.4 (second.sh, third.sh and so on), with the exception that the job submission lines were commented out. The real trickery is in the shell script that does the submissions. Here is the script:
[gustav@bh1 PBS]$ cat submit_1
#!/bin/bash
FIRST=`qsub first_1.sh`
echo $FIRST
SECOND=`qsub -W depend=afterok:$FIRST second_1.sh`
echo $SECOND
THIRD=`qsub -W depend=afterok:$SECOND third_1.sh`
echo $THIRD
FOURTH=`qsub -W depend=afterok:$THIRD fourth_1.sh`
echo $FOURTH
exit 0
[gustav@bh1 PBS]$

Command qsub returns the job ID, which is normally printed on standard output. Here we capture the output of qsub in the variables FIRST, SECOND, THIRD and FOURTH. The second job is submitted with the option

-W depend=afterok:$FIRST

This means that the job is put on hold until the first job has completed with no errors. Only then is the second job released. The third and fourth jobs are treated similarly.
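The same chaining pattern can be written as a loop, which may be handy when the chain grows beyond four jobs. What follows is only a sketch and not the script used in this section: the function name submit_chain and the QSUB variable are introduced here for illustration, and QSUB exists only so that the function can be tried out with a stand-in command on a machine without PBS.

```shell
#!/bin/bash
# Command used for submission. Overriding QSUB is an assumption made for
# dry runs; on a PBS system the default, qsub, is what we want.
QSUB=${QSUB:-qsub}

# submit_chain SCRIPT...
# Submits the scripts in order. Every script after the first is submitted
# with -W depend=afterok:<ID of the previous job>. Prints one job ID per line.
submit_chain() {
    local prev="" id script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job: no dependency.
            id=$($QSUB "$script") || return 1
        else
            # Later jobs: held until the previous job completes without errors.
            id=$($QSUB -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$id"
        prev=$id
    done
}
```

With this definition loaded, the call submit_chain first_1.sh second_1.sh third_1.sh fourth_1.sh should issue the same four qsub commands as submit_1. Unlike submit_1, the function stops at the first submission that fails, so that later jobs are not submitted with an empty dependency.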
Let us run the script and see what happens:
[gustav@bh1 PBS]$ ./submit_1
13876.bh1.avidd.iu.edu
13877.bh1.avidd.iu.edu
13878.bh1.avidd.iu.edu
13879.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qstat | grep gustav
13876.bh1        first            gustav                  0 Q bg
13877.bh1        second           gustav                  0 H bg
13878.bh1        third            gustav                  0 H bg
13879.bh1        fourth           gustav                  0 H bg
[gustav@bh1 PBS]$ qstat -f 13878.bh1
Job Id: 13878.bh1.avidd.iu.edu
    Job_Name = third
    Job_Owner = gustav@bh1.avidd.iu.edu
    job_state = H
    queue = bg
    server = bh1.avidd.iu.edu
    Checkpoint = u
    ctime = Sat Sep 13 14:33:26 2003
    depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
        beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu
    Error_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_err
    Hold_Types = s
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Sep 13 14:33:26 2003
    Output_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_out
    Priority = 0
    qtime = Sat Sep 13 14:33:26 2003
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 00:30:00
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/N/B/gustav,PBS_O_LOGNAME=gustav,
        PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
        in:/usr/local/gm/bin:/usr/lpp/mmfs/bin:/opt/intel/compiler70/ia32/bin:/
        usr/local/maui/bin:/usr/pbs/bin:/usr/pbs/sbin:/opt/pgi/linux86/bin:/N/h
        pc/totalview/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/x
        cat/i686/sbin:/N/B/gustav/bin,PBS_O_MAIL=/var/spool/mail/gustav,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=bh1.avidd.iu.edu,
        PBS_O_WORKDIR=/N/B/gustav/PBS,PBS_O_QUEUE=bg
[gustav@bh1 PBS]$

We have generated four jobs, which were all submitted at roughly the same time. But only the first job is queued, whereas the remaining three jobs are on hold. Requesting the full listing of the third job with qstat -f shows the dependency:

depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
    beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu

The job can be started only after 13877 has completed without errors. Observe that PBS has recognized another dependency, which I have not specified explicitly: after this job, 13878, has completed without errors, job 13879 should be started, i.e., there is another job that depends on this one.
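When only the dependency is of interest, the full listing is rather long to read through. The following sketch picks a single attribute out of the qstat -f output. The helper name qstat_attr is made up for this example, and the parsing rests on two assumptions about the layout: attributes appear as "name = value", and continuation lines of a long value begin with white space and contain no " = " of their own.

```shell
#!/bin/bash
# qstat_attr NAME
# Reads `qstat -f` output on standard input and prints the value of the
# attribute NAME, with wrapped continuation lines joined back together.
qstat_attr() {
    awk -v key="$1" '
        / = / {                          # a new "name = value" line
            grabbing = 0
            if ($1 == key) {
                sub(/^[ \t]*[^ \t]+ = /, "")   # strip the "name = " prefix
                value = $0
                grabbing = 1
            }
            next
        }
        grabbing {                       # continuation of the wanted value
            sub(/^[ \t]+/, "")
            value = value $0
        }
        END { if (value != "") print value }
    '
}
```

A typical use would be qstat -f 13878.bh1 | qstat_attr depend, which should print the afterok and beforeok entries on a single line, separated by a comma.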
The dependency is specified by using the -W option to qsub. The option is generally used for additional attributes, of which dependency is one. The word depend that flags this attribute must be followed by a list of jobs on which the submitted job depends, qualified with types of dependencies, e.g.,

-W depend=afterok:13876.bh1.avidd.iu.edu:13877.bh1.avidd.iu.edu

Here we state that the job can be released from hold only after two preceding jobs, 13876.bh1.avidd.iu.edu and 13877.bh1.avidd.iu.edu, have completed their run without errors.
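When a job depends on several others, the colon-separated list can be assembled from shell variables instead of being typed by hand. The helper below is a minimal sketch: the name depend_option is hypothetical, and it does nothing more than join its arguments in the form shown above.

```shell
#!/bin/bash
# depend_option TYPE JOBID...
# Prints "depend=TYPE:JOBID1:JOBID2:..." for use after qsub's -W option.
depend_option() {
    # We need a dependency type and at least one job ID.
    [ "$#" -ge 2 ] || return 1
    local type=$1
    shift
    # With IFS set to ":" the expansion "$*" joins the job IDs with colons.
    local IFS=:
    echo "depend=${type}:$*"
}
```

For example, depend_option afterok 13876.bh1.avidd.iu.edu 13877.bh1.avidd.iu.edu prints depend=afterok:13876.bh1.avidd.iu.edu:13877.bh1.avidd.iu.edu, so a job that must wait for two earlier ones could be submitted with qsub -W "$(depend_option afterok "$FIRST" "$SECOND")" followed by the script name.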
The jobs get released one after another. This can be seen by running
qstat every now and then:
[gustav@bh1 PBS]$ qstat | grep gustav
13878.bh1        third            gustav           00:00:05 R bg
13879.bh1        fourth           gustav                  0 H bg
[gustav@bh1 PBS]$
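Instead of reading the listing by eye each time, we can condense it to a count of jobs per state. The sketch below assumes the default qstat column order seen in this section (job ID, name, user, time used, state, queue). The function name job_states is introduced here for illustration only.

```shell
#!/bin/bash
# job_states USER
# Reads a default `qstat` listing on standard input and prints, for the
# given user, one line per job state with the number of jobs in that state.
job_states() {
    awk -v user="$1" '
        $3 == user { count[$5]++ }      # column 3 is the user, column 5 the state
        END { for (s in count) print s, count[s] }
    ' | sort                            # sort to get a stable order of states
}
```

Running qstat | job_states gustav at the moment shown above should print "H 1" and "R 1". Wrapped in a loop with a sleep between calls, this gives a simple way of watching the chain drain.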
Eventually everything completes and we are left with four logs in the PBS directory:
[gustav@bh1 PBS]$ cat *_out
/N/gpfs/gustav prepared and cleaned.
Directory /N/gpfs/gustav cleaned.
writing on test
writing 1000 blocks of 1048576 random integers
real    2m42.813s
user    0m40.040s
sys     0m14.760s
-rw-r--r--    1 gustav   ucs      4194304000 Sep 13 14:44 test
File /N/gpfs/gustav/test generated.
reading test
reading in chunks of size 16777216 bytes
allocated 16777216 bytes to junk
read 4194304000 bytes
real    2m51.521s
user    0m0.000s
sys     0m12.600s
File /N/gpfs/gustav/test processed.
[gustav@bh1 PBS]$

Observe that the IO rate is 32 MB/s on writing and only 23 MB/s on reading. This illustrates yet again how much IO can vary depending on the system load and configuration.
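The transfer rates can be checked against the timings in the logs. The sketch below takes 1 MB to mean 1048576 bytes, which is an assumption on our part. It also appears that the 32 MB/s figure for writing is obtained only after the 40 seconds of user time, presumably spent generating the random integers, are subtracted from the elapsed time. That too is an interpretation and not something the logs state. The helper names to_seconds and throughput are made up for this example.

```shell
#!/bin/bash
# to_seconds TIME
# Converts a time in the format printed by bash's time, e.g. 2m42.813s,
# into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# throughput BYTES SECONDS
# Prints the transfer rate in MB/s (1 MB = 1048576 bytes) with one decimal.
throughput() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / 1048576 / s }'
}

to_seconds 2m42.813s              # prints 162.813
throughput 4194304000 171.521     # reading, elapsed time:            23.3
throughput 4194304000 162.813     # writing, elapsed time:            24.6
throughput 4194304000 122.773     # writing, elapsed minus user time: 32.6
```

On elapsed time alone the two rates are nearly the same, about 24.6 MB/s for writing against 23.3 MB/s for reading. The larger gap quoted above shows up only when the time spent computing the data is left out of the writing figure.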