Starting a run

Configure the user environment

On Zaphod, as on many other high-performance clusters, a system called 'modules' lets you switch between different versions of compilers and libraries. Here is an example of how to list the compilers and libraries available on Zaphod:

user@h101:~> module available

------------------------- /usr/local/Modules3.1.6/modulefiles --------------------------

Chombo-2.0/mpich-1.2.7_gcc-3.3.3_p4     null                                                 
CommandTailContainer/gcc-3.3.3          openmpi/1.2.5/gcc-4.1.1                              
cmake-2.6.0                             openmpi/1.2.5/gcc-4.3.2                              
dot                                     openmpi/1.2.8/gcc-4.3.2                              
fftw/2.1.5/gcc-4.1.1                    petsc/2.3.3-p15                                      
fftw/2.1.5/gcc-4.3.2                    pgi/5.2                                              
gcc/4.1.1                               pgi/7.0                                              
gcc/4.3.2                               pgi/7.0-2
gdbClusterUtils/mpich_1.2.7-gcc_3.3.3   pgi/7.1
gm/2.0.22                               pgi/7.1-1
gsl/1.9.0                               pgi/7.1-5
hdf5/1.6.3/gcc-3.3.3/mpich-1.2.7/p4     pgi/7.2-2
hdf5/1.6.3/gcc-3.3.3/mpich-1.2.7/serial pgi32/5.2
hdf5/1.6.3/gcc-4.1.1/gm                 pgi32/7.0
hdf5/1.6.3/gcc-4.1.1/p4                 pgi32/7.0-2
hdf5/1.6.3/gcc-4.1.1/serial             pgi32/7.1
hdf5/1.8.1                              pgi32/7.1-1
module-info                             pgi32/7.1-5
modules                                 pgi32/7.2-2
mpich/gm/1.2.6..14a(default)            pgi64/5.2
mpich/p4/1.2.6                          pgi64/7.0
mpich/p4/1.2.7(default)                 pgi64/7.0-2
mpich/p4/1.2.7-gcc-3.3.3                pgi64/7.1
mpich/p4/1.2.7-gcc3.3.3                 pgi64/7.1-1
mpich/p4/1.2.7-pgi7.0                   pgi64/7.1-5
mpich-gcc/gm/1.2.6..14a-gcc             pgi64/7.2-2
mpich-gcc4/gm/1.2.7..15-gcc4            pgi_32/5.2
mpich-gcc4/p4/1.2.7-gcc4                phdf5/1.8.1
mpich-pgi7/gm/1.2.7..15-pgi7            python/2.5.2
mpich-pgi7/gm/1.2.7..15-pgi7-medium     swig/1.3.38
mpich-pgi7/p4/1.2.7-pgi7                use.own

The following dialog shows loading a compiler and an MPICH library, verifying that they were loaded, purging the selection, and verifying that the purge succeeded.

user@h101:~> module load pgi64/7.2-2 mpich-pgi7/gm/1.2.7..15-pgi7
user@h101:~> module list
Currently Loaded Modulefiles:
  1) pgi64/7.2-2                    2) mpich-pgi7/gm/1.2.7..15-pgi7
user@h101:~> module purge
user@h101:~> module list
No Modulefiles Currently Loaded.
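
If you always use the same toolchain, the selection can be loaded automatically at login. A minimal sketch for a bash startup file, reusing the module names from the dialog above (adjust to your own compiler choice):

# in ~/.bashrc (sketch): start from a clean slate, then load the toolchain
module purge
module load pgi64/7.2-2 mpich-pgi7/gm/1.2.7..15-pgi7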

Create a RUN directory

Runs are identified by the RUN environment variable, which is set during the build process. The user must create a RUN directory and populate it with a "runme" file. For example, if RUN=ggcmrun0001:

$ mkdir $HOME/ggcmrun0001
$ cd $HOME/ggcmrun0001
$ cp /directory/where/runme/exists/runme .

It is often useful to also have the following "clean" script, which wipes a run directory so the build process can start over from scratch. We usually create the file as $HOME/bin/clean, since most users add $HOME/bin to their PATH.

#!/bin/sh
# Remove all build products and intermediate files from the current run directory.
HERE=`pwd`
RUN=`basename ${HERE}`
/bin/rm -r -f BASETIME DIPOLTIME MONITOR PDATA SWMONITOR \
    *~ core geo.h* grid* in.* map.* maxmaps* script.* \
    *.grid *.grid2 *.smf *.ps run *.o a.out *.exe *.f tmp* \
    lall* postsc* grid_include.* gridinfo.* map.dat.* \
    minvar run.tar run.tar.gz *.tar throwerror \
    *.sh *.tar.gz ${RUN}*
ls -l
exit 0
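
Assuming the script is saved as $HOME/bin/clean and $HOME/bin is on your PATH, a typical invocation looks like this (using the example run directory from above):

$ chmod 755 $HOME/bin/clean
$ cd $HOME/ggcmrun0001
$ clean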

Final steps

Finally, the user should create a $HOME/ggcm_data directory, to which data files will be written. It is best to use one of the Zaphod storage nodes (s101-s106) for this purpose. For example, on the head node h101, storage node s105 is NFS-mounted as /mnt/data07.

$ cd $HOME
$ ln -s /mnt/data07 ggcm_data
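
A quick sanity check that the link points where you intend and that the file system has space (using the example mount /mnt/data07):

$ ls -ld $HOME/ggcm_data
$ df -h /mnt/data07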

Execute the runme script

Theoretically, the user should now be ready to build the executable on Zaphod:

$ cd $HOME/$RUN
$ ./runme

After a minute or so, a new $HOME/ggcm_data/$RUN directory is created. This directory contains some input files as well as the executable. For example:

$ ls $HOME/ggcm_data/$RUN
clock45_0006.exe  clock45_0006.smf  script.runtimegraphics
clock45_0006.f    in.clock45_0006

Submit a job to the Zaphod batch system

Currently, Zaphod uses the TORQUE batch system, which is essentially a newer version of the now-unsupported OpenPBS batch system (see ZaphodAdmin:SysAdmin:Batch system for details). You should therefore create a batch script in $HOME/ggcm_data/$RUN to execute $RUN.exe (in this example, clock45_0006.exe). For various reasons, it is best to use mpiexec rather than mpirun (see the Zaphod:Run an MPI job page for general instructions).

Here is an example batch script /home/jdorelli/ggcm_data/test0001/test0001.batch:

#!/bin/csh

#PBS -l nodes=2:ppn=2          # request 2 nodes, 2 processors per node
#PBS -l walltime=00:10:00      # allow 10 minutes of wall-clock time
#PBS -q debug                  # submit to the debug queue

cd $HOME/ggcm_data/test0001
/usr/local/bin/mpiexec ./test0001.exe </dev/null >& test0001.log

Alternatively, one can use a bash wrapper script that generates the batch file and submits it:

#!/bin/bash

if [ $# -eq 0 ]; then              
   echo "Usage: $0 RUNNAME"        
   echo "Missing the run name!"    
   exit 1                           
else                                
   export RUN=$1                    
fi                                  

export FILESYSTEM=/mnt/data0N/YOU   # <--- change this to your data mount and user name
export TOP=$FILESYSTEM/$RUN     
# Calculate how many nodes you need; with 2 processors per node,
# (6*4*2 + 3)/2 = 25.5, which rounds up to:
export NODES=26

export FILE_BATCH=$TOP/$RUN.batch
echo "FILE_BATCH=$FILE_BATCH"
/bin/rm -f $FILE_BATCH

export PBS_O_WORKDIR=${TOP}

cat > $FILE_BATCH <<END_BATCH
#!/bin/bash
#PBS -q long
#PBS -l nodes=${NODES}:ppn=2:myri
date
cd $PBS_O_WORKDIR
/usr/local/bin/mpiexec ${TOP}/${RUN}.exe >& ${TOP}/${RUN}.log
date
exit 0
END_BATCH

chmod 755 $FILE_BATCH
cat       $FILE_BATCH
/bin/ls -l $TOP
qsub      $FILE_BATCH
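
If the wrapper above is saved as, say, submit.sh (the file name is arbitrary), generating and submitting the batch file becomes a single command:

$ sh submit.sh test0001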

For the csh example above, one submits the job by hand:

$ qsub test0001.batch

The code output is redirected to test0001.log.

Checking on your job

There are some nice commands that tell you about the status of your batch jobs.

you@h101:~> qstat -a

h101.cl.unh.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
24137.h101.cl.unh.ed YOU      long     intque.bat    --     22  --    --  999:0 Q   --
24193.h101.cl.unh.ed YOU      long     job_23Mar2  12372     5  --    --  168:0 R 05:27
24241.h101.cl.unh.ed YOU      long     job_23Mar2    --      5  --    --  168:0 Q   --
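
To see only your own jobs, qstat can filter on a user name:

you@h101:~> qstat -u YOU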

you@h101:~> showstate
cluster state summary for Tue Feb  7 13:47:20


    JobName            S      User    Group Procs   Remaining            StartTime
    ------------------ - --------- -------- ----- -----------  -------------------

(A) 8141               R   nbessho  nbessho    16    19:34:04  Tue Feb  7 09:21:24
(B) 8142               R   nbessho  nbessho    16    19:34:04  Tue Feb  7 09:21:24
(C) 8147               R   dpontin    users    32  2:01:41:15  Tue Feb  7 13:28:35
(D) 8099               R      pzhu     pzhu    64  2:21:56:07  Fri Feb  3 11:43:27
(E) 8110               R       kai    users    16  3:02:58:49  Fri Feb  3 16:46:09
(F) 8130               R       djl      djl    34  5:00:29:52  Mon Feb  6 14:17:12
(G) 8119               R   wenhuil  wenhuil    66  5:11:45:49  Mon Feb  6 01:33:09

usage summary:  7 active jobs  122 active nodes

               [0][0][0][0][0][0][0][0][0][1][1][1][1][1][1][1][1][1][1][2]
               [1][2][3][4][5][6][7][8][9][0][1][2][3][4][5][6][7][8][9][0]

Frame      01: [ ][!][ ][!][ ][ ]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Frame      10: [ ][ ][ ][ ][ ][C][C][C][C][C][D][D][D][D][D][D][D][D][!][D]
Frame      11: [D][D][D][D][D][!][D][#][D][#][D][D][D][D][D][D][D][D][C][C]
Frame      12: [C][C][C][C][!][C][C][C][C][C][B][B][B][B][B][B][D][D][B][B]
Frame      13: [G][G][G][G][G][G][!][!][G][G][G][D][D][D][D][D][D][!][G][G]
Frame      14: [G][G][G][G][G][G][G][G][G][G][G][G][G][G][G][G][G][G][G][G]
Frame      15: [G][G][A][A][A][!][A][A][A][A][A][E][E][E][!][E][E][E][!][E]
Frame      16: [!][E]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Frame      17: [ ][ ][ ][ ][ ][ ][ ][ ][#][ ]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Frame      18: [ ][ ][ ][ ][ ][ ][#][ ][ ][F]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Frame      19: [F][#][F][#][F][F][F][F][F][F]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Frame      20: [F][F][F][F][F][F][F][F]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Key:  [?]:Unknown [*]:Down w/Job [#]:Down [ ]:Idle [@] Busy w/No Job [!] Drained

node m128 is down
node m130 is down
node n109 is down
node n117 is down
node n122 is down
node n124 is down
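
Since showstate is a Maui scheduler command, its companion commands are likely installed as well; for example, showq (assuming Maui's client tools are available on the head node) lists the queued and running jobs in priority order:

you@h101:~> showq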

The Subversion repository on Artemis has a script that may be useful to you when working on a project. The idea is that a project will have multiple runs associated with it, so the cycle script lets you dry-run the compile-and-submit process. Experienced users who are modifying the OpenGGCM source will find the dry-run feature especially useful.

svn list svn+ssh://YourUserName@artemis.sr.unh.edu/usr/svn/cluster_scripts 

Grab the one called cycle.sh. Note that svn checkout works only on directories, so export the single file instead:

svn export svn+ssh://YourUserName@artemis.sr.unh.edu/usr/svn/cluster_scripts/cycle.sh

The command

sh cycle.sh -h

will run the script and print its help message.

More to come.