Orca Tips

'''NOTE:''' Orca was changed to use the same software environment as Graham. As a result, [[Graham Tips]] will provide more relevant information. '''''Most things here will likely not work.'''''


The following is a list of scripts, functions, and aliases that will make your life on Sharcnet much easier. Section 1.1, [[#Things to consider|Things to consider]], lists recommendations for courteous usage of Sharcnet resources. Section 1.2, [[#A script to handle this for you|A script to handle this for you]], combines these ideas into a short script which makes job submission easy. Section 2, [[#~/.bashrc|~/.bashrc]], lists other useful commands for checking the Sharcnet queue, connecting to the development nodes, or changing directories to a running job. Section 3, [[#Remote Mount of Sharcnet|Remote Mount of Sharcnet]], gives directions on how to mount Sharcnet files onto a Mac so that they appear as if they were local.


== Submitting Jobs with Courteous and Efficient Usage ==
The following are recommendations for courteous usage. They are not required, but they are strongly recommended (especially the first two). If this is too much to remember, the script in the code block below can do all the work for you.
=== Things to consider ===
<!-- Item 1 -->
# If your simulation is using 16 processors or fewer, consider using the '''ppn''' flag.
#* This flag specifies the number of ''processors per node'', and helps to keep your jobs on whole nodes when available (it avoids fragmenting jobs over more nodes than necessary); an example command combining these flags is sketched after this list.
#* Using no more than 16 processors means that your job can fit on a single node, which reduces communication costs and provides a speed-up.
#** If you are using 8 processors, then use '''--ppn=8'''.
<!-- Item 2 -->
# When submitting jobs, consider how much memory you will need.
#* The '''mpp''' flag specifies the ''memory per processor''.
#* If you are using whole nodes ('''--ppn=16'''), then you may as well use all of the memory ('''--mpp=4032M''' for the low-memory nodes, or '''--mpp=8064M''' for the high-memory nodes).
#* If you have not used the '''ppn''' flag, then try to request only as much memory as is needed. This permits other jobs to run on the same node.
<!-- Item 3 -->
# When submitting many jobs, consider linking them using the '''w''' flag.
#* This flag instructs a job to ''wait'' until other jobs have finished.
#** Suppose you want to submit two large (processor-wise) jobs. If the first job has a job ID of 1234567 after it is submitted, then submitting the second job with '''-w 1234567''' means it will not run until the first job has finished.
#** While the first job is running, the second job will continue to build priority, so it is likely to start shortly after the first job finishes (especially if the second job doesn't require more processors than the first).
#*** NOTE: If the first job dies (runs out of time, memory, or has an internal error), then the second job will not run.
#* Note that this flag is only important when the queue is busy and your jobs would otherwise monopolize the available processors.
#** As a rough guide, try to limit yourself to 128 processors at a time, although more can certainly be used when necessary.
#* The '''w''' flag can also be very useful during holidays/conferences/vacations to submit a series of jobs without blocking other users unnecessarily.
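As a concrete sketch of how these flags fit together, the two commands below submit a pair of 16-processor jobs back to back, each packed onto a single node. The queue, job names, executables, and the job ID 1234567 are placeholders; the overall form follows the submit script in the next section.

<syntaxhighlight lang="bash">
# First job: 16 processors on one node, full low-memory-node memory, 7 day limit.
sqsub -q kglamb -f mpi --nompirun \
      -n 16 --ppn=16 --mpp=4032M -r 7d \
      -j run01 -o run01.log -e run01.err \
      mpirun ./run01.x

# Second job: -w makes it wait for the job ID reported for the first
# submission (1234567 here), so the two never run at the same time.
sqsub -q kglamb -f mpi --nompirun \
      -n 16 --ppn=16 --mpp=4032M -r 7d \
      -j run02 -o run02.log -e run02.err \
      -w 1234567 \
      mpirun ./run02.x
</syntaxhighlight>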
 
=== A script to handle this for you ===
All of this is handled for you by the following bash script (I suggest calling it submit.sh, and don't forget to make it executable: <code>chmod +x submit.sh</code>).


Copy this file into the run directory and set the parameters for the given run in the Options section. Then submit the job by typing <code>./submit.sh</code>. Since the file stays with the run, it is a handy way to look up which submission parameters you used (number of processors, memory, etc.).
The options available are:
* QUEUE     - the queue the job will run in; there is no good reason not to use the kglamb queue
* NUM_PROCS - number of processors
* MPP       - memory per processor
* RUNTIME   - run time of the job; must be less than or equal to 7 days
* DELAY     - whether to wait for another job to complete (true or false)
* DELAY_FOR - ID of the job to complete before starting (not used if DELAY=false)
* NAME      - name of the job
* EXE_NAME  - name of the executable file
The '''ppn''' option is not in the script but can easily be added if desired. An analogous script for Matlab can be found at [[MATLAB on SHARCNET]].
Here is the script to put into <code>submit.sh</code>:
<syntaxhighlight lang="bash">
#!/bin/bash
# bash script for submitting a job to the sharcnet queue
#### Options ####
QUEUE=kglamb
NUM_PROCS=128
MPP=3.9g
RUNTIME=7d
DELAY=false
DELAY_FOR=9175185
NAME=grav_cur
EXE_NAME=grav_cur.x
#### OTHER INFO ####
DATE=`date +%Y-%m-%d_%H\h%M`
LOG_NAME="${DATE}.log"
ERR_NAME="${DATE}.err"
#### Submit ####
if [[ -x ${EXE_NAME} ]]; then
    if [[ "${DELAY}" = false ]]; then
        sqsub -q ${QUEUE} \
              -f mpi --nompirun \
              -n ${NUM_PROCS} \
              --mpp=${MPP} \
              -r ${RUNTIME} \
              -j ${NAME} \
              -o ${LOG_NAME} \
              -e ${ERR_NAME} \
              mpirun -mca mpi_leave_pinned 0 ${EXE_NAME}
    elif [[ "${DELAY}" = true ]]; then
        sqsub -q ${QUEUE} \
              -f mpi --nompirun \
              -n ${NUM_PROCS} \
              --mpp=${MPP} \
              -r ${RUNTIME} \
              -j ${NAME} \
              -o ${LOG_NAME} \
              -e ${ERR_NAME} \
              -w ${DELAY_FOR} \
              mpirun -mca mpi_leave_pinned 0 ${EXE_NAME}
    fi
    echo "Submitted ${EXE_NAME} with ${NUM_PROCS} processors"
    echo "          Requested memory:  ${MPP}"
    echo "          Requested runtime: ${RUNTIME}"
    echo "          Log file: ${LOG_NAME}"
else
    echo "Couldn't find ${EXE_NAME} - Try again."
fi
</syntaxhighlight>
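As noted above, the '''ppn''' option is not included in the script, but adding it amounts to one new option and one extra flag per <code>sqsub</code> call. A minimal sketch (the variable name <code>PPN</code> is just a suggestion):

<syntaxhighlight lang="bash">
#### Options ####
PPN=16    # processors per node; packs the job onto whole nodes

# Then, in both sqsub calls of the Submit section, add the flag
# immediately after the -n line:
#     -n ${NUM_PROCS} \
#     --ppn=${PPN} \
</syntaxhighlight>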


== ~/.bashrc ==

Copy the following lines of code into '''~/.bashrc''' to make them available to you at the command line. Note: Thanks to Mike Dunphy and John Yawney for some of these.

=== Checking Job Status and Node Usage ===

* sqa gives a summary of all jobs submitted by the kglamb group
* sqmpi gives a summary of all mpi jobs
* sqh gives a node-by-node summary of the kglamb nodes
* sqm gives a summary of all of the jobs submitted by <userid>
** You will want to replace <userid> with your userid when adding these to '''~/.bashrc'''

<syntaxhighlight lang="bash">
alias sqa='showq -w class=kglamb'
alias sqmpi='showq -w class=mpi'
alias sqh='sqhosts orc361-392'
alias sqm='showq -w user=<userid>'
</syntaxhighlight>
 
For these to work properly, you must also have the following in '''~/.bash_profile''':
<syntaxhighlight lang="bash">
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
</syntaxhighlight>


=== Accessing Development Nodes ===

The development nodes provide an opportunity to run code directly, which can be very useful for development and testing.
The following functions simplify the connection process.

* DevUsage provides a summary of how busy the nodes are in terms of processor usage. Each node has 16 processors, so the lower the number the better: a value below 4 will let commands run at a reasonable pace, while anything over 10 is unbearably slow. See the [https://www.sharcnet.ca/help/index.php/Orca Orca documentation] for more information.
* DevConnect automatically connects to the least used development node.

<syntaxhighlight lang="bash">
# List each development node with its load average, least loaded first
alias DevUsage="pdsh -w orc-dev[1-4] uptime | awk '{print \$1,\$NF}' | sort -n -k 2"

# Connect (with X forwarding) to the least loaded development node
function DevConnect() {
    var=$(pdsh -w orc-dev[1-4] uptime | awk '{print $1,$NF}' | sort -n -k 2)
    var2=${var:0:8}    # first 8 characters of the first line, e.g. orc-dev1
    printf "\n*** Accessing dev node: %s ***\n\n" $var2
    ssh -X $var2
}
export -f DevConnect
</syntaxhighlight>
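The <code>${var:0:8}</code> step works because every development node name (orc-dev1 through orc-dev4) happens to be exactly eight characters long. If you would rather not rely on that, a variant along the following lines (the name DevConnect2 is just a placeholder) extracts the hostname explicitly:

<syntaxhighlight lang="bash">
function DevConnect2() {
    # Take the first (least loaded) line of the sorted list and strip
    # everything from the pdsh colon onwards to recover the hostname.
    node=$(pdsh -w orc-dev[1-4] uptime | awk '{print $1,$NF}' | sort -n -k 2 | head -n 1 | cut -d: -f1)
    printf "\n*** Accessing dev node: %s ***\n\n" ${node}
    ssh -X ${node}
}
export -f DevConnect2
</syntaxhighlight>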
 
=== Moving to a Simulation/Job Directory ===
 
You may find that your directory tree becomes rather involved after a while, so changing into a simulation directory (or remembering its path) can become cumbersome. A useful function is cdJob, which takes you into the working directory for a submitted Sharcnet job, provided that you know the job ID.
 
<syntaxhighlight lang="bash">
function cdJob() {
  pth=$(sqjobs -l $1 | grep working | sed 's/^[^:]*: //g')
  echo "cd-ing to ${pth}"
  cd ${pth}
}
export -f cdJob
</syntaxhighlight>
 
Usage is:
$ cdJob <jobID>
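If you only want to see the path (for example, to copy files out of it) without changing directory, a hypothetical companion built from the same <code>sqjobs</code> pipeline would be:

<syntaxhighlight lang="bash">
# Print the working directory of job $1 without cd-ing into it
# (the name jobDir is just a suggestion)
function jobDir() {
  sqjobs -l $1 | grep working | sed 's/^[^:]*: //g'
}
export -f jobDir

# e.g.:  ls $(jobDir 1234567)
</syntaxhighlight>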
 
== Remote Mount of Sharcnet ==
Sharcnet files can be mounted on a Mac with sshfs so that they appear to be part of your local filesystem. This allows local programs not available on Sharcnet to be used. It is a third way of running Matlab with Sharcnet data (see [[MATLAB on SHARCNET]] for the other two), though it will be slower because data must be transferred through ssh to your local computer each time you read from disk. Some data will be cached, but not enough to avoid this cost. Use this method only for reading 2D slices or small 3D fields. Reading grids of size 512x512 is easily done; larger grids are certainly possible, but they will be slower.
 
To install sshfs on a Mac:
# Go to [https://osxfuse.github.io/ Fuse]
# Install FUSE for macOS x.x.x (under stable releases on right)
# Install SSHFS x.x.x
 
In a terminal, create directories on which to mount the remote folders (this creates directories under your home directory, but you could create them anywhere; adjust the following bash commands accordingly):
mkdir ~/Orca_home
mkdir ~/Orca_work
mkdir ~/Orca_scratch
 
Load the remote directories onto those folders (just replace <USER> with your user ID):
sshfs <USER>@orca.sharcnet.ca: ~/Orca_home
sshfs <USER>@orca.sharcnet.ca:/work/<USER> ~/Orca_work
sshfs -o uid=${UID} <USER>@orca.sharcnet.ca:/scratch/kglamb/<USER> ~/Orca_scratch
 
Those commands are too long to remember, so create aliases in the ~/.bash_profile file (which works like ~/.bashrc):
<syntaxhighlight lang="bash">
alias load_home='sshfs <USER>@orca.sharcnet.ca: ~/Orca_home'
alias load_work='sshfs <USER>@orca.sharcnet.ca:/work/<USER> ~/Orca_work'
alias load_scratch='sshfs -o uid=${UID} <USER>@orca.sharcnet.ca:/scratch/kglamb/<USER> ~/Orca_scratch'
</syntaxhighlight>
 
Now the command <code>load_work</code> will mount the remote directory onto your local machine.
Note that <code>load_scratch</code> mounts your ''kglamb'' scratch, not your independent scratch.
 
It is important to remember to unmount the directories when you no longer need them.
* Ensure that no programs are accessing the mounted files (this includes terminal sessions that are in a mounted directory)
* Move to the directory above the mount (in the above example, <code>cd ~</code>)
* Unmount the directory: <code>umount <directory></code>, e.g. <code>umount Orca_scratch</code>
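If you mount and unmount often, matching aliases for the unmount step can live next to the <code>load_*</code> aliases in ~/.bash_profile. A minimal sketch (the <code>unload_*</code> names are just suggestions):

<syntaxhighlight lang="bash">
# cd out of the mount first (so this shell isn't sitting inside it), then unmount
alias unload_home='cd ~ && umount ~/Orca_home'
alias unload_work='cd ~ && umount ~/Orca_work'
alias unload_scratch='cd ~ && umount ~/Orca_scratch'
</syntaxhighlight>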
