Cluster Job Management

Overview

To coordinate our cluster usage fairly, our cluster uses a job manager known as SLURM. This page provides information about common SLURM usage patterns and cluster etiquette.

See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.

A full description of our system by the LBL folks is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex

Basic job submission

From the login node, you can submit jobs to the compute nodes using the syntax

 sbatch myscript.sh

where myscript.sh is a shell script that executes your program.
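
When the job is accepted, sbatch prints the ID it assigned; you will need this ID to monitor or cancel the job later. A typical exchange looks like this (the job ID shown is illustrative):

 sbatch myscript.sh
 Submitted batch job 123456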

SLURM has several parameters that need to be specified. For example, you might like to specify an output file for your job. This can be done using the -o argument:

 sbatch -o outputfile.txt myscript.sh

Typically, for a Matlab job, myscript.sh would look like:

 #!/bin/bash -l
 # Request the cortex partition, a 3.5 hour walltime, and 2GB of memory per CPU
 #SBATCH -p cortex
 #SBATCH --time=03:30:00
 #SBATCH --mem-per-cpu=2G
 # Run the Matlab function from the job's working directory
 cd /clusterfs/cortex/scratch/working/dir/for/your/code
 module load matlab/R2013a
 matlab -nodisplay -nojvm -r "mymatlabfunction( parameters); exit"
 exit

The --time option defines the walltime of the job, an upper bound on the estimated runtime; the job will be killed once this time has elapsed. The --mem-per-cpu option specifies how much memory the job requires; the default is 1GB.
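
Options embedded in the script as #SBATCH directives can also be supplied, or overridden, on the sbatch command line. For example, to request a longer walltime and more memory for a single run without editing the script (the values here are illustrative):

 sbatch --time=08:00:00 --mem-per-cpu=4G myscript.sh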

  • Monitoring Jobs

Additional options can be passed to sbatch to monitor the output of running jobs:

 sbatch -o outputfile.txt -e errorfile.txt -J jobdescriptor myscript.sh

The output of the job will be written to outputfile.txt, and any errors (for example, if the job crashes) to errorfile.txt. The -J option assigns a job name that appears in queue listings.
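
These options can equivalently be set inside the script itself with #SBATCH directives, which keeps the submission command short:

 #SBATCH -o outputfile.txt
 #SBATCH -e errorfile.txt
 #SBATCH -J jobdescriptor

While a job is running, you can follow its output from the login node with tail -f outputfile.txt.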

  • Cluster usage

Use

 squeue

to get a list of pending and running jobs on the cluster. The listing shows the user name, the job name passed to sbatch with -J, the runtime, and the nodes in use.
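
squeue also accepts filters, which helps on a busy cluster; scancel removes a job from the queue. For example (the job ID passed to scancel is illustrative):

 squeue -u $USER     # show only your own jobs
 squeue -p cortex    # show only jobs in the cortex partition
 scancel 123456      # cancel a job, using the ID reported by squeue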


To start an interactive session on the cluster, you must specify the partition and walltime, as shown here:

 srun -u -p cortex -t 2:0:0 --pty bash -i
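
Once the interactive shell starts on a compute node, you can load modules and run programs just as you would in a batch script, for example (module version as in the batch example above):

 module load matlab/R2013a
 matlab -nodisplay

Exit the shell when you are done so the node is released for other jobs.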

Perceus commands

The Perceus manual is here.

  • listing available cluster nodes:
 wwstats
 wwnodes
  • list cluster usage
 wwtop
  • to restrict the scope of these commands to the cortex cluster, add the following line to your .bashrc:
 export NODES='*cortex'
  • module list: show the modules currently loaded in your environment
  • module avail: show the modules available on the cluster
  • module help: show usage information for the module command
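
A typical workflow with these commands might look like the following sketch (the matlab version matches the batch example above):

 module avail              # see what software is installed
 module load matlab/R2013a # load a specific version
 module list               # confirm that it is now loaded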

Finding out the list of occupants on each cluster node

  • One can find out which users are on a particular node by ssh-ing into the node, e.g.
 ssh n0000.cortex
  • After logging into the node, type
 top
  • This is useful if you believe someone is abusing the machine and would like to send them a friendly reminder.
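
If you prefer a non-interactive check, a one-off command over ssh works too. For example, this lists the heaviest processes on the node together with their owners (node name as in the example above):

 ssh n0000.cortex "ps -eo user:12,pcpu,pmem,comm --sort=-pcpu" | head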