Cluster Job Management
To share the cluster fairly, we use a job manager called SLURM. This page describes common SLURM usage patterns and cluster etiquette.
See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.
A full description of our system, written by the LBL folks, is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex
Basic job submission
From the login node, you can submit jobs to the compute nodes using the syntax sbatch myscript.sh, where myscript.sh is a shell script that executes your program. SLURM accepts several parameters that control how the job runs. For example, you might like to specify an output file for your job. This can be done using the -o argument:
sbatch -o outputfile.txt myscript.sh
Remembering to include all the arguments you need can get cumbersome. An easier way is to include the SLURM parameters in the header of the script that you submit. This is done by putting one or more lines starting with #SBATCH near the start of your file. For example, here is a simple script that specifies the output file in the script (rather than as a command-line argument):
#!/bin/bash -l
#SBATCH -o outputfile.txt
...
where ... is the body of your script.
Common SLURM arguments
It's a good idea to assign your job a name. This makes your job identifiable when using other commands such as squeue (see below). To do this, pass the name you want to sbatch with the -J argument.
Good cluster etiquette dictates that you set a walltime (i.e. an upper bound on how long the job can run) on your script. This helps SLURM schedule jobs fairly. For example, to limit your job to one hour of execution time, use -t 1:00:00.
In addition to capturing the standard output of your process (-o), you can also capture its standard error output with -e followed by a file name (e.g. if running the job causes compile errors).
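Putting these arguments together, the following is a minimal sketch of a job script that sets the name, walltime, output file, and error file in its header. The job name, the file names, and the echo command in the body are hypothetical placeholders, and the submission step is skipped when sbatch is not installed (for example, when trying this outside the cluster).

```shell
#!/bin/bash
# Write a small job script whose header carries the SLURM options.
# All names below (myjob, myjob.out, myjob.err) are placeholders.
cat > myjob.sh <<'EOF'
#!/bin/bash -l
# Job name shown by squeue.
#SBATCH -J myjob
# Walltime limit of one hour.
#SBATCH -t 1:00:00
# Files that receive standard output and standard error.
#SBATCH -o myjob.out
#SBATCH -e myjob.err
echo "Job running on $(hostname)"
EOF

# Submit only where SLURM is available, e.g. on the login node.
if command -v sbatch >/dev/null 2>&1; then
    sbatch myjob.sh
else
    echo "sbatch not found; copy myjob.sh to the login node and submit it there"
fi
```

Keeping the options in the header means the plain command sbatch myjob.sh is enough to resubmit the job later with the same settings.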
Use squeue to get a list of pending and running jobs on the cluster. It shows the user name, the job descriptor passed to sbatch, the runtime, and the nodes for each job.
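The full list can get long on a busy cluster. As a sketch, squeue's -u and -p options narrow the output to one user or one partition. The partition name cortex is taken from the interactive-session command on this page, and the guard is only there so the snippet does nothing on a machine without SLURM.

```shell
#!/bin/bash
# Use the current login name; fall back to `id -un` if $USER is unset.
me="${USER:-$(id -un)}"

if command -v squeue >/dev/null 2>&1; then
    # Only my own pending and running jobs.
    squeue -u "$me"
    # Every job in the cortex partition.
    squeue -p cortex
else
    echo "squeue not found; run this on the login node"
fi
```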
To start an interactive session on the cluster (this requires specifying the cluster and walltime, as shown here):
srun -u -p cortex -t 2:0:0 --pty bash -i
The perceus manual is here
- list available cluster nodes:
- list cluster usage:
- to restrict the scope of these commands to the cortex cluster, add the following line to your .bashrc:
- module list
- module avail
- module help
- help pages are here
Finding out the list of occupants on each cluster node
- One can find out which users are using a particular node by logging into the node with ssh, e.g.
- After logging into the node, type
- This is useful if you believe someone is abusing the machine and would like to send them a friendly reminder.
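One possible way to get that list once you are logged in is to count processes per user. This is a minimal sketch that assumes the standard ps, sort, and uniq tools are present on the node; it is not necessarily the command this page originally recommended.

```shell
#!/bin/bash
# Run this on the compute node itself, after logging in with ssh.
# ps -A -o user= prints the owner of every process, one per line, with no header.
# sort | uniq -c counts processes per user; sort -rn puts the heaviest user first.
users=$(ps -A -o user= | sort | uniq -c | sort -rn)
echo "$users"
```

Each output line is a process count followed by a user name, so system accounts such as root will appear alongside real users.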