Cluster Job Management
Overview
To coordinate cluster usage fairly, our cluster uses a job manager known as SLURM. This page provides some information about common SLURM usage patterns and cluster etiquette.
See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.
Full description of our system by the LBL folks is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex
SLURM usage
- Submitting a Job
From the login node, you can submit jobs to the compute nodes using the syntax
sbatch myscript.sh
where myscript.sh is a shell script containing the call to the executable to be submitted to the cluster. Typically, for a MATLAB job, it would look like
#!/bin/bash -l
#SBATCH -p cortex
#SBATCH --time=03:30:00
#SBATCH --mem-per-cpu=2G
cd /clusterfs/cortex/scratch/working/dir/for/your/code
module load matlab/R2013a
matlab -nodisplay -nojvm -r "mymatlabfunction(parameters); exit"
exit
The --time option defines the walltime of the job, an upper bound on the estimated runtime; the job is killed once this time has elapsed. The --mem-per-cpu option specifies how much memory the job requires; the default is 1 GB per job.
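For instance, a job that needs a longer walltime and more memory might use a script like the one below; the 12-hour limit, the 4 GB per CPU, and the program name my_executable are placeholders for illustration, not recommended values.

#!/bin/bash -l
#SBATCH -p cortex
# request 12 hours of walltime and 4 GB of memory per CPU (example values)
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=4G
cd /clusterfs/cortex/scratch/working/dir/for/your/code
./my_executable
exit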
- Monitoring Jobs
Additional options can be passed to sbatch to capture the output of a running job:

sbatch -o outputfile.txt -e errorfile.txt -J jobdescriptor myscript.sh

The standard output of the job is written to outputfile.txt, and any error messages (for example, if the job crashes) go to errorfile.txt. The -J flag attaches a short job descriptor that shows up in the queue.
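If you submit the same script many times, it can help to embed the SLURM job ID in the file names via the %j placeholder so runs do not overwrite each other; the file names below are only examples:

sbatch -o job-%j.out -e job-%j.err -J jobdescriptor myscript.sh

You can then follow a running job with tail -f on the corresponding job-<jobid>.out file.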
- Cluster usage
Use
squeue
to get a list of pending and running jobs on the cluster. It shows user names, the job descriptor passed to sbatch, the runtime, and the nodes in use.
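squeue also accepts filters, which helps on a busy cluster; for example, to list only your own jobs, or only the jobs in the cortex partition:

squeue -u $USER
squeue -p cortex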
To start an interactive session on the cluster (this requires specifying the partition and a walltime, as shown here):
srun -u -p cortex -t 2:0:0 --pty bash -i
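The usual srun options apply to interactive sessions too, so you can request more memory for the shell if needed; the 4 GB figure below is just an example:

srun -u -p cortex -t 2:0:0 --mem-per-cpu=4G --pty bash -i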
Perceus commands
The Perceus manual is at http://www.warewulf-cluster.org/portal/book/export/html/7
- listing available cluster nodes:

wwstats
wwnodes
- list cluster usage
wwtop
- to restrict the scope of these commands to the cortex cluster, add the following line to your .bashrc:
export NODES='*cortex'
- module list (shows the modules currently loaded in your environment)
- module avail (shows the modules available on the cluster)
- module help (shows help for the module command; see the usage example after this list)
- help pages are at http://lrc.lbl.gov/html/guide.html
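A typical module workflow, using the same MATLAB module as in the job script above (module names on your system may differ), looks like:

module avail matlab
module load matlab/R2013a
module list

The first command lists modules matching "matlab", the second loads a specific version, and the third confirms what is now loaded.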
Finding out the list of occupants on each cluster node
- One can find out which users are on a particular node by ssh'ing into the node, e.g.
ssh n0000.cortex
- After logging into the node, type
top
- This is useful if you believe someone is abusing the machine and you would like to send them a friendly reminder.
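If you only want a quick, non-interactive summary of whose processes are running on a node, something like the following (run from the login node, using the example node name n0000.cortex from above) also works:

ssh n0000.cortex "ps -eo user=" | sort | uniq -c | sort -rn

This counts the processes owned by each user on that node.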