Cluster

From RedwoodCenter
 
= General Information =
The Redwood computing cluster consists of about a dozen somewhat heterogeneous machines, some with graphics cards (GPUs), and one very clever wombat who can optimize your neural network for you if you ask nicely. The typical use cases for the cluster: you have independent jobs that can run in parallel, so several machines complete the task faster even though no single machine may be faster than your own laptop; you have a long-running job that may take a day, and you don't want to leave your laptop on the whole time; your code uses a communication scheme (such as MPI) to have multiple machines work on a problem cooperatively; or you want to run long GPU computations.
In order for the cluster to be useful and well-utilized, it works best for everyone to submit jobs to the queue (see '''SLURM''' further down on this page for details). A job may not start right away, but it will run once its turn comes. Please do not run extended interactive sessions or ssh directly to worker nodes to perform computation.
== Cluster Administration ==

[[ClusterAdmin]] has information about cluster administration.
== Hardware Overview ==

The current hardware and node configuration is listed [https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/ucb-non-supercluster/cortex here].
In addition to the compute nodes, we have a 17TB file server at

  /clusterfs/cortex/users

which is mounted as scratch space.

In brief, we have 14 nodes with over 60 cores and 4 GPUs.
== Getting an account and one-time password service ==
In order to get an account on the cluster, please send an email to Bruno (baolshausen AT berk...edu) with the following information:

    Full Name <emailaddress> desiredusername

Please also include a note about which PI you are working with. Note: the '''desiredusername''' must be 3-8 characters long, so it would be truncated to '''desiredu''' in this case.
'''OTP (One Time Password) Service'''

Once you have a username, follow the instructions [https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/authentication/linotp-usage here] to set up the Google Authenticator application, which generates a one-time password for logging into the cluster.
  
 
== Directory setup ==

=== Home Directory Quota ===
  
 
There is a 10GB quota limit enforced on $HOME directory (/global/home/users/username) usage. Please keep your usage below this limit. There will be NETAPP snapshots in place in this file system, so we suggest you store only your source code and scripts in this area and store all your data under /clusterfs/cortex (see below).

In order to see your current quota and usage, use the following command:

   quota -s
  
=== Data ===
  
 
For large amounts of data, please create a directory

   /clusterfs/cortex/users/username

and store the data inside that directory. Note that unlike the home directory, scratch space is not backed up, and permanence of your data is not guaranteed. There is a total limit of 17 TB for this drive, which is shared by everyone at the Redwood Center.
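As a one-line sketch (assuming your shell's $USER matches your cluster username):

```shell
# one-time setup: create your personal data directory on the scratch filesystem
mkdir -p /clusterfs/cortex/users/$USER
```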
  
 
== Connect ==
  
=== ssh to a login node ===

  ssh -Y username@hpc.brc.berkeley.edu

and use your one-time password.

If you intend to work with a remote GUI session, you can add the -C flag to the command above to enable compression of the data sent through the ssh tunnel.

''' note: please don't use the login nodes for computations (e.g. matlab, python)! '''

==== Google Authenticator App (get a password) ====

* Open the Google Authenticator app
* Enter your personal PIN
* Enter the one-time PIN
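To avoid retyping these flags, you can add an entry to the ~/.ssh/config file on your own machine (these are standard OpenSSH client options; ''brc'' and ''yourusername'' are placeholders):

```
Host brc
    HostName hpc.brc.berkeley.edu
    User yourusername
    ForwardX11 yes
    ForwardX11Trusted yes
    Compression yes
```

After that, ssh brc logs you in with X11 forwarding (the equivalent of -Y) and compression (the equivalent of -C) enabled.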
  
 
=== Setup environment ===
 
* put all your customizations into your .bashrc
* for login shells, .bash_profile is used, which in turn loads .bashrc
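A minimal ~/.bash_profile following this convention (a sketch; merge with whatever you already have there):

```shell
# ~/.bash_profile is read by login shells only.
# Delegate to ~/.bashrc so login and non-login shells
# share a single set of customizations.
if [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
```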
=== Using a Windows machine ===

Windows is not a Unix-based operating system and does not natively interface with a Unix environment. Install the following two pieces of software as a workaround:

* Install a Unix environment emulator to interface directly with the cluster. Cygwin [http://www.cygwin.com] seems to work well. During installation, make sure to install Net -> "openssh"; Editors -> "vim" is also recommended. Then follow the instructions in ''ssh to a login node'' above.
* Install an SFTP/SCP/FTP client to transfer files between the cluster and your local machine. WinSCP [http://www.winscp.net] is recommended. ExpanDrive can also be used to mount the cluster as a network drive on your local machine.
  
 
== Useful commands ==
  
See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.

A full description of our system by the LBL folks is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex
  
=== SLURM  ===

SLURM is our scheduler. It is very important to understand SLURM well in order to have a good time doing research on the cluster. SLURM manages resources on the cluster: it finds resources for your job, and it helps others do the same, so we are not stepping on each other's toes. There are some do's and don'ts when using SLURM.

* Logging in -- when you log in to the cluster, you land on the login node. We do not own the login node and share it with other members of the Berkeley Research Consortium, so it is important not to run anything there *at all*.

* Information on submitting, monitoring, and reviewing jobs can be found here. You can use simple BASH tricks to submit a large number of embarrassingly parallel jobs on the cluster. This is great for parameter sweeps.

* Storage -- every user gets a 10 GB quota gratis from the BRC. This is your home folder, where you land when you log in. In addition, there is a 20TB scratch space (/clusterfs/cortex/scratch) shared by all members of the Redwood Center. We have a log of how much space is being used by each member who writes into the scratch folder at (TODO)

* We have 4 GPU nodes; information on requesting and using them can be found here. When you request a GPU as a resource, you get the whole node along with it.

* We have a debug queue that can be requested for research here
  
* Submitting a Job
  
From the login node, you can submit jobs to the compute nodes using the syntax

   sbatch myscript.sh

where myscript.sh is a shell script containing the call to the executable to be submitted to the cluster. Typically, for a matlab job, it would look like:
  #!/bin/bash -l
  #SBATCH -p cortex
  #SBATCH --time=03:30:00
  #SBATCH --mem-per-cpu=2G
  cd /clusterfs/cortex/scratch/working/dir/for/your/code
  module load matlab/R2013a
  matlab -nodisplay -nojvm -r "mymatlabfunction( parameters); exit"
  exit
  
The --time option defines the walltime of the job, which is an upper bound on the estimated runtime; the job will be killed after this time has elapsed. --mem-per-cpu specifies how much memory the job requires; the default is 1GB per job.
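The same pattern works for other software. For example, a hypothetical SLURM script for a Python job (the module name and script name are assumptions; check module avail for what is actually installed):

```shell
#!/bin/bash -l
#SBATCH -p cortex               # partition (queue)
#SBATCH --time=01:00:00         # walltime: job is killed after this elapses
#SBATCH --mem-per-cpu=2G        # memory per CPU; default is 1GB if omitted
#SBATCH -J myjob                # job name shown by squeue
cd /clusterfs/cortex/scratch/working/dir/for/your/code
module load python              # hypothetical module name
python myscript.py
```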
  
* Monitoring Jobs

Additional options can be passed to sbatch to capture output from running jobs:

    sbatch -o outputfile.txt -e errorfile.txt -J jobdescriptor myscript.sh
  
The output of the job will be piped to outputfile.txt, and any errors (if the job crashes) to errorfile.txt.

* Cluster usage

Use

  squeue

to get a list of pending and running jobs on the cluster. It shows user names, the jobdescriptor passed to sbatch, runtimes, and nodes.
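A few squeue variations that are often handy (these flags are standard SLURM; a sketch, to be run on the cluster):

```shell
squeue -u $USER      # only your own jobs
squeue -p cortex     # only jobs in the cortex partition
squeue --start       # estimated start times for pending jobs
```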
  
To start an interactive session on the cluster (this requires specifying the partition and walltime, as shown here):

   srun -u -p cortex -t 2:0:0 --pty bash -i
  
=== Perceus commands ===

The perceus manual is [http://www.warewulf-cluster.org/portal/book/export/html/7 here].
  
* listing available cluster nodes:

   wwstats
   wwnodes
  
* list cluster usage:

   wwtop
  
* to restrict the scope of these commands to the cortex cluster, add the following line to your .bashrc:

   export NODES='*cortex'
  
* module list
* module avail
* module help

* help pages are [http://lrc.lbl.gov/html/guide.html here]
  
 
=== Finding out the list of occupants on each cluster node ===
  
 
* One can find the list of users on a particular node by ssh-ing into it, e.g.

   ssh n0000.cortex

* After logging into the node, type

   top

* This is useful if you believe someone is abusing the machine and would like to send him/her a friendly reminder.
= Job Management =

In order to coordinate our cluster usage patterns fairly, our cluster uses a job manager known as SLURM. If you are planning to run jobs on the cluster, you should be using SLURM! Learn how [http://redwood.berkeley.edu/wiki/Cluster_Job_Management here].
  
 
= Software =

Information on what software is installed on the cluster and how to access it is [http://redwood.berkeley.edu/wiki/Cluster-Software here].
  
 
== Matlab ==

Matlab instructions are [http://redwood.berkeley.edu/wiki/Cluster-Software#Matlab here].

== Python ==

Python instructions are [http://redwood.berkeley.edu/wiki/Cluster-Software#Python here].
  
= Usage Tips =

Here are some tips on how to effectively use the cluster.
  
== Embarrassingly Parallel Submissions ==

Here is an alternate script to do embarrassingly parallel submissions on the cluster.
 
iterate.sh

  #!/bin/sh
  param2=1.2    # upper bound for Epsilon
  param3=.75    # upper bound for Beta
  # LeapSize
  for i in 14 15 16
  do
      # Epsilon
      for j in $(seq .8 .1 $param2)
      do
          # Beta
          for k in $(seq .65 .01 $param3)
          do
              echo $i,$j,$k
              qsub param_test.sh -v "LeapSize=$i,Epsilon=$j,Beta=$k"
          done
      done
  done
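Since our scheduler is SLURM, the same sweep can be expressed with sbatch instead of qsub. A dry-run sketch (sbatch --export is standard SLURM; remove the leading echo to actually submit):

```shell
#!/bin/sh
# Print the submission command for each (LeapSize, Epsilon, Beta) triple.
for i in 14 15 16; do
    for j in $(seq 0.8 0.1 1.2); do
        for k in $(seq 0.65 0.01 0.75); do
            echo sbatch --export=ALL,LeapSize=$i,Epsilon=$j,Beta=$k param_test.sh
        done
    done
done
```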
  
param_test.sh

  #!/bin/bash
  #PBS -q cortex
  #PBS -l nodes=1:ppn=2:gpu
  #PBS -l walltime=10:35:00
  #PBS -o /global/home/users/mayur/Logs
  #PBS -e /global/home/users/mayur/Errors
  cd /global/home/users/mayur/HMC_reducedflip/
  module load matlab
  echo "Epsilon = ",$Epsilon
  echo "Leap Size = ",$LeapSize
  echo "Beta = ",$Beta
  matlab -nodisplay -nojvm -r "make_figures_fneval_cluster $LeapSize $Epsilon $Beta"
 
 
Once both scripts are in place, run ./iterate.sh to submit the sweep.
  
 
= Support Requests =

* If you have a problem that is not covered on this page, you can send an email to our user list:

   [mailto:redwood_cluster@lists.berkeley.edu redwood_cluster@lists.berkeley.edu]
  
* If you need additional help from the LBL group, send an email to their email list, or visit their [https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/ website]. Please always cc our email list as well.

   [mailto:hpcshelp@lbl.gov hpcshelp@lbl.gov]
  
 
* In urgent cases, you can also email [mailto:kmuriki@lbl.gov Krishna Muriki] (LBL User Services) directly.

''Latest revision as of 00:37, 11 January 2017''