The Redwood computing cluster consists of about a dozen somewhat heterogeneous machines, some with graphics cards (GPUs), and one very clever wombat who can optimize your neural network for you if you ask nicely. Typical use cases for the cluster are:
- You have independent jobs that can run in parallel, so several machines will complete the task faster, even though any one machine might not be faster than your own laptop.
- You have a long-running job that may take a day, and you don't want to leave your laptop on the whole time and be unable to use it.
- Your code uses a communication scheme (such as MPI) to have multiple machines work cooperatively on a problem.
- You want to run long GPU computations.
In order for the cluster to be useful and well-utilized, it works best for everyone to submit jobs TODO (see SLURM further down on this page for the details) to the queue. A job may not start right away, but will get run once its turn comes. Please do not run extended interactive sessions or ssh directly to worker nodes for performing computation.
The current hardware and node configuration is listed here.
In addition to the compute nodes we own a file server TODO
which is mounted as scratch space.
In brief, we have 14 nodes with over 60 cores and 4 GPUs.
Getting an account and one-time password service
In order to get an account on the cluster, please send an email to Bruno (baolshausen AT berk...edu) with the following information:
Full Name <emailaddress> desiredusername
Please also include a note about which PI you are working with. Note: the desiredusername must be 3-8 characters long, so it would have been truncated to desiredu in this case.
OTP (One Time Password) Service
Once you have a username, you will need to follow the instructions found here to set up the Pledge application, which gives you a one-time password for logging into the cluster (see Installing and Configuring the OTP Token).
Home Directory Quota
A 10GB quota is enforced on your $HOME directory (/global/home/users/username). Please keep your usage below this limit. There will be NetApp snapshots in place on this file system, so we suggest you store only your source code and scripts in this area and keep all your data under /clusterfs/cortex (see below).
In order to see your current quota and usage, use the following command: TODO
For large amounts of data, please create a directory
and store the data inside that directory. Note that unlike the home directory, scratch space is not backed up and permanence of your data is not guaranteed. There is a total limit of 4 TB for this drive that is shared by everyone at the Redwood center.
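Because the scratch drive is shared, it helps to check how much space your own directory occupies. The sketch below uses the generic `du` tool, not the cluster's quota command; the function name `usage_kb` is hypothetical, and the path in the usage comment is a placeholder.

```shell
#!/bin/bash
# usage_kb DIR: print the total size of DIR in kilobytes (hypothetical helper).
usage_kb() {
    # du -s summarises the whole tree; -k reports sizes in 1 KB blocks.
    du -sk "$1" | awk '{print $1}'
}

# Usage (placeholder path; substitute your own directory on the scratch drive):
#   usage_kb /clusterfs/cortex/scratch/<your-directory>
```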
Pledge App (get a password)
- Run the pledge app and click "Generate one-time password"
- Enter your PIN and press "Enter"
- The application will present your 7-digit one-time password
ssh to a login node
ssh -Y firstname.lastname@example.org
and use your one-time password.
If you intend to work with a remote GUI session, you can add the -C flag to the command above to compress the data sent through the ssh tunnel.
note: please don't use the login nodes for computations (e.g. matlab, python)!
- put all your customizations into your .bashrc
- for login shells, .bash_profile is used, which in turn loads .bashrc
Using a Windows machine
Windows is not a Unix-based operating system and does not natively interface with a Unix environment. Download the following two pieces of software as a workaround:
- Install a Unix environment emulator to interface directly with the cluster. Cygwin seems to work well. During installation, make sure to install Net -> "openssh"; Editors -> "vim" is also recommended. Then you can follow the instructions in "ssh to a login node" above.
- Install an SFTP/SCP/FTP client to allow for file sharing between the cluster and your local machine. WinSCP is recommended. ExpanDrive can also be used to create a cluster-based network drive on your local machine.
See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.
Full description of our system by the LBL folks is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex
SLURM is our scheduler. You need to understand SLURM well to have a good time doing research on the cluster. SLURM acts as the cluster's administrator: it helps you find resources for your job, and it helps others do the same, so that we are not stepping on each other's toes. There are some do's and don'ts when using SLURM.
- Logging in -- when you log in to the cluster, you land on the login node. We do not own the login node; we share it with other members of the Berkeley Research Consortium. So it is important not to run anything here *at all*.
- Information on Submitting, Monitoring, Reviewing Jobs can be found here. You can do many simple BASH tricks to submit a large number of embarrassingly parallel jobs on the cluster. This is great for parameter sweeps.
- Storage -- every user gets a 10 GB quota gratis from the BRC. This is your home folder, where you land when you log in. In addition to this, there is a 20TB scratch space (/clusterfs/cortex/scratch) shared by all members of the Redwood Center. We have a log of how much space is being used by each member who writes into the scratch folder at (TODO)
- We have 4 GPU nodes and information on requesting and using them can be found here. When you request a GPU as a resource, you get the whole node along with it.
- We have a debug queue that can be requested for research here
- Submitting a Job
From the login node, you can submit jobs to the compute nodes using the syntax
sbatch myscript.sh
where myscript.sh is a shell script containing the call to the executable to be submitted to the cluster. Typically, for a matlab job, it would look like
#!/bin/bash -l
#SBATCH -p cortex
#SBATCH --time=03:30:00
#SBATCH --mem-per-cpu=2G
cd /clusterfs/cortex/scratch/working/dir/for/your/code
module load matlab/R2013a
matlab -nodisplay -nojvm -r "mymatlabfunction( parameters); exit"
exit
--time defines the walltime of the job, which is an upper bound on the estimated runtime. The job will be killed once this time has elapsed. --mem-per-cpu specifies how much memory the job requires per CPU; the default is 1GB per job.
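If you estimate runtimes in minutes, a small helper can produce the HH:MM:SS string that --time expects. This is only a sketch: the function name `walltime_from_minutes` is hypothetical, and it assumes whole minutes.

```shell
#!/bin/bash
# walltime_from_minutes N: format N minutes as HH:MM:SS for sbatch --time.
# (Hypothetical helper; sbatch itself only sees the resulting string.)
walltime_from_minutes() {
    local total=$1
    printf '%02d:%02d:00\n' $((total / 60)) $((total % 60))
}

walltime_from_minutes 210   # prints 03:30:00, the limit used in the script above
```

The result can be passed on the command line, for example `sbatch --time="$(walltime_from_minutes 210)" myscript.sh`.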
- Monitoring Jobs
Additional options can be passed to sbatch to monitor outputs from the running jobs
sbatch -o outputfile.txt -e errorfile.txt -J jobdescriptor myscript.sh
The job's standard output will be written to outputfile.txt, and any error messages (for example, if the job crashes) to errorfile.txt.
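When many jobs are submitted, fixed log file names overwrite each other. SLURM expands `%j` in the -o and -e file names to the job ID, which keeps the logs apart. The wrapper below is a sketch under assumptions: the function name `submit_with_logs` and the `SUBMIT` switch are hypothetical, and by default it only prints the command it would run.

```shell
#!/bin/bash
# submit_with_logs SCRIPT NAME: build an sbatch call whose log files carry
# the job ID (%j is expanded by SLURM). Hypothetical wrapper; it only prints
# the command unless SUBMIT=1 is set in the environment.
submit_with_logs() {
    local script=$1 name=$2
    local cmd=(sbatch -J "$name" -o "${name}_%j.out" -e "${name}_%j.err" "$script")
    if [ "${SUBMIT:-0}" = 1 ]; then
        "${cmd[@]}"
    else
        echo "${cmd[*]}"
    fi
}

submit_with_logs myscript.sh sweep
```

Run it once without SUBMIT to check the command, then run it again with `SUBMIT=1` on the login node to submit.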
- Cluster usage
to get a list of pending and running jobs on the cluster. It will show user names, the jobdescriptor passed to sbatch, runtime, and nodes.
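In SLURM, the job list is normally obtained with `squeue`; that this is the command intended here is an assumption. The sketch below restricts the listing to the cortex partition with -p and optionally to one user with -u. The wrapper name `show_queue` is hypothetical.

```shell
#!/bin/bash
# show_queue [USER]: list pending and running jobs in the cortex partition,
# optionally for a single user. Hypothetical wrapper around squeue.
show_queue() {
    local args=(-p cortex)
    if [ -n "${1:-}" ]; then
        args+=(-u "$1")
    fi
    if command -v squeue >/dev/null 2>&1; then
        squeue "${args[@]}"
    else
        # Off the cluster there is no squeue, so just show the command.
        echo "squeue ${args[*]}"
    fi
}

# Usage on the login node:
#   show_queue            # everyone's jobs in the cortex partition
#   show_queue "$USER"    # only your own jobs
```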
To start an interactive session on the cluster (this requires specifying the partition and walltime, as shown here):
srun -u -p cortex -t 2:0:0 --pty bash -i
The perceus manual is here
- listing available cluster nodes:
- list cluster usage
- to restrict the scope of these commands to cortex cluster, add the following line to your .bashrc
- module list
- module avail
- module help
- help pages are here
Finding out the list of occupants on each cluster node
- One can find out the list of users using a particular node by ssh into the node, e.g.
- After logging into the node, type
- This is useful if you believe someone is abusing the machine and would like to send them a friendly reminder.
ClusterAdmin has information about cluster administration.
In order to coordinate our cluster usage patterns fairly, our cluster uses a job manager known as SLURM. If you are planning to run jobs on the cluster, you should be using SLURM! Learn how here.
Information on what software is installed on the cluster and how to access it is here.
Matlab instructions are here.
Python instructions are here.
Usage Tips TODO
Here are some tips on how to effectively use the cluster.
Embarrassingly Parallel Submissions
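Here is an alternate pair of scripts for embarrassingly parallel submissions on the cluster. The first, iterate.sh, loops over the parameter values and submits one job per combination; the second, param_test.sh, is the job script it submits.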
#!/bin/sh
#Leap Size
param1=11
param2=1.2
param3=.75
#LeapSize
for i in 14 15 16
do
    #Epsilon
    for j in $(seq .8 .1 $param2);
    do
        #Beta
        for k in $(seq .65 .01 $param3);
        do
            echo $i,$j,$k
            qsub param_test.sh -v "LeapSize=$i,Epsilon=$j,Beta=$k"
        done
    done
done
#!/bin/bash
#PBS -q cortex
#PBS -l nodes=1:ppn=2:gpu
#PBS -l walltime=10:35:00
#PBS -o /global/home/users/mayur/Logs
#PBS -e /global/home/users/mayur/Errors
cd /global/home/users/mayur/HMC_reducedflip/
module load matlab
echo "Epsilon = ",$Epsilon
echo "Leap Size = ",$LeapSize
echo "Beta = ",$Beta
matlab -nodisplay -nojvm -r "make_figures_fneval_cluster $LeapSize $Epsilon $Beta"
Now run ./iterate.sh
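The scripts above use the PBS `qsub` command. With SLURM, the same kind of sweep can be sketched with `sbatch --export`, which passes variables into the job's environment. Several things here are assumptions: param_test.sh would need its #PBS directives replaced by #SBATCH ones, the Epsilon and Beta lists are shortened for illustration, and the `submit_sweep` function and `SUBMIT` switch are hypothetical names. By default the commands are only printed.

```shell
#!/bin/bash
# Sketch of a parameter sweep using sbatch instead of qsub.
# Commands are only printed unless SUBMIT=1 is set in the environment.
submit_sweep() {
    local i j k
    for i in 14 15 16; do                # LeapSize
        for j in 0.8 1.0 1.2; do         # Epsilon (illustrative values)
            for k in 0.65 0.70 0.75; do  # Beta (illustrative values)
                local cmd=(sbatch -p cortex
                           --export="ALL,LeapSize=$i,Epsilon=$j,Beta=$k"
                           param_test.sh)
                if [ "${SUBMIT:-0}" = 1 ]; then
                    "${cmd[@]}"
                else
                    echo "${cmd[*]}"
                fi
            done
        done
    done
}

submit_sweep
```

Explicit value lists are used instead of `seq` with fractional steps so that the set of submitted jobs does not depend on floating-point rounding or locale settings.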
- If you have a problem that is not covered on this page, you can send an email to our user list:
- If you need additional help from the LBL group, send an email to their email list. Please always cc our email list as well. Or visit their website.
- In urgent cases, you can also email Krishna Muriki (LBL User Services) directly.