Difference between revisions of "Cluster"

From RedwoodCenter

Revision as of 17:07, 14 August 2015

General Information

The Redwood computing cluster consists of about a dozen somewhat heterogeneous machines, some with graphics cards (GPUs). There are three typical use cases. First, you have many independent jobs that can run in parallel, so several machines will finish the work faster even though any one machine might not be faster than your own laptop. Second, you have a long-running job that may take a day, and you don't want to leave your laptop on and tied up the whole time. Third, your code uses a communication scheme (such as MPI) to have multiple machines work cooperatively on one problem.

In order for the cluster to be useful and well-utilized, it works best for everyone to submit jobs to the queue (see sbatch under "SLURM usage" further down on this page for the details). A job may not start right away, but it will run once its turn comes. Please do not run extended interactive sessions or ssh directly to worker nodes to perform computation.

ClusterAdmin has information about cluster administration.

Hardware Overview

The current hardware and node configuration is listed here.

In addition to the compute nodes we own a file server

 NetOp 4TB

which is mounted as scratch space.

Getting an account and one-time password service

In order to get an account on the cluster, please send an email to Bruno (baolshausen AT berk...edu) with the following information:

   Full Name <emailaddress> desiredusername

Please also include a note about which PI you are working with. Note: the desiredusername must be 3-8 characters long, so in this example it would be truncated to desiredu.

OTP (One Time Password) Service

Once you have a username, you will need to follow the instructions found here to set up the Pledge application, which gives you a one-time password for logging into the cluster (see Installing and Configuring the OTP Token).

Directory setup

Home Directory Quota

There is a 10GB quota limit enforced on $HOME directory (/global/home/users/username) usage. Please keep your usage below this limit. There will be NETAPP snapshots in place in this file system so we suggest you store only your source code and scripts in this area and store all your data under /clusterfs/cortex (see below).

To see your current quota and usage, use the following command:

 quota -s


For large amounts of data, please create a directory


and store the data inside that directory. Note that unlike the home directory, scratch space is not backed up and permanence of your data is not guaranteed. There is a total limit of 4 TB for this drive that is shared by everyone at the Redwood center.


Pledge App (get a password)

  • Run the pledge app and click "Generate one-time password"
  • Enter your PIN and press "Enter"
  • The application will present your 7-digit one-time password

ssh to a login node

 ssh -Y username@hpc.brc.berkeley.edu

and use your one-time password.

If you intend to work with a remote GUI session, you can add the -C flag to the command above to compress the data sent through the ssh tunnel.

note: please don't use the login nodes for computations (e.g. matlab, python)!

Setup environment

  • put all your customizations into your .bashrc
  • for login shells, .bash_profile is used, which in turn loads .bashrc

Using a Windows machine

Windows is not a Unix-based operating system and as a result does not natively interface with a Unix environment. Download the following two pieces of software as a workaround:

  • Install a Unix environment emulator to interface directly with the cluster. Cygwin [1] seems to work well. During installation make sure to install Net -> "openssh". Editors -> "vim" is also recommended. Then you can follow the instructions in "ssh to a login node" above.
  • Install an SFTP/SCP/FTP client to allow for file sharing between the cluster and your local machine. WinSCP [2] is recommended. ExpanDrive can also be used to create a cluster-based network drive on your local machine.

Useful commands

See https://sites.google.com/a/lbl.gov/high-performance-computing-services-group/scheduler/ucb-supercluster-slurm-migration for a detailed FAQ on the SLURM job manager.

Full description of our system by the LBL folks is at http://go.lbl.gov/hpcs-user-svcs/ucb-supercluster/cortex

SLURM usage

  • Submitting a Job

From the login node, you can submit jobs to the compute nodes using the syntax

 sbatch myscript.sh

where myscript.sh is a shell script containing the call to the executable to be submitted to the cluster. Typically, for a matlab job, it would look like

 #!/bin/bash -l
 #SBATCH -p cortex
 #SBATCH --time=03:30:00
 #SBATCH --mem-per-cpu=2G
 cd /clusterfs/cortex/scratch/working/dir/for/your/code
 module load matlab/R2013a
 matlab -nodisplay -nojvm -r "mymatlabfunction( parameters); exit"

--time defines the walltime of the job, which is an upper bound on its runtime; the job will be killed once this time has elapsed. --mem-per-cpu specifies how much memory the job requires per CPU; if it is omitted, the default is 1GB per job.

  • Monitoring Jobs

Additional options can be passed to sbatch to monitor outputs from the running jobs

   sbatch -o outputfile.txt -e errorfile.txt -J jobdescriptor myscript.sh

The output of the job will be written to outputfile.txt, and any error messages (for example if the job crashes) to errorfile.txt.
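
The same options can also be set inside the script as #SBATCH directives, so they do not have to be repeated on every submission. The sketch below writes such a script to a file and submits it only where sbatch is available. The file name myjob.sh, the job name myjob, the ten-minute walltime and the echo payload are placeholders; %j is sbatch's filename pattern for the job ID.

```shell
#!/bin/bash
# Write a small job script whose output options live in #SBATCH directives.
job_script="myjob.sh"   # placeholder file name

cat > "$job_script" <<'EOF'
#!/bin/bash -l
#SBATCH -p cortex
#SBATCH --time=00:10:00
#SBATCH -J myjob
#SBATCH -o myjob_%j.out
#SBATCH -e myjob_%j.err
echo "running on $(hostname)"
EOF

# Submit only where SLURM is installed (i.e. on the login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "sbatch not found; wrote $job_script without submitting"
fi
```

Using %j in the file names keeps the logs of repeated submissions from overwriting each other.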

  • Cluster usage



to get a list of pending and running jobs on the cluster. It will show user names, the jobdescriptor passed to sbatch, runtime and nodes.
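
One command that produces this kind of listing in a standard SLURM installation is squeue. A minimal sketch, assuming the stock squeue client and the cortex partition name used in the sbatch examples; the helper name show_queue is made up for illustration.

```shell
#!/bin/bash
# List pending and running jobs, assuming the standard SLURM squeue client.
show_queue() {
    if command -v squeue >/dev/null 2>&1; then
        squeue -p cortex "$@"      # jobs in the cortex partition
    else
        echo "squeue not found; run this on the login node" >&2
        return 1
    fi
}

show_queue || true                 # whole partition
# show_queue -u "$USER"            # only your own jobs
```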

Perceus commands

The perceus manual is here

  • listing available cluster nodes:
  • list cluster usage
  • to restrict the scope of these commands to cortex cluster, add the following line to your .bashrc
 export NODES='*cortex'
  • module list
  • module avail
  • module help

Finding out the list of occupants on each cluster node

  • One can find out the list of users using a particular node by ssh into the node, e.g.
 ssh n0000.cortex
  • After logging into the node, type
  • This is useful if you believe someone is abusing the machine and would like to send him/her a friendly reminder.



Start an interactive session on the cluster (requires specifying the cluster and walltime as is shown here):

 srun -u -p cortex -t 2:0:0 --pty bash -i

In order to use matlab, you have to load the matlab environment:

 module load matlab/R2013a

Once the matlab environment is loaded, you can start a matlab session by running

 matlab -nodesktop

An example SLURM script for running matlab code is

 #!/bin/bash -l
 #SBATCH -p cortex
 #SBATCH --time=03:30:00
 #SBATCH --mem-per-cpu=2G
 module load matlab/R2013a
 matlab -nodesktop -r "scriptname $variable1 $variable2"

The above script runs the matlab function scriptname and passes it the shell variables $variable1 and $variable2 as arguments. With this command syntax matlab receives them as strings.

If you would like to see who is using matlab licenses, enter



Anaconda Python Distribution

The Anaconda Python 2.7 or 3.4 distributions can be loaded through

 module load python/anaconda2/anaconda2

or

 module load python/anaconda3/anaconda3

respectively. This distribution has NumPy and SciPy built against the Intel MKL BLAS library (multicore BLAS). You will need to get an academic license from Continuum and copy it to the cluster.

On the cluster

 mkdir .continuum

On the machine where you downloaded the license file

 scp file_name username@hpc.brc.berkeley.edu:/global/home/users/username/.continuum/.

Local Install of Anaconda Python Distribution

If you want to manage your own python distribution, Anaconda Python is a very good choice. To get it, go to the Continuum downloads page and select the linux distribution (penguin). Copy the download link address, and then in a terminal on the cluster run:

 wget paste_link_here

This should download a .sh file that can be run with

 bash Anaconda-version_info.sh


CUDA is a library to use the graphics processing units (GPU) on the graphics card for general-purpose computing. We have a separate wiki page to collect information on how to do general-purpose computing on the GPU: GPGPU. The --constraint={cortex_k40, cortex_fermi} option must be used in order to schedule a node with a GPU. We have installed the CUDA 6.5 driver and toolkit.

In order to use CUDA, you have to load the CUDA environment:

 module load cuda
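
Putting the two steps together, here is a minimal sketch of a batch script that requests a GPU node and loads CUDA. The file name gpu_job.sh, the one-hour walltime and the choice of cortex_k40 over cortex_fermi are placeholders, and calling nvidia-smi assumes the NVIDIA driver utility is on the path of the GPU nodes.

```shell
#!/bin/bash
# Sketch: batch script that requests a GPU node via --constraint and loads CUDA.
gpu_script="gpu_job.sh"   # placeholder file name

cat > "$gpu_script" <<'EOF'
#!/bin/bash -l
#SBATCH -p cortex
#SBATCH --time=01:00:00
#SBATCH --constraint=cortex_k40
module load cuda
nvidia-smi
EOF

echo "wrote $gpu_script; submit it with: sbatch $gpu_script"
```

For an interactive GPU session, the same --constraint option can be added to the srun command shown earlier on this page.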

Using Theano

By default, Theano expects the default compiler to be gcc, so you'll need to unload the intel compiler.

 module unload intel

Theano caches certain compiled libraries and these will sometimes cause errors when Theano gets updated. If you are experiencing problems with Theano, you can try clearing the cache with

 theano-cache clear

and if you still have problems you can delete the .theano folder from your home directory.

Using the GPU

You must request a GPU node. The Anaconda Python distribution comes with a version of Theano that should work. If you need new Theano features, the development version of Theano can be obtained from the github repository, installed locally, and added to your PYTHONPATH if you are using the preinstalled Python versions. If you have a local python install you can install Theano with

 python setup.py develop

from the repository folder. Theano must be configured to use the GPU. General information can be found in the Theano documentation; a configuration that worked as of June 2015 is a .theanorc file in your HOME directory with the contents:

 [cuda]
 root = /global/software/sl-6.x86_64/modules/langs/cuda/6.5/
 [global]
 device = gpu
 floatX = float32
 [nvcc]
 fastmath = True

Using the CPU

Theano can also run on the CPU. Any of the CPU nodes will work. You will want Theano to be built against the MKL BLAS library that comes with Anaconda, so your .theanorc might look like

 [global]
 device = cpu
 floatX = float32
 [blas]
 ldflags = -lmkl_rt
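
The same settings can be supplied for a single shell session through Theano's THEANO_FLAGS environment variable instead of editing .theanorc, which is convenient for switching between CPU and GPU runs. A minimal sketch, assuming a python on the path that can import theano; the check is skipped otherwise.

```shell
#!/bin/bash
# Per-session equivalent of the CPU .theanorc above.
# Options outside [global] are written as section.option.
export THEANO_FLAGS="device=cpu,floatX=float32,blas.ldflags=-lmkl_rt"

# Print the settings Theano actually picked up, if Theano is importable here.
if command -v python >/dev/null 2>&1 && python -c "import theano" 2>/dev/null; then
    python -c "import theano; print(theano.config.device, theano.config.floatX)"
fi
```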

Obtain GPU lock in python

If you would like to use one of the GPU cards on node n0000 or n0001, please obtain a GPU lock to make sure the card is not in use and that no one else will be using the card.

If you are using Python, you can obtain a GPU lock by running

 import gpu_lock

The function either returns the number of the card you can use (0 or 1) or -1 if both cards are in use.

Obtain GPU lock for Jacket in Matlab

If you are using Matlab, you can obtain a GPU lock by running

 gpu_id = obtain_gpu_lock_id();

By default, obtain_gpu_lock_id() throws an error when all GPU cards are taken. Alternatively, obtain_gpu_lock_id(true) returns -1 when no card is available, and you can then write your own code to handle that case.

ginfo tells you which gpu card you are using.

The following lines should also be in your .bashrc

 ## jacket stuff!
 module load cuda
 export LD_LIBRARY_PATH=/clusterfs/cortex/software/jacket/engine/lib64:$LD_LIBRARY_PATH

Usage Tips

Here are some tips on how to effectively use the cluster.

Embarrassingly Parallel Submissions

Here is an alternate script to do embarrassingly parallel submissions on the cluster.


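 #!/bin/bash
 # iterate.sh: submit one job per parameter combination.
 # param2 and param3 (upper bounds for Epsilon and Beta) must be set before running.
 #Leap Size
 for i in 14 15 16; do
   for j in $(seq .8 .1 $param2); do
     for k in $(seq .65 .01 $param3); do
       echo $i,$j,$k
       qsub -v "LeapSize=$i,Epsilon=$j,Beta=$k" param_test.sh
     done
   done
 done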


 #PBS -q cortex
 #PBS -l nodes=1:ppn=2:gpu
 #PBS -l walltime=10:35:00
 #PBS -o /global/home/users/mayur/Logs
 #PBS -e /global/home/users/mayur/Errors
 cd /global/home/users/mayur/HMC_reducedflip/
 module load matlab
 echo "Epsilon = ",$Epsilon
 echo "Leap Size = ",$LeapSize
 echo "Beta = ",$Beta
 matlab -nodisplay -nojvm -r "make_figures_fneval_cluster $LeapSize $Epsilon $Beta"

Now run ./iterate.sh to submit all the jobs.
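
The scripts above use PBS syntax (qsub, #PBS). A sketch of the same sweep for SLURM follows, assuming param_test.sh has had its #PBS lines translated to the #SBATCH directives shown under "SLURM usage". The bounds eps_max and beta_max and the DRY_RUN switch are hypothetical names introduced here; --export is sbatch's option for passing environment variables to the job.

```shell
#!/bin/bash
# Sketch: SLURM version of the parameter sweep.
# DRY_RUN=1 (the default here) only prints the sbatch commands.
DRY_RUN="${DRY_RUN:-1}"
eps_max="${eps_max:-1.0}"     # hypothetical upper bound for Epsilon
beta_max="${beta_max:-0.67}"  # hypothetical upper bound for Beta

submit() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch $*"
    else
        sbatch "$@"
    fi
}

n_jobs=0
for i in 14 15 16; do
    for j in $(seq 0.8 0.1 "$eps_max"); do
        for k in $(seq 0.65 0.01 "$beta_max"); do
            # ALL keeps the submitting environment; the rest are per-job values.
            submit --export="ALL,LeapSize=$i,Epsilon=$j,Beta=$k" param_test.sh
            n_jobs=$((n_jobs + 1))
        done
    done
done
echo "jobs: $n_jobs"
```

Running it first with the default DRY_RUN=1 shows how many jobs the sweep would create before anything reaches the queue; set DRY_RUN=0 on the login node to submit for real.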

Mounting Cluster File System

Mounting the cluster file system remotely allows you to easily access files on the cluster, and allows you to use local programs to edit code or examine simulation outputs locally (very useful). I often edit the remote code using a text editor running on my local machine. This allows you to take advantage of the niceties of a native editor without having to copy code back and forth before you run a simulation on the cluster.

On linux distributions you can mount your cluster home directory locally using sshfs [3]

 sshfs hadley.berkeley.edu: <mount-dir>
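
A slightly fuller sketch of the mount and unmount cycle on Linux, assuming sshfs and fusermount (from FUSE) are installed locally. The mount point ~/cluster_home and the two function names are hypothetical; the host is the one from the command above.

```shell
#!/bin/bash
# Sketch: mount the cluster home directory with sshfs, unmount with fusermount.
mount_dir="$HOME/cluster_home"    # hypothetical local mount point
remote="hadley.berkeley.edu:"     # trailing colon = remote home directory

mount_cluster() {
    # Create the mount point if needed, then mount.
    mkdir -p "$mount_dir" && sshfs "$remote" "$mount_dir"
}

unmount_cluster() {
    # Detach the FUSE mount when you are done.
    fusermount -u "$mount_dir"
}

# Usage: mount_cluster, work in "$mount_dir", then unmount_cluster
```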

On Mac and Windows machines the program ExpanDrive works well (uses Fuse under the hood): [4]

Support Requests

  • If you have a problem that is not covered on this page, you can send an email to our user list:
  • If you need additional help from the LBL group, send an email to their email list. Please always cc our email list as well. Or visit their website[5].
  • In urgent cases, you can also email Krishna Muriki (LBL User Services) directly.