Overview
High Performance Computing (HPC) is the aggregation of computing power and memory to perform complex calculations in parallel, increasing the speed and efficiency of computer simulations and data analysis. In 2018, a collaboration of faculty in the social and natural sciences and ITS staff successfully secured a $150,000 grant from the National Science Foundation to build Middlebury's first HPC cluster. Dubbed "Ada" in honor of Ada Lovelace, the famed 19th century mathematician, the cluster is a tool intended to support the research efforts of faculty who rely on access to expanded computing resources. We continue to add to our collaboration as resources become available.
This article describes the cluster structure and how to use it. The cluster is a shared resource, so we use queuing software (called Slurm) to manage job processing and to ensure fair access. Below are basic instructions for logging in to the cluster, accessing the queue and writing scripts to work efficiently and within best practices for a shared computing resource.
Cluster users must include an acknowledgement of NSF funding in any published research, as quoted below:
"This material is based upon work supported by the National Science Foundation under Grant No. 1827373.”
Please email the principal investigator, Professor Amy Yuen, with publication information for grant reporting purposes.
Access
A managing group of faculty and staff have developed policies for various types of users. All users must agree to these policies and submit this form before obtaining access. The working group periodically offers training sessions for students and faculty interested in learning how to access the cluster and work with the queueing software. Users may indicate interest in these training sessions using this form.
Hardware
The HPC cluster consists of 20 nodes. There are 18 compute nodes with a cumulative total of 632 processors. It includes 16 nodes with 96GB of RAM each, one node with 256GB, and one node with 768GB. In addition, the HPC cluster has two dedicated graphics processing unit (GPU) nodes with 256GB and 96GB of RAM.
Software
Guidelines
Expectations and Support for Users
All HPC users will be expected to accept the standard Middlebury Code of Conduct relating to information and technology as well as a general set of best practices specific to the cluster. These are included in this HPC article. Additionally, faculty who have little or no experience using a shared computing cluster are strongly urged to participate in the periodic training sessions offered by ITS staff and HPC affiliated faculty.
Cluster Use Principles
The use of the Ada cluster is governed by all the policies that apply to Middlebury’s Information Technology (http://www.middlebury.edu/about/handbook/policies-for-all/appropriate-use/info-tech) and the following principles:
- The Ada cluster supports the research and educational missions of Middlebury College. Users agree to only run computational jobs related to those missions. For example, cryptocurrency mining for financial gain or commercial use of the cluster is not appropriate.
- The Ada cluster is a shared resource. Running computations that consume large portions of the cluster for extended periods (including consuming large portions of the available disk space) could prevent others from using this community resource. Exercise care in how you use the Ada cluster to be respectful of other community members’ interest in using the system.
- You are entirely responsible for any data you place on the cluster. You agree that your data management practices are in accordance with Middlebury’s policies and any applicable regulations or agreements, e.g. HIPAA, data use agreements, etc.
- The Ada cluster is intended for data analysis, not data storage. Data is not backed up. Data that is no longer needed should be promptly deleted to ensure there is sufficient disk space for everyone.
- You agree to respect the privacy of other users, e.g. by not exploring directories owned by other users even if those directories are accessible to you.
- You are expected to report any security incidents or abuse to ITS immediately. Examples of security incidents include but are not limited to: unauthorized access or use, compromised accounts (including "shared" login credentials), and misuse of data.
Users whose behavior runs counter to these principles may be asked by cluster administrators to leave the cluster.
Training
Logging in
- You must be behind the Middlebury firewall to log in to ada
- You can login via ssh (Secure Shell):
ssh username@ada
- "username" is your Middlebury username. If your username on the computer you're logging in from is also your Midd username (e.g. if you're using a college owned computer), then you can just use the command ("ssh ada").
- You will be prompted for your Middlebury password. After you enter it, you will have a Linux command prompt on the head node "ada".
- You are now in your home directory on ada. From here you can navigate the filesystem using standard Linux commands. For example, we can make a directory:
mkdir test_job
- While it's not necessary, for convenience you can consider setting up public key authentication from your laptop or desktop; this will allow you to log in securely without entering your password.
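- For example, on macOS or Linux you can set up key-based login with the standard OpenSSH tools (a minimal sketch; the key type shown is just a common choice):
ssh-keygen -t ed25519
ssh-copy-id username@ada
The first command generates a key pair (accept the defaults and choose a passphrase); the second copies your public key to ada, after which ssh will authenticate with your key instead of your password.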
Editing and otherwise working with files
There are a number of approaches to working with your files on ada. Some examples include:
- Connect via SSH as described above and edit any files in the terminal
- Use an SSH extension within your editor to edit files on the remote machine as though they were local. An example is Visual Studio Code with the Remote - SSH extension. In this model all files are stored on ada but the editor appears to run locally.
- Use a third-party tool, like MobaXterm for Windows or SSHFS for OSX/Linux, that makes ada appear to be a network drive
- Develop locally and copy your files from your personal computer using rsync or other similar command
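- For example, a project directory could be copied from your computer to your home directory on ada via rsync (a sketch; the directory name is just an illustration):
rsync -av my_project/ username@ada:~/my_project/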
Submitting jobs via the Slurm scheduler
To run jobs on the cluster, you must submit them via a script to the Slurm scheduler (https://slurm.schedmd.com). A summary of slurm commands and options can be found here.
Basic slurm script
- We have the basic slurm script shown below in the text file "slurm_serial.sh":
#!/usr/bin/env bash
# slurm template for serial jobs
# Set SLURM options
#SBATCH --job-name=serial_test # Job name
#SBATCH --output=serial_test-%j.out # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu # Where to send mail
#SBATCH --mail-type=NONE # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb # Job memory request
#SBATCH --partition=standard # Partition (queue)
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`
# Your calculations here
printf "\nHello world from ${SLURMD_NODENAME}!\n\n"
# End of job info
echo "Ending: "`date +"%D %T"`
- A list of environment variables that can be configured in the slurm submit script is here.
Submitting jobs
- Jobs are submitted to the slurm scheduler via the sbatch command:
sbatch slurm_serial.sh
- A list of options for sbatch can be found here.
Monitoring jobs
- You can monitor the status of jobs in the queue via the squeue command:
squeue
- You can review which nodes are assigned to which queues and which nodes are idle via the sinfo command:
sinfo
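- To show only your own jobs, you can pass your username to squeue:
squeue -u $USER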
Parallel Jobs
Array jobs
If a serial job can easily be broken into several (or many) independent pieces, then it's most efficient to submit an array job, which is a set of closely related serial jobs that will all run independently.
- To submit an array job, use the slurm option "--array". For example "--array=0-4" will run 5 independent tasks, labeled 0-4 by the environment variable SLURM_ARRAY_TASK_ID.
- To allow each array task to perform a different calculation, you can use SLURM_ARRAY_TASK_ID as an input parameter to your calculation.
- Each array task will appear as an independent job in the queue and run independently.
- An entire array job can be canceled at once or each task can be canceled individually.
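- For example (with an illustrative job ID), the scancel command can cancel the whole array or a single task:
scancel 12345
scancel 12345_3
The first command cancels every task of array job 12345; the second cancels only task 3.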
Here is a simple example of a slurm array job script:
#!/usr/bin/env bash
# slurm template for array jobs
# Set SLURM options
#SBATCH --job-name=array_test # Job name
#SBATCH --output=array_test-%A-%a.out # Standard output and error log
#SBATCH --mail-user=username@middlebury.edu # Where to send mail
#SBATCH --mail-type=NONE # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mem=100mb # Job memory request
#SBATCH --partition=standard # Partition (queue)
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --array=0-4 # Array range
# print SLURM environment variables
echo "Job ID: ${SLURM_JOB_ID}"
echo "Array ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node: ${SLURMD_NODENAME}"
echo "Starting: "`date +"%D %T"`
# Your calculations here
printf "\nHello world from array task ${SLURM_ARRAY_TASK_ID}!\n\n"
# End of job info
echo "Ending: "`date +"%D %T"`
An example of how a serial job can be broken into an array job is on the HPC Github repository (see below).
Shared memory or multi-threaded jobs
If your code can take advantage of multiple CPU cores via multi-threading, you can request multiple CPU cores on a single node for your job in the slurm script via the "--cpus-per-task" option. For example specifying:
#SBATCH --cpus-per-task=8 # Number of CPU cores for this job
in the slurm script would request 8 CPU cores for the job. The standard CPU compute nodes have 36 cores per node, so you can request up to 36 cores per job. All cores will be on the same node and share memory, as if the calculation were running on a single standalone workstation.
Note that your code must be able to take advantage of the additional CPU cores that slurm allocates--if you request multiple cores for a purely serial code (i.e. that can only use 1 CPU core) the additional CPU cores will remain idle.
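For example, code parallelized with OpenMP typically needs to be told how many threads to use; a common pattern in the slurm script is to set OMP_NUM_THREADS from the slurm allocation (a sketch; the executable name is just an illustration):
#SBATCH --cpus-per-task=8 # Number of CPU cores for this job
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} # use all allocated cores
./my_threaded_program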
An example of shared memory parallelization is available in the GitHub repository in the "multithread-example" directory.
Multi-node (MPI) jobs
The cluster uses the Open MPI implementation to run multi-node (MPI) jobs. Only programs that are specifically compiled to use MPI are able to run in parallel on multiple nodes.
1. Load the Open MPI module (our installation uses the GCC compiler):
module load openmpi3
2. Compile the file using the appropriate Open MPI compiler wrapper
C: mpicc
C++: mpicxx
Fortran: mpifort
For example, to compile a C file:
mpicc <filename>.c -o <filename>
3. Run an Open MPI script:
mpirun ./<filename>
where <filename> is the name of the executable created when the source file was compiled. mpirun accepts command-line options such as -np 5, which launches 5 copies of the executable; within a slurm job you typically request nodes and tasks via #SBATCH options (e.g. --nodes and --ntasks-per-node), as shown in the sketch below.
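A minimal slurm script for an MPI job might look like the following sketch (the node and task counts and the executable name are illustrative; adjust them for your calculation):
#!/usr/bin/env bash
#SBATCH --job-name=mpi_test # Job name
#SBATCH --output=mpi_test-%j.out # Standard output and error log
#SBATCH --partition=standard # Partition (queue)
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=36 # MPI processes per node
#SBATCH --time=00:10:00 # Time limit hrs:min:sec
module load openmpi3
mpirun ./<filename>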
GPU jobs
There is a single GPU compute node (with 4 GPUs) which is accessible via the gpu-standard, gpu-short, and gpu-long queues. All GPU jobs must be submitted to one of these queues via the --partition option. You should also specify the number of GPUs your job will use via the --gres option (short for "Generic Resources"). For example, to use one GPU:
#SBATCH --partition=gpu-standard # Partition (queue)
#SBATCH --gres=gpu:1 # Number of GPUs
By setting the --gres option, Slurm will assign your job specific GPU(s), enabling multiple jobs/users to run concurrently on the GPU node (each using different GPUs). Your program will only see the GPU(s) allocated to it.
Note that the GPU node has fewer CPU cores (16) than the other nodes, so be sure to adjust your --cpus-per-task option accordingly (request no more than 16 cores on this node).
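Putting these options together, a minimal GPU job script might look like the following sketch (the executable name is illustrative):
#!/usr/bin/env bash
#SBATCH --job-name=gpu_test # Job name
#SBATCH --output=gpu_test-%j.out # Standard output and error log
#SBATCH --partition=gpu-standard # Partition (queue)
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH --cpus-per-task=4 # CPU cores (the GPU node has only 16)
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
./my_gpu_program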
Large Memory jobs
Standard CPU compute nodes have a total of 96 GB of RAM, so you can request up to 96 GB for jobs submitted to the standard, short or long queues. In your slurm submit script you should specify the amount of memory needed via the --mem option. For example, include the line:
#SBATCH --mem=2gb
to request 2 GB for a job. If your job requires more than 96 GB of RAM, you will need to use the high memory node, which has 768 GB of RAM. To access the high memory node, submit to one of the himem-standard, himem-short, or himem-long queues. For example, including the options:
#SBATCH --partition=himem-standard # Partition (queue)
#SBATCH --mem=128gb # Job memory request
would request 128GB of RAM using the himem-standard queue.
Storage
- Each user has a home directory located at /home/$USER, where $USER is your Middlebury username, and also accessible via the $HOME environment variable. Each user has a quota of 50 GB in their home directory.
- Additionally, each user has a storage directory located at /storage/$USER, which is also accessible via the $STORAGE environment variable. The quota on each user's storage directory is 400 GB.
- The home directory has a fairly small quota as it is only intended for storage of scripts, code, executables, and small parameter files, NOT for data storage.
- Data files should be stored in the storage directory.
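- For example, a large data file could be moved from your home directory to your storage directory (the file name is just an illustration):
mv $HOME/large_dataset.dat $STORAGE/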
Local scratch storage
Home and storage directories are located on separate nodes (the head node and storage nodes) and only mounted remotely to each compute node via ethernet. For jobs that need to frequently read/write significant amounts of data to disk, it may be advantageous to read/write to the local scratch space on each compute node which will be much faster to access.
Local scratch directories for each user are available at /local/$USER, which is also accessible via the $SCRATCH environment variable.
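A typical pattern inside a slurm script is to copy input data to local scratch, run the calculation there, and then copy the results back to persistent storage (a sketch; the file and program names are illustrative):
cp $STORAGE/input.dat $SCRATCH/
cd $SCRATCH
./my_program input.dat > output.dat
cp output.dat $STORAGE/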
Checkpointing
Checkpointing your jobs running on ada is recommended. Checkpointing stores the internal state of your calculation periodically so the job can be restarted from that state, e.g. if the node goes down or the wall clock limit is reached. Ideally, checkpointing is done internally in your application (it is built into many open source and commercial packages); if your application doesn't support checkpointing internally, you can use an external checkpointing tool such as dmtcp. Here we'll illustrate an example of using external checkpointing via dmtcp, found in the directory "ckpt-example" on the GitHub repository.
- We'll illustrate checkpointing using a simple counter. First compile the executable "count" from the source code "counter.c" via:
gcc counter.c -o count
Now you should see the executable file "count". Take a look at the slurm script slurm-ckpt-start.sh. The key line is:
timeout 15 dmtcp_launch --no-coordinator -p 0 -i 10 ./count
- "timeout" is a standard linux utility that will automatically stop whatever command that follows; the "15" is the length of time before the process is killed in seconds. You can also use units of days and hours, eg. "timeout 47h". Timeout is not necessary for checkpointing, but it lets you stop your job before the wall clock limit is reached and slurm kills your job.
- "dmtcp_launch" is the command to start running your executable (in this case count) through the dmtcp checkpointing tool. We suggest you always use the "--no-coordinator -p 0" options to avoid interference with other jobs.
- The "-i" option sets the frequency that dmtcp will store the state of you process to a checkpoint file. "-i 10" checkpoints the file every 10 seconds--much more frequently than you would ever want to do in practice (this is just so the example goes quickly). More reasonable for an actual job would be "-i 3600" to checkpoint once an hour.
- In practice, the checkpointing syntax for "your_executable", might be something like:
timeout 47h dmtcp_launch --no-coordinator -p 0 -i 3600 your_executable
- Now submit the slurm script "slurm-ckpt-start.sh":
sbatch slurm-ckpt-start.sh
- Once that job has completed, you should see a checkpoint file of the form "ckpt_count_*.dmtcp". Your job can be restarted using the "dmtcp_restart" command, as found in "slurm-ckpt-restart.sh":
sbatch slurm-ckpt-restart.sh
- You can restart and continue the job any number of times via the same restart script, e.g. try submitting the restart script a second time:
sbatch slurm-ckpt-restart.sh
Sample jobs
Breaking a serial job into an array job
An example of using array jobs is in the directory "array_job_example" on the HPC Github repository
- The python script factor_list.py will find the prime factors of a list of integers, e.g. the 12-digit numbers in the file "sample_list_12.dat":
python factor_list.py sample_list_12.dat
- To factor all 20 numbers in "sample_list_12.dat" as a single serial job (which will take several minutes), submit the slurm script "serial_factor.sh":
sbatch serial_factor.sh
- The factors will be stored in "serial_factors_out.dat"
- The slurm script "array_factor.sh" breaks the calculation up into a 10 task array job:
sbatch array_factor.sh
- Each array task stores the results in the file "array_factors_out-${SLURM_ARRAY_TASK_ID}.dat" where the task array ID runs from 0-9.
- After all the array tasks are complete, the data can be combined into a single file, e.g. array_factors_out.dat:
cat array_factors_out-?.dat > array_factors_out.dat
- You can check that both methods give you the same result via diff:
diff serial_factors_out.dat array_factors_out.dat
Serial Stata job
The primary difference between using Stata on the cluster and using Stata on your computer is learning how to run Stata in batch mode, that is, non-interactively. To use Stata on the cluster, you will need a shell script (*.sh) that inserts your Stata process into the Slurm queue and runs your Stata do file from the command line. You need basic Unix command skills, basic Slurm syntax and a Stata do file.
You can log in to Middlebury's HPC repository on GitHub to see executable examples of both a serial Stata job and a parallel Stata job in the "Stata-examples" directory. A serial Stata job is the simplest, using a single processor on a single node to execute your calculations. Most Stata users will need the parallel computing capabilities if they are turning to the cluster for their calculations. Both the serial and parallel examples use "stata_auto.do" as the sample do file, so be sure to download it as well. Copy the shell script and do file to your home directory on Ada. The command to run the serial shell script is:
sbatch stata_serial.sh
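Inside "stata_serial.sh", the key line runs the do file in batch (non-interactive) mode; it looks roughly like the sketch below (the exact Stata binary name on the cluster may differ, so check the example script in the repository):
stata-mp -b do stata_auto.do
The "-b" option runs the do file without opening an interactive session and writes the output to a log file (stata_auto.log).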
Parallel Stata job
Because we are using Stata MP (multiprocessor), the program already has built-in multiprocessor capabilities. Our license allows us to use up to 16 processors. Stata will automatically use as many processors as it can "see", which is where the specifications in Slurm (the queuing software) are important. There is a single difference between the serial job syntax and the parallel job syntax for Stata, and that is to change "#SBATCH --cpus-per-task=1" to "#SBATCH --cpus-per-task=16" in the shell script, which tells Stata there are 16 computing processors available (see the above section on "Shared memory or multi-threaded jobs").
Copy the example script and do file to your home directory on Ada and type the following command:
sbatch stata_parallel.sh
Modules
Ada uses Environment Modules to manage specialized software. Modules are short scripts that automatically configure your environment (i.e. set the PATH and other environment variables). You can view the available modules via the command:
module avail
Modules can be loaded via "module load", e.g.
module load python/anaconda2
If your job needs a module which is not loaded by default, you must load the appropriate module in your slurm submit script.
The python/anaconda2 module is the anaconda2 file in the python directory. You can create your own module files by creating a directory to contain your module files, e.g. modulefiles, and then subdirectories for each program and module files for each version. The "module use" command will add your module files to the modules search path, e.g.
module use $HOME/modulefiles
You will need to execute the "module use" command every time you log in. Doing so at every login is tedious, so instead you can add the "module use" command to your .bash_profile file.
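For example, you could append that line to your .bash_profile once so it runs automatically at every login:
echo 'module use $HOME/modulefiles' >> ~/.bash_profile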
Git repository
Sample slurm scripts and example jobs are available in the GitHub repository:
https://github.com/middlebury/HPC
You can clone a copy of this repository to your home directory (or elsewhere) via the command:
git clone https://github.com/middlebury/HPC.git
Best practices
- Do NOT run calculations on the head node! All calculations need to be submitted to the scheduler via slurm.
- Data files should be stored in the $STORAGE directory, not $HOME.
- Use array jobs when a calculation can be split into independent pieces.
- Checkpoint your jobs either internally, or externally via dmtcp.
- Only request the memory you'll actually use (plus a modest buffer for error).
- Use the $SCRATCH directory for frequent read/writes during the calculation.