SLURM Scheduler and batch system#
SLURM is a popular job scheduler that is used to allocate, manage and monitor jobs on an HPC platform. This chapter introduces you to the basics of submitting jobs to a SLURM scheduler.
What is a batch/scheduler system?#
A batch system is a piece of software that deals with collecting job instructions and allocating or scheduling resources to run the tasks. It allows users to submit workloads in the form of jobs, which are then executed automatically and asynchronously on available computing resources. Batch systems generally work by managing a queue of pending jobs and allocating resources based on policies and priorities set by administrators. They also provide features for monitoring, job accounting, and reporting. Batch systems help maximize the utilization of computing resources and improve the efficiency of HPC environments by allowing multiple jobs to run concurrently and by efficiently allocating resources to jobs based on their needs.
What does a batch scheduler do?#
The three key tasks of a job scheduler are:
Understanding what resources are available on the cluster: how many compute nodes are available? what size are those compute nodes? what jobs are currently running on them? etc.
Queuing and allocating jobs to run on compute nodes based on the resources available and the resources specified in the job script. If you submit a job that requests 4 vCPUs for 1 task, Slurm will add the job to the queue and wait for the next available compute node with 4 vCPUs to run it there. Similarly, if you submit a job that requests GPUs or specific memory allocation etc.
Monitoring and reporting the status of jobs: which jobs are in the queue? which jobs are running? which jobs failed? which jobs were completed successfully? which nodes are available? etc.
One of the biggest advantages of running a cluster in the cloud is the ability to easily scale up or scale down the size of your cluster as needed. This not only enables you to power through your jobs more quickly but also provides a great cost-saving benefit where you only pay for compute nodes when they are actively running jobs.
Note
AWS ParallelCluster integrates with Slurm: it monitors the Slurm queues to determine when to request more compute nodes or release compute nodes that are no longer needed. For example, when the current compute nodes are busy with other jobs but more jobs are waiting for resources in the queue, ParallelCluster will assess how many compute nodes are required to run those jobs and add additional compute nodes to the cluster (up to the maximum number specified); it will then remove compute nodes once all running jobs are complete and there are no remaining jobs in the queue. This process of scaling the size of your cluster up and down by adding and removing compute nodes as required is referred to as “Auto Scaling”.
Common batch systems#
There are several commonly used batch systems on HPCs, each with its own strengths and weaknesses. Some of the most popular batch systems include:
Slurm: This is a widely used open-source batch system that provides excellent scalability and job management capabilities.
PBS/Torque: PBS (Portable Batch System) and its open-source variant Torque are widely used in HPC and scientific computing environments.
LSF: IBM’s Load Sharing Facility is a commercial batch system that offers good scalability and scheduling policies.
SGE/OpenGridScheduler: Sun Grid Engine (SGE) is an open-source batch system that is widely used in the academic and research community.
HTCondor: HTCondor is an open-source job scheduling system that is used primarily in high-throughput computing environments.
There are many other batch systems available as well, and the choice of which batch system to use will typically depend on the specific needs and requirements of the organization or research group.
SLURM#
SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager. To run test or production jobs, you submit a job script (see below) to Slurm, which finds and allocates the resources required for your job (e.g. the compute nodes to run your job on).
By default, there is a limit on the number of jobs you can run at once. On Raven, for instance, this limit is set to 8, and the default job submission limit is 300. If your batch jobs cannot run independently from each other, you need to use job steps.
There are mainly two types of batch jobs:
Exclusive, where all resources on the nodes are allocated to the job
Shared, where several jobs share the resources of one node. In this case, the number of CPUs and the amount of memory must be specified for each job.
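As a minimal sketch (the directives are explained in detail below), the two modes typically translate into job-script headers like these:
# Exclusive job: all resources of the node(s) are allocated to this job
#SBATCH --exclusive
# Shared job: request exactly the CPUs and memory you need
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000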
You can find the MPCDF introduction on their documentation page.
Slurm main directives#
Slurm directives are small instructions that tell Slurm how to allocate your jobs on the cluster (i.e. across how many compute nodes, with how many vCPUs, for how long, etc.). These directives are included at the top of your job script and start with the #SBATCH
instruction.
Important
SLURM will ignore all #SBATCH
directives after the first non-comment line. Always put your #SBATCH
parameters at the top of your batch script.
For example, #SBATCH --job-name=myjob
will tell Slurm to name your job myjob
, which can help to monitor its status and potentially link it to other tasks. Some directives also have a short notation: -J
is the shortcut for --job-name.
Tip
Using the full directive names makes your job scripts clearer.
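For instance, the following two directives are equivalent, but the second is easier to read:
#SBATCH -J myjob
#SBATCH --job-name=myjob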
Slurm provides hundreds of possible directives (see documentation) but there are really only a few key ones to get your job done. We list the important directives here.
--cpus-per-task requests the number of vCPUs required per task on the same node. --cpus-per-task=4 will request that each task has 4 vCPUs allocated on the same node.
--gpus-per-node requests the number of GPUs per node. --gpus-per-node=2 will request 2 GPUs per node. GPUs remain in high demand at the moment, so only include this directive if you wish to use GPUs, and ensure your compute nodes have the requested number of GPUs available.
--job-name specifies a name for the job. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script.
--mem, --mem-per-cpu, and --mem-per-gpu specify the memory in MB required per node, per vCPU, or per GPU, respectively. It is important to set one of these directives in a shared-resource context (HPCs), but there are cases where it is not necessary because the full node is allocated to your job (e.g. a cloud cluster).
--nodes requests that a minimum number of nodes is allocated to the job. For example, --nodes=2 will tell Slurm that 2 compute nodes should be allocated for the job. If this directive is not specified, the default behavior is to allocate as many nodes as needed to satisfy the other resource requests, which can lead to under-utilization of resources if the default allocation mode is exclusive (instead of shared).
--ntasks tells the system your job will launch a given number of tasks. --ntasks=16 will tell Slurm that 16 different tasks will be launched from the job script.
Note
ntasks is usually only required for MPI workloads, and it requires the use of the srun command to launch the separate tasks from the job script - see below for some examples and the srun documentation for more information.
--ntasks-per-node requests a certain number of tasks to run on each node. It can be used to run fewer tasks on a node than there are vCPUs, e.g. to avoid running out of memory.
--time sets a limit on the total run time of the job (wall clock). Accepted time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. Queues on HPCs often limit jobs to 24 hours. The limit is also a good precaution against scripts that could get stuck without exiting, e.g. if errors result in infinite loops.
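Putting the key directives together, a minimal job-script header could look like the following sketch (the values are only illustrative):
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --time=02:00:00
# ... module loads and job commands go here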
Important
Job script structure: scripts almost always follow the same pattern: written in bash, with a block of #SBATCH option declarations at the top, then the loading of the necessary modules and libraries, and finally the job commands themselves.
Bash shebang option -l (#!/bin/bash -l): the -l option makes “bash act as if it had been invoked as a login shell”. Login shells read certain initialization files from your home directory, such as .bash_profile, so the environment defined there is available to your job.
Wall clock limit (--time): always set a realistic estimate of how long your code will run instead of using the largest allowed value. This not only makes sure you have an idea of the runtime, it can also help the scheduler move your job up the queue!
srun for parallel processing: as soon as your code uses multiple cores, run it with srun so that the system uses the resources optimally.
Avoid common directive defaults: for the main directives (mem, ntasks, exclusive, etc.), set the requests yourself and do not rely on the administrator having chosen appropriate defaults for your jobs.
Example scripts#
Examples are probably the best way to learn and understand Slurm directives.
Single CPU, non-parallel jobs#
Single CPU, or non-parallel, jobs are common when processing very independent pieces of data such as reducing spectra from many sources. These tasks are often simple commands.
The following script will allocate a single vCPU on one of the available nodes. The actual task here is not scientifically interesting.
#!/bin/bash
#SBATCH --job-name=singlecpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Your script goes here
echo "Hello. Starting my job"
sleep 30
echo "Done. Bye."
If you have many single CPU commands that you want to run at the same time, rather than consecutively on the same vCPU, you can specify a greater number of tasks and then use srun
to allocate each command to a task.
In the example below, both srun
tasks are launched in the background (with &) so they execute simultaneously on two different vCPUs; the final wait ensures the job does not exit before both have finished.
#!/bin/bash
#SBATCH --job-name=singlecpu_multitasks
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
# Your script goes here
srun --ntasks=1 echo "Hello from task 1."
srun --ntasks=1 echo "Hello from task 2."
Note
In the examples above, we did not specify anything about nodes or memory. Slurm may therefore allocate the two tasks to different nodes, depending on where free vCPUs are available.
Multi-threaded jobs#
If your task uses multi-threading, i.e. runs across multiple vCPUs on the same node, you need to request a single node and specify the --cpus-per-task
directive:
#!/bin/bash
#SBATCH --job-name=multithreaded
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
# Your script goes here
## running echo in parallel with 3 concurrent tasks
parallel -k echo ::: A B C
You can also combine multiple tasks, multi-threaded commands, and srun:
#!/bin/bash
#SBATCH --job-name=multithreadedtasks
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
# Your script goes here
# launch the four multi-threaded commands in parallel, one per task
srun --ntasks=1 command1 --threads 4 &
srun --ntasks=1 command2 --threads 4 &
srun --ntasks=1 command3 --threads 4 &
srun --ntasks=1 command4 --threads 4 &
wait   # wait for all four tasks to finish
Important
The specified --cpus-per-task
must be equal to or less than the number of vCPUs available on a single compute node, otherwise the allocation will fail.
MPI (message passing interface) jobs#
MPI jobs can be run across multiple CPUs and multiple nodes. The --ntasks
option then specifies the number of vCPUs (assuming the default of 1 vCPU per task). For example, if you want your MPI job to run across 16 vCPUs:
#!/bin/bash
#SBATCH --job-name=simplempi
#SBATCH --ntasks=16
# Your script goes here
mpirun my_mpi_script
Often, however, you will want to pack your MPI job onto a limited number of nodes (the number of nodes is often the basis for job quota costs).
To run across 2 nodes, using 8 vCPUs per node you can add the --ntasks-per-node
option:
#!/bin/bash
#SBATCH --job-name=nodempi
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
# Your script goes here
mpirun my_mpi_script
Examples of SLURM scripts from my work#
The following runs an MPI batch job without hyperthreading.
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=72
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2
# Run the program:
srun ./myprog > prog.out
The following hybrid MPI/OpenMP example sets its parameters using SLURM environment variables.
#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
# for OpenMP:
#SBATCH --cpus-per-task=18
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00
# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# For pinning threads correctly:
export OMP_PLACES=cores
# Run the program:
srun ./myprog > prog.out
The following shows how to process parts of files and use task arrays to execute them. Task arrays are essentially SLURM's version of for loops.
#!/bin/bash -l
#
# This script is submitted to a SLURM queue
#
# Example Usage:
#
# > sbatch --array=1-10:1 slurm.script.array
#
# This should submit (in principle) <CMD> 1 then <CMD> 2 ...
#
#------------------- SLURM --------------------
#
#SBATCH --job-name=bob_template
#SBATCH --chdir=./
#SBATCH --export=ALL
# output: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --output=logs/%x_%A_%a.stdout
#SBATCH --partition=p.48h.share
#SBATCH -t 48:0:0
#SBATCH --get-user-env
#SBATCH --exclusive=user
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-user do-not-exist-user@mpia.de
#SBATCH --mail-type=ALL
#
# -------------- Configuration ---------------
module purge
module load anaconda/3_2019.10
module load git gcc/7.2 hdf5-serial/gcc/1.8.18 mkl/2019
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/intel64/:${HDF5_HOME}/lib
# Set the number of cores available per process if $SLURM_CPUS_PER_TASK is set
if [ ! -z $SLURM_CPUS_PER_TASK ] ; then
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
else
export OMP_NUM_THREADS=1
fi
# -------------- Commands ---------------
# Checking python distribution
function python_info(){
printf "\033[1;32m * Python Distribution:\033[0m "
python -c "import sys; print('Python %s' % sys.version.split('\n')[0])"
}
# python_info
# get the configuration slice
n_cpus=${SLURM_CPUS_PER_TASK}
n_per_process=1000
slice_index=$((${SLURM_ARRAY_TASK_ID} -1))
processing_start=$((${slice_index} * ${n_per_process}))
processing_end=$((${slice_index} * ${n_per_process} + ${n_per_process}))
CMD="./main_mars"
RUN="testsample"
# RUN="orionsample"
MOD="emulators/oldgrid_with_r0_w_adjacent.emulator.hdf5"
INF="data/testsample.vaex.hdf5"
# INF="data/orionsample.vaex.hdf5"
OUT="./results/${RUN}"
min=$((processing_start))
max=$((processing_end + 1))
NBURN=4000
NKEEP=500
NWALKERS=40
SAMPLES="samples/${RUN}"
CMDLINE="${CMD} -r ${RUN} -m ${MOD} -i ${INF} -o ${OUT} --from ${min} --to ${max} --nburn ${NBURN} --nkeep ${NKEEP} --nwalkers ${NWALKERS}"
echo ${CMDLINE}
${CMDLINE}
Slurm main commands#
This section explains how to submit your job scripts to the Slurm scheduler. Fortunately, you need to remember only a few commands.
Use
sbatch
to submit your job to the queue:
sbatch myjob.sh
Submitted batch job 577791
The submission returns an identifier, which can be helpful to interact with this particular job.
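For example, the identifier can be passed to scontrol to print the full details of that particular job (using the id returned above):
scontrol show job 577791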
Use
squeue
to check the status of the queue or a job in the queue
squeue
[fouesneau@bachelor-login ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
577791 debug el_60 sattler R 4:10:56 1 bachelor-node91
577598 four-wks K4P10R10 klahr R 16-22:25:16 1 bachelor-node24
577755 four-wks n23_1p1_ bergez R 6-21:49:03 1 bachelor-node22
577753 four-wks n22_1p1_ bergez R 6-21:50:42 1 bachelor-node22
577752 four-wks n21_1p1_ bergez R 6-21:51:37 1 bachelor-node22
577749 four-wks n20_1p1_ bergez R 6-21:55:09 1 bachelor-node21
577746 four-wks n19_1p1_ bergez R 6-21:56:51 1 bachelor-node21
577731 four-wks Gas_delt delage R 9-17:12:59 1 bachelor-node62
577730 four-wks Full_del delage R 9-17:23:32 1 bachelor-node62
577729 four-wks Full_del delage R 9-17:28:25 1 bachelor-node62
577728 four-wks Full_del delage R 9-17:33:33 1 bachelor-node62
If you want particular job information, add the --job
option to the squeue
command
squeue --job 577791
[fouesneau@bachelor-login ~]$ squeue --job 577791
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
577791 debug el_60 sattler R 4:12:22 1 bachelor-node91
Use
scancel
to stop or cancel a job:
scancel 577791
Use
sinfo
to get information about the platform. It can be useful to understand why a job does not get allocated.
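A bare sinfo call lists the partitions, their time limits, and the state of their nodes. The output below is only illustrative (partition and node names are reused from the squeue example above; the time limits are made up):
sinfo
PARTITION AVAIL  TIMELIMIT   NODES  STATE NODELIST
debug*       up     2:00:00      2   idle bachelor-node[90-91]
four-wks     up 28-00:00:00     12  alloc bachelor-node[21-32]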
Warning
Codes will often need to run longer than the queues allow. For instance, an N-body simulation like TNG or Aquarius runs over millions of CPU hours for many months. It is therefore critical that your codes have checkpoints, i.e. ways to resume from an intermediate state rather than restarting from scratch (see workflow design).
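A minimal sketch of the checkpoint pattern inside a job script, assuming a hypothetical program that writes a checkpoint.dat file and accepts a --restart option:
# resume from the checkpoint if one exists (checkpoint.dat and --restart are hypothetical)
if [[ -f checkpoint.dat ]]; then
    srun ./myprog --restart checkpoint.dat > prog.out
else
    srun ./myprog > prog.out
fi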
Evaluate job performance post-runtime with sacct#
Slurm can provide various statistics about a given job, including the actual memory and CPU used, IO activity, etc. sacct
is an essential tool to evaluate resource requirements, scaling efficiency, and performance metrics.
In this example, we use the --format
option to request data related to job identification, memory usage, and CPU activity:
fmorg@vera01:~> sacct --user=<...> --format=Start,jobid,jobname,partition,ReqCPUS,alloccpus,elapsed,totalcpu,maxrss,reqmem,state,exitcode
Start JobID JobName Partition ReqCPUS AllocCPUS Elapsed TotalCPU MaxRSS ReqMem State ExitCode
------------------- ------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------
2023-05-22T10:02:05 79201 P2.25Q05H+ p.large 288 288 23:06:52 276-22:03+ 2000000M COMPLETED 0:0
2023-05-22T10:02:05 79201.batch batch 72 72 23:06:52 00:02.373 7360K COMPLETED 0:0
2023-05-22T10:02:05 79201.extern extern 288 288 23:06:52 00:00.007 213K COMPLETED 0:0
2023-05-22T10:02:11 79201.0 pluto 288 288 23:06:46 276-22:03+ 342247K COMPLETED 0:0
...
In this example, I removed the user id but you will need to provide one.
Warning
This example shows a job (79201) that requested 2000000M
of memory, which is huge! Even more dramatic, it used only 342247K
at peak. This is poor practice: in this case, the user probably set the request this high to obtain an entire node. If you need entire nodes, there are better ways to request them (sketched below).
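For instance, if you really need a node to yourself, requesting it explicitly is cleaner than inflating the memory request; a minimal sketch:
#SBATCH --exclusive        # allocate the node(s) exclusively to this job
#SBATCH --mem=0            # on many clusters, 0 requests all the memory of the node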
Job Arrays#
Often jobs consist of numerous identical tasks, for instance a grid search over a range of parameters or the analysis of multiple files. A SLURM job array is the solution: it simplifies your submission script(s) and reduces the load on the scheduler.
Let’s consider a project where we need to process a set of input files. We can:
Write a single script that loops through the input files and executes the processing code on each. This is the most direct approach, but it is sequential: the job could easily exceed the wall time of the queue, and it does not use the multi-CPU capacity of your system.
Write a script that processes a single file and then submit one job per file, passing the filename as a parameter. This is relatively inexpensive to write and uses parallel execution. However, it puts a lot of load on the scheduler when the file list is large, and the scheduler may take a long time to register the many requests.
Write a script that processes a single file and automatically maps a file list onto a single job with subtasks (i.e., an array). By using a job array, we register a single job script and apply it to a large set of input files in a way that does not put undue stress on the job scheduler. This approach also helps you group and organize your workflow, since all of the sub-jobs share the same base job identifier, and it is a good way to develop a script that works equally well for a single file or a large group of files.
Toy example of an array job script#
The following bash snippet creates a set of files with some content.
#!/bin/bash
#
WORK_DIR="`pwd`/array_data"
N_FILES=10
echo "Creating ${N_FILES} files in ${WORK_DIR}"
# Clean up previous example data
if [[ -d ${WORK_DIR} ]]; then rm -rf ${WORK_DIR}; fi
mkdir ${WORK_DIR}
for (( k=0; k< ${N_FILES}; k++ )); do
echo "toy data k=${k}" > ${WORK_DIR}/data_${k}.dat
done
The job script could then look like the following:
#!/bin/bash
#
#SBATCH --job-name=array_example
#SBATCH --ntasks=1
# output: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --output=%x_%A_%a.stdout
# errors: {jobname}_{jobid}_{arraytask}.stderr
#SBATCH --error=%x_%A_%a.stderr
#
# default values
F_PATH="array_data"
F_NAME="toy_data_0.dat"
F_PATH_NAME=${F_PATH}/${F_NAME}
# Make sure we can output our "processing"
if [[ ! -d ${COMPLETED_FILES_PATH} ]]; then
mkdir -p ${COMPLETED_FILES_PATH}
fi
# if a parameter is provided, we assume it is the full pathname.
[[ ! -z ${1} ]] && F_PATH_NAME=$1
# Define the name based on the array index
if [[ ! -z ${SLURM_ARRAY_TASK_ID} ]]; then
    # A perfectly valid solution is
    # F_NAME="data_${SLURM_ARRAY_TASK_ID}.dat"
    # More elegantly, we can also take the file list in F_PATH instead
    fls=( ${F_PATH}/*.dat )
    F_PATH_NAME=${fls[${SLURM_ARRAY_TASK_ID}]}
fi
# do the job:
echo "Hosthame: `hostname`"
echo "File: ${F_PATH_NAME}\n"
cat ${F_PATH_NAME}
sleep 5
Our example script can also run without the scheduler, which is useful for debugging even if it is not needed in production.
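For instance, assuming the job script is saved as array_batch.sh (the name used in the submission below) and the toy data has been created, a single file can be processed directly on the command line:
bash array_batch.sh array_data/data_3.dat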
We submit this script as an array by providing an --array
directive to sbatch
:
sbatch --array=0-9 array_batch.sh
Submitted this way, the SLURM scheduler will generate one job with 10 individual sub-tasks, each with a unique index value. The scheduler provides a SLURM_ARRAY_TASK_ID
environment variable that one can use in various ways in the script. In our example, we use that number as the index into the list of files; we could also have used it to construct the filename directly, as sketched below.
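A minimal sketch of that alternative, using the file naming from the toy data script above:
F_PATH_NAME="array_data/data_${SLURM_ARRAY_TASK_ID}.dat"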
Slurm Environment Variables#
When Slurm schedules a job, it needs to define the conditions to run it: working directory, nodes, number of vCPUs, etc. Slurm passes these settings as environment variables, which can be very useful for setting up your code; e.g. mpirun
will infer the number of vCPUs from the environment without an explicit argument.
The following is a list of commonly used variables that Slurm sets for each job, along with a brief description.

| variable name | description |
|---|---|
| SLURM_CPUS_ON_NODE | Number of cores per node |
| SLURM_CPUS_PER_TASK | Number of cores per task, if --cpus-per-task is specified |
| SLURM_JOB_ID | Job ID |
| SLURM_JOB_NAME | Job name |
| SLURM_JOB_NODELIST | Nodes assigned to the job |
| SLURM_JOB_NUM_NODES | Number of nodes allocated to the job |
| SLURM_LOCALID | Index of the core the task runs on within the node |
| SLURM_NODEID | Index of the node, relative to the nodes assigned to the job |
| SLURM_NTASKS | Total number of tasks (cores, by default) for the job |
| SLURM_SUBMIT_DIR | Submit directory |
| SLURM_TASKS_PER_NODE | Comma-delimited list of integers giving the number of tasks on each node |
The list is much longer; see the online documentation.
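A quick way to check these values is to print a few of them at the top of your job script, for example:
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "Tasks: ${SLURM_NTASKS}, CPUs per task: ${SLURM_CPUS_PER_TASK:-1}"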