SLURM Scheduler and batch system#

SLURM is a popular job scheduler that is used to allocate, manage and monitor jobs on an HPC platform. This chapter introduces you to the basics of submitting jobs to a SLURM scheduler.


What is a batch/scheduler system?#

A batch system is a piece of software that deals with collecting job instructions and allocating or scheduling resources to run the tasks. It allows users to submit workloads in the form of jobs, which are then executed automatically and asynchronously on available computing resources. Batch systems generally work by managing a queue of pending jobs and allocating resources based on policies and priorities set by administrators. They also provide features for monitoring, job accounting, and reporting. Batch systems help maximize the utilization of computing resources and improve the efficiency of HPC environments by allowing multiple jobs to run concurrently and by efficiently allocating resources to jobs based on their needs.

What does a batch scheduler do?#

The three key tasks of a job scheduler are:

  • Understanding what resources are available on the cluster: how many compute nodes are available? what size are those compute nodes? what jobs are currently running on them? etc.

  • Queuing and allocating jobs to run on compute nodes based on the resources available and the resources specified in the job script. If you submit a job that requests 4 vCPUs for 1 task, Slurm will add the job to the queue and wait for the next available compute node with 4 vCPUs to run it there. The same applies to jobs that request GPUs, a specific memory allocation, and so on.

  • Monitoring and reporting the status of jobs: which jobs are in the queue? which jobs are running? which jobs failed? which jobs were completed successfully? which nodes are available? etc.

One of the biggest advantages of running a cluster in the cloud is the ability to easily scale up or scale down the size of your cluster as needed. This not only enables you to power through your jobs more quickly but also provides a great cost-saving benefit where you only pay for compute nodes when they are actively running jobs.

Note

AWS ParallelCluster integrates with Slurm: it monitors the Slurm queues to determine when to request more compute nodes or release compute nodes that are no longer needed. For example, when the current compute nodes are busy with other jobs but more jobs are waiting for resources in the queue, ParallelCluster will assess how many compute nodes are required to run those jobs and add them to the cluster (up to the maximum number specified); it will then remove compute nodes once all running jobs are complete and no jobs remain in the queue. This process of scaling the size of your cluster up and down by adding and removing compute nodes as required is referred to as “Auto Scaling”.

Common batch systems#

There are several commonly used batch systems on HPCs, each with its own strengths and weaknesses. Some of the most popular batch systems include:

  • Slurm: This is a widely used open-source batch system that provides excellent scalability and job management capabilities.

  • PBS/Torque: PBS (Portable Batch System) and its open-source variant Torque are widely used in HPC and scientific computing environments.

  • LSF: IBM’s Load Sharing Facility is a commercial batch system that offers good scalability and scheduling policies.

  • SGE/OpenGridScheduler: Sun Grid Engine (SGE) is an open-source batch system that is widely used in the academic and research community.

  • HTCondor: HTCondor is an open-source job scheduling system that is used primarily in high-throughput computing environments.

There are many other batch systems available as well, and the choice of which batch system to use will typically depend on the specific needs and requirements of the organization or research group.

SLURM#

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager. To run test or production jobs, you submit a job script (see below) to Slurm, which finds and allocates the resources required for your job (e.g. the compute nodes to run it on).

By default, there is a limit on the number of jobs you can run at once. On Raven, for instance, the number of simultaneously running jobs is limited to 8, and the default job submit limit is 300. If your batch jobs cannot run independently from each other, you need to use job steps.

There are mainly two types of batch jobs:

  • Exclusive, where all resources on the nodes are allocated to the job

  • Shared, where several jobs share the resources of one node. In this case, the number of CPUs and the amount of memory must be specified for each job.

You can find the MPCDF introduction on their documentation page.

Slurm main directives#

Slurm directives are small instructions that tell Slurm how to allocate your jobs on the cluster (i.e. across how many compute nodes, with how many vCPUs, for how long, etc.). These directives are included at the top of your job script and start with the #SBATCH instruction.

Important

SLURM will ignore all #SBATCH directives after the first non-comment line. Always put your #SBATCH parameters at the top of your batch script.
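For instance, in the minimal (hypothetical) script below, the second directive appears after the first command and is therefore silently ignored:

#!/bin/bash
#SBATCH --job-name=demo        # parsed: appears before any command
echo "starting"
#SBATCH --time=01:00:00        # ignored: appears after the first non-comment line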

For example, #SBATCH --job-name=myjob will tell Slurm to name your job myjob, which can help to monitor its status and potentially link it to other tasks. Some directives also have a short notation: -J is the shortcut for --job-name.
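Both forms are equivalent, as this small illustration shows:

#SBATCH --job-name=myjob       # long form
#SBATCH -J myjob               # equivalent short form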

Tip

Using the full directive names makes your job scripts clearer.

Slurm provides hundreds of possible directives (see documentation), but there are really only a few key ones to get your job done. We list the important directives here and combine them in a minimal example header after the list.

  • --cpus-per-task requests the number of vCPUs required per task on the same node. --cpus-per-task=4 will request that each task has 4 vCPUs allocated on the same node.

  • --gpus-per-node requests the number of GPUs per node. --gpus-per-node=2 will request 2 GPUs per node. GPUs remain in high demand at the moment, so only include this directive if you wish to use GPUs, and make sure your compute nodes have the requested number of GPUs available.

  • --job-name Specifies a name for the job. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script.

  • --mem, --mem-per-cpu, and --mem-per-gpu specify the memory in MB required for the job per node, per vCPU, or per GPU respectively. It is important to set these directives in a shared-resource context (HPCs), but there are cases where it is not necessary because the full node is allocated to your job (e.g. a cloud cluster).

  • --nodes requests that a minimum number of nodes be allocated to the job. For example, --nodes=2 will tell Slurm that 2 compute nodes should be allocated for the job. If this directive is not specified, the default behavior is to allocate as many nodes as needed to satisfy the other resource requests. This can lead to under-utilization of resources if the default allocation mode is exclusive (instead of shared).

  • --ntasks tells the system how many tasks your job will launch. --ntasks=16 will tell Slurm that 16 different tasks will be launched from the job script.

Note

ntasks is usually only required for MPI workloads and requires the use of the srun command to launch the separate tasks from the job script - see below for some examples and click here for more information on srun.

  • --ntasks-per-node requests a certain number of tasks to run on each node. It can be used to run fewer tasks per node than there are vCPUs, for example to avoid exhausting the memory.

  • --time Sets a limit on the total run time of the job (wall clock). Time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”. Queues on HPCs often limit jobs to 24 hours. The limit is also a good precaution against scripts that could get stuck without exiting, e.g. if errors result in infinite loops.
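Putting the key directives together, a minimal job-script header might look like the following (a hypothetical sketch; adjust the values to your workload and cluster):

#!/bin/bash

#SBATCH --job-name=example          # name shown in the queue
#SBATCH --nodes=1                   # one compute node
#SBATCH --ntasks=1                  # one task
#SBATCH --cpus-per-task=4           # 4 vCPUs for that task
#SBATCH --mem=8000                  # memory per node, in MB
#SBATCH --time=02:00:00             # wall-clock limit (hh:mm:ss)

# Your commands go here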

Important

  • Job script structure: scripts nearly always look the same: written in bash, with a series of #SBATCH option declarations at the top, then the loading of the necessary modules and libraries, and finally the job commands themselves.

  • Bash shebang option -l (#!/bin/bash -l): The -l option makes “bash act as if it had been invoked as a login shell”. Login shells read certain initialization files from your home directory, such as .bash_profile, so the environment variables you define there are available to the job.

  • Wall clock limit (--time): Always set a realistic estimate of how long your code will run instead of requesting the largest allowed value. This not only ensures you have an idea of the runtime, but it can also help your job move up the queue!

  • srun for parallel processing: as soon as your code uses multiple cores, run it with srun so the system uses the resources optimally.

  • Avoid common directive defaults: for the main directives (mem, ntasks, exclusive, etc), you should set the requests yourself and not rely on the administrator to have appropriate defaults for your jobs.

Example scripts#

Examples are probably the best way to learn and understand Slurm directives.

Single CPU, non-parallel jobs#

Single CPU, or non-parallel, jobs are common when processing many independent pieces of data, such as reducing spectra from many sources. These tasks are often simple commands.

The following script will allocate a single vCPU on one of the available nodes. The actual task here is not scientifically interesting.

#!/bin/bash

#SBATCH --job-name=singlecpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Your script goes here
echo "Hello. Starting my job"
sleep 30
echo "Done. Bye."

If you have many single-CPU commands that you want to run at the same time, rather than consecutively on the same vCPU, you can specify a greater number of tasks and then use srun to allocate each command to a task. In the example below, the two srun steps are launched in the background (note the trailing & and the final wait) so that they execute simultaneously on two different vCPUs.

#!/bin/bash

#SBATCH --job-name=singlecpu_multitasks
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1

# Your script goes here
srun --ntasks=1 echo "Hello from task 1." &
srun --ntasks=1 echo "Hello from task 2." &
wait    # wait for both background steps to finish

Note

In the examples above, we did not specify anything about nodes or memory. Slurm can allocate the tasks to two different nodes depending on where there are free vCPUs available.

Multi-threaded jobs#

If your task uses multi-threading, i.e. runs across multiple vCPUs on the same node, you need to request a single node and specify the --cpus-per-task directive:

#!/bin/bash

#SBATCH --job-name=multithreaded
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3

# Your script goes here
## running echo in parallel with 3 concurrent tasks
parallel -k echo ::: A B C

You can also combine multiple tasks with multi-threaded commands using srun:

#!/bin/bash

#SBATCH --job-name=multithreadedtasks
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4

# Your script goes here
srun --ntasks=1 --cpus-per-task=4 command1 --threads 4 &
srun --ntasks=1 --cpus-per-task=4 command2 --threads 4 &
srun --ntasks=1 --cpus-per-task=4 command3 --threads 4 &
srun --ntasks=1 --cpus-per-task=4 command4 --threads 4 &
wait    # wait for all background steps to finish

Important

The specified --cpus-per-task must be equal to or less than the number of vCPUs available on a single compute node, otherwise the allocation will fail.
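You can check how many vCPUs the nodes of each partition provide with sinfo; for instance (a small sketch, where %P, %D, and %c print the partition, node count, and vCPUs per node):

sinfo -o "%P %D %c"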

MPI (message passing interface) jobs#

MPI jobs can be run across multiple CPUs and multiple nodes. The --ntasks option specifies the number of MPI tasks; with the default of 1 vCPU per task, it also sets the number of vCPUs. For example, if you want your MPI job to run across 16 vCPUs:

#!/bin/bash

#SBATCH --job-name=simplempi
#SBATCH --ntasks=16

# Your script goes here
mpirun my_mpi_script

Often, however, you will want to pack your MPI job onto a limited number of nodes (the number of nodes is often the basis for job quota costs). To run across 2 nodes, using 8 vCPUs per node, you can add the --ntasks-per-node option:

#!/bin/bash

#SBATCH --job-name=nodempi
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8

# Your script goes here
mpirun myscript

Examples of SLURM scripts from my work#

The following runs an MPI batch job without hyperthreading.

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=72
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

# Run the program:
srun ./myprog > prog.out

This next example runs a hybrid MPI/OpenMP job and sets the number of OpenMP threads from SLURM environment variables:

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
# for OpenMP:
#SBATCH --cpus-per-task=18
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# For pinning threads correctly:
export OMP_PLACES=cores

# Run the program:
srun ./myprog > prog.out

The following shows how to process slices of a file using a task array. Task arrays are essentially SLURM's version of a for loop.

#!/bin/bash -l
#
# This script submits to a SLURM queue
#
# Example Usage:
#
# 	> sbatch --array=1-10:1 slurm.script.array
#
# This should submit (in principle) <CMD> 1 then <CMD> 2 ...
#
#------------------- SLURM --------------------
#
#SBATCH --job-name=bob_template
#SBATCH --chdir=./
#SBATCH --export=ALL
# output: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --output=logs/%x_%A_%a.stdout
#SBATCH --partition=p.48h.share
#SBATCH -t 48:0:0
#SBATCH --get-user-env
#SBATCH --exclusive=user
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-user do-not-exist-user@mpia.de
#SBATCH --mail-type=ALL
#
# -------------- Configuration ---------------

module purge
module load anaconda/3_2019.10
module load git gcc/7.2 hdf5-serial/gcc/1.8.18 mkl/2019

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/intel64/:${HDF5_HOME}/lib

# Set the number of cores available per process if $SLURM_CPUS_PER_TASK is set
if [ ! -z $SLURM_CPUS_PER_TASK ] ; then
	export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
else
	export OMP_NUM_THREADS=1
fi

# -------------- Commands ---------------

# Checking python distribution
function python_info(){
        printf "\033[1;32m * Python Distribution:\033[0m "
	python -c "import sys; print('Python %s' % sys.version.split('\n')[0])"
	}
# python_info

# get the configuration slice
n_cpus=${SLURM_CPUS_PER_TASK}
n_per_process=1000
slice_index=$((${SLURM_ARRAY_TASK_ID} -1))
processing_start=$((${slice_index} * ${n_per_process}))
processing_end=$((${slice_index} * ${n_per_process} + ${n_per_process}))

CMD="./main_mars"
RUN="testsample"
# RUN="orionsample"
MOD="emulators/oldgrid_with_r0_w_adjacent.emulator.hdf5"
INF="data/testsample.vaex.hdf5"
# INF="data/orionsample.vaex.hdf5"
OUT="./results/${RUN}"
min=$((processing_start))
max=$((processing_end + 1))
NBURN=4000
NKEEP=500
NWALKERS=40
SAMPLES="samples/${RUN}"

CMDLINE="${CMD} -r ${RUN} -m ${MOD} -i ${INF} -o ${OUT} --from ${min} --to ${max} --nburn ${NBURN} --nkeep ${NKEEP} --nwalkers ${NWALKERS}"
echo ${CMDLINE}
${CMDLINE}

Slurm main commands#

This section explains how to submit your job scripts to the Slurm scheduler. Fortunately, you need to remember only a few commands.

  • Use sbatch to submit your job to the queue:

sbatch myjob.sh
Submitted batch job 577791

The submission returns an identifier, which can be helpful to interact with this particular job.
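For instance, you can inspect the full details of a submitted job with scontrol, using the identifier returned by sbatch (a small sketch):

scontrol show job 577791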

  • Use squeue to check the status of the queue or a job in the queue

squeue
[fouesneau@bachelor-login ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            577791     debug    el_60  sattler  R    4:10:56      1 bachelor-node91
            577598  four-wks K4P10R10    klahr  R 16-22:25:16      1 bachelor-node24
            577755  four-wks n23_1p1_   bergez  R 6-21:49:03      1 bachelor-node22
            577753  four-wks n22_1p1_   bergez  R 6-21:50:42      1 bachelor-node22
            577752  four-wks n21_1p1_   bergez  R 6-21:51:37      1 bachelor-node22
            577749  four-wks n20_1p1_   bergez  R 6-21:55:09      1 bachelor-node21
            577746  four-wks n19_1p1_   bergez  R 6-21:56:51      1 bachelor-node21
            577731  four-wks Gas_delt   delage  R 9-17:12:59      1 bachelor-node62
            577730  four-wks Full_del   delage  R 9-17:23:32      1 bachelor-node62
            577729  four-wks Full_del   delage  R 9-17:28:25      1 bachelor-node62
            577728  four-wks Full_del   delage  R 9-17:33:33      1 bachelor-node62

If you want information about a particular job, add the --job option to the squeue command:

squeue --job 577791
[fouesneau@bachelor-login ~]$ squeue --job 577791
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            577791     debug    el_60  sattler  R    4:12:22      1 bachelor-node91
  • Use scancel to stop or cancel a job:

scancel 577791
  • Use sinfo to get information about the platform. It can be useful to understand why a job does not get allocated.
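For example (a minimal sketch; partitions and node names will differ on your cluster):

# overview of partitions and node states
sinfo
# per-node view with CPU, memory, and state details
sinfo --Node --long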

Warning

Codes will often need to run longer than the queues allow. For instance, an N-body simulation like TNG or Aquarius runs over millions of CPU hours for many months. It is critical that your codes have checkpoints, i.e. ways to resume from a mid-point rather than restarting from scratch (see workflow design).

Evaluate job performance post-runtime with sacct#

Slurm can provide various statistics about a given job, including the actual memory and CPU used, IO activity, etc. sacct is an essential tool to evaluate resource requirements, scaling efficiency, and performance metrics.

In this example, we use the --format option to request data related to job identification, memory usage, and CPU activity:

fmorg@vera01:~> sacct --user=<...> --format=Start,jobid,jobname,partition,ReqCPUS,alloccpus,elapsed,totalcpu,maxrss,reqmem,state,exitcode
              Start JobID           JobName  Partition  ReqCPUS  AllocCPUS    Elapsed   TotalCPU     MaxRSS     ReqMem      State ExitCode
------------------- ------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------
2023-05-22T10:02:05 79201        P2.25Q05H+    p.large      288        288   23:06:52 276-22:03+              2000000M  COMPLETED      0:0
2023-05-22T10:02:05 79201.batch       batch                  72         72   23:06:52  00:02.373      7360K             COMPLETED      0:0
2023-05-22T10:02:05 79201.extern     extern                 288        288   23:06:52  00:00.007       213K             COMPLETED      0:0
2023-05-22T10:02:11 79201.0           pluto                 288        288   23:06:46 276-22:03+    342247K             COMPLETED      0:0
...

In this example, I removed the user id but you will need to provide one.

Warning

This example shows a job (79201) that requested 2000000M of memory, which is huge; even more dramatic, it used only 342247K at its peak. This is poor practice. In this case, the user probably set this request to obtain an entire node. If you need entire nodes, there are better ways to request them (see the sketch below).
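For example, if you genuinely need a whole node, it is clearer to request it explicitly than to inflate the memory request (a sketch; policies differ between clusters, so check with your administrators):

#SBATCH --exclusive        # allocate the node(s) exclusively to this job
#SBATCH --mem=0            # request all of the memory available on the node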

Job Arrays#

Often jobs consist of numerous identical tasks, for instance a grid search over a range of parameters or the analysis of multiple files. A SLURM job array simplifies and improves your submit script(s) and reduces the load on the scheduler.

Let’s consider a project where we need to process a set of input files. We can:

  • Write a single script that loops through the input files and executes the processing code on each. This is the most direct approach, but it is sequential: the job could run well beyond the wall time of the queue, and it does not use the multi-CPU capacity of your system.

  • Write a script that processes a single file, and submit one job per file, passing the filename as a parameter. This is relatively inexpensive to write and uses parallel execution, but it puts a lot of work on the scheduler when the file list is large; the scheduler may take a long time to register the many requests.

  • Write a script that processes a single file and automatically maps a file list onto a single job with subtasks (i.e., an array). By using a job array, we register a single job script and apply it to a large set of input files in a way that does not put undue stress on the scheduler. This approach also helps you group and organize your workflow, since all of the sub-jobs share the same base job identifier, and it is a good way to develop a script that works for both a single file and a large group of files.

Toy example of an array job script#

The following bash snippet creates a set of files with some content.

#!/bin/bash
#
WORK_DIR="`pwd`/array_data"
N_FILES=10

echo "Creating ${N_FILES} files in ${WORK_DIR}"

# Clean up previous example data
if [[ -d ${WORK_DIR} ]]; then rm -rf ${WORK_DIR}; fi
mkdir ${WORK_DIR}

for (( k=0; k< ${N_FILES}; k++ )); do
  echo "toy data  k=${k}" >  ${WORK_DIR}/data_${k}.dat
done

The job script could then look like the following:

#!/bin/bash
#
#SBATCH --job-name=array_example
#SBATCH --ntasks=1
# output: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --output=%x_%A_%a.stdout
# errors: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --error=%x_%A_%a.stderr
#

# default values
F_PATH="array_data"
F_NAME="data_0.dat"
F_PATH_NAME=${F_PATH}/${F_NAME}
COMPLETED_FILES_PATH="completed"   # where the "processed" outputs would go

# Make sure we can output our "processing"
if [[ ! -d ${COMPLETED_FILES_PATH} ]]; then
  mkdir -p ${COMPLETED_FILES_PATH}
fi

# if a parameter is provided, we assume it is the full pathname.
[[ ! -z ${1} ]] && F_PATH_NAME=$1
# Define the name based on the array index
if [[ ! -z ${SLURM_ARRAY_TASK_ID} ]]; then
  # A perfectly valid solution is
  # F_PATH_NAME="${F_PATH}/data_${SLURM_ARRAY_TASK_ID}.dat"
  # More elegantly, we can instead index into the list of files in F_PATH
  fls=( ${F_PATH}/*.dat )
  F_PATH_NAME=${fls[${SLURM_ARRAY_TASK_ID}]}
fi

# do the job:
echo "Hosthame: `hostname`"
echo "File: ${F_PATH_NAME}\n"
cat ${F_PATH_NAME}
sleep 5

Our example script can also run without the scheduler, which can be useful for debugging but is not needed in production.

We submit this script as an array by providing an --array directive to sbatch:

sbatch --array=0-9 array_batch.sh

Submitted this way, the SLURM scheduler will generate one job with 10 individual sub-tasks, each with a unique index value. The scheduler provides a SLURM_ARRAY_TASK_ID environment variable that can be used in various ways in the script. In our example, we use that number as the index into the list of files; we could also have used it to construct the filename directly, as in the sketch below.
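For instance, constructing the filename directly from the index could look like this (a small sketch matching the toy data above):

# build the input filename from the array task index
F_PATH_NAME="array_data/data_${SLURM_ARRAY_TASK_ID}.dat"
cat ${F_PATH_NAME}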

Slurm Environmental Variables#

When Slurm schedules a job, it needs to define the conditions to run it: directory, nodes, number of vCPUs, etc. Slurm passes these pieces of information as environment variables, which can be very useful for setting up the code; e.g. mpirun will infer the number of vCPUs from the environment without an explicit argument.

The following is a list of commonly used variables that are set by Slurm for each job, along with a brief description and a sample value.

| Variable name | Description | Example values |
| --- | --- | --- |
| $SLURM_CPUS_ON_NODE | Number of cores/node | 8,3 |
| $SLURM_CPUS_PER_TASK | Number of cores per task if --cpus-per-task is set | 8 |
| $SLURM_JOB_ID | Job ID | 5741192 |
| $SLURM_JOB_NAME | Job name | array_example |
| $SLURM_JOB_NODELIST | Nodes assigned to job | compute-[1-3,5-9],gpunode-[1,4] |
| $SLURM_JOB_NUM_NODES | Number of nodes allocated to job | 2 |
| $SLURM_LOCALID | Index of the core running on within the node | 4 |
| $SLURM_NODEID | Index of the node running on, relative to the nodes assigned to the job | 0 |
| $SLURM_NTASKS | Total number of cores for the job | 24 |
| $SLURM_SUBMIT_DIR | Submit directory | ~user/work |
| $SLURM_TASKS_PER_NODE | Comma-delimited list of integers giving the tasks per node | 2(x3),1 |

The list is much longer; see the online documentation.
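A quick way to see what Slurm defines for a given job is to print a few of these variables from the batch script (a minimal sketch; the job name is arbitrary):

#!/bin/bash
#SBATCH --job-name=envcheck
#SBATCH --ntasks=2

# print a few Slurm-provided variables for debugging
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) runs on ${SLURM_JOB_NODELIST}"
echo "Nodes: ${SLURM_JOB_NUM_NODES}, tasks: ${SLURM_NTASKS}"
echo "Submitted from ${SLURM_SUBMIT_DIR}"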