HPC for MPIA#

As MPIA employees, we have access to a variety of computation facilities and resources. In this chapter, we focus on the MPIA resources.

Type of resources#

More specifically, in addition to your laptop or desktop, MPIA gives you remote access to

Dedicated high-spec machines (astro-node)

Projects sometimes plan their computation needs and invest in dedicated, powerful hardware. For instance, more and more projects invest in GPU capacity or high-memory machines. At MPIA, these machines are commonly set up as an “astro-node.” These machines are equivalent to a single node of an HPC system. The Astro-nodes at MPIA comprise 2,692 CPUs and some limited shared file systems. They represent a pool of (powerful) “desktop” computers. We will not discuss the Astro-nodes further in this chapter.

In-house clusters

At MPIA, we have 5 in-house clusters of heterogeneous platforms, funded by and accessible to individual groups. Together they represent ~2,300 CPUs across the systems (BEEHIVE, BACHELOR_NEW, ANTHILL, GEMINI_NEW, KARUN). Users submit their jobs/tasks to a queue system (SLURM) that allocates the resources and runs the workflows when possible. These machines are accessible only to MPIA employees (after requesting accounts).

MPCDF Shared High-performance computing (HPC) resources

MPIA staff have access to the resources of the MPCDF (Max Planck Computing and Data Facility; formerly RZG), a cross-institutional center of the MPG that supports computational and data sciences. All MPI employees (after requesting an MPCDF account) can use some clusters. Access to some machines may be restricted depending on who funded them.

Between RAVEN, COBRA, and VERA, the MPCDF provides us with over 250,000 CPUs and ~900 GPUs (across ~5,000 nodes). These clusters constantly run at ~90% capacity, with MPIA representing about 4 to 8% of the computation load across the 25 MPIs. Users must submit their jobs to the queue system and wait for execution. Jobs are limited to 24 hours of execution unless special permission is granted. In addition, the number of running jobs per user is capped at 100, with at most 500 in the queue system in total. Within these resources, the MPCDF maintains the MPIA-dedicated HPC system, VERA, which offers ~4,000 CPUs and 12 GPUs with interconnected nodes of diverse memory sizes. This machine is idle only ~10-15% of the time on average per year, is used by MPIA employees only, and provides on average 2.3M core hours per month.

  • RAVEN (2021)

    • 1592 nodes

    • 114,624 Intel Xeon IceLake-SP CPU-cores, 375 TB RAM (DDR4),

    • 768 Nvidia A100 GPUs (192 nodes), 30 TB GPU RAM (HBM2)

  • COBRA (2018)

    • 3424 nodes

    • 136,960 Intel Xeon Skylake-SP CPU cores, 529 TB CPU RAM (DDR4),

    • 128 Tesla V100-32 GPUs, 240 Quadro RTX 5000 GPUs, 7.9 TB GPU RAM HBM2

  • VERA (2022)

    • login nodes vera[01-02] (500 GB RAM each)

    • 72 execution nodes vera[001-072] (250 GB RAM each)

    • 36 execution nodes vera[101-136] (500 GB RAM each)

    • 2 execution nodes vera[201-202] (2 TB RAM each)

    • 3 execution nodes verag[001-003] (500 GB RAM and 4 Nvidia A100-40GB GPUs each)

    • node interconnect is based on Mellanox/Nvidia Infiniband HDR-100 technology (Speed: 100 Gb/s)

    • Largest possible single batch job: 1680 parallel tasks (48hrs max)

    • 400 TB filesystem (for the GC, PSF, and APEX, separately, with quotas)

Germany/EU HPC facilities

Of course, we are not limited to MPG-related centers. There are various supercomputing centers across Germany, Europe, and beyond. Anyone can apply for access through computation proposals.

What about cloud Computing?#

Cloud computing is the on-demand availability of computer system resources. Large clouds often have functions distributed over multiple locations, each being a data center.

The cloud is a generic term commonly used to refer to remote computing resources of any kind – that is, any computers that you use but are not right in front of you. Cloud can refer to machines serving websites, providing shared storage, providing webservices (such as e-mail or social media platforms), as well as more traditional “compute” resources.

Cloud computing is like a virtual cluster, capable of instantiating nodes on-demand to provide services and run tasks.

The MPCDF provides an HPC cloud facility. Their HPC-Cloud service provides virtualized compute and storage resources of up to hundreds of CPU cores, dozens of GPUs, and 1,000 TB of disk space. This service is similar to a pay-as-you-go virtual server room. Their offering focuses on allocating (virtual) machines for longer periods (e.g., 6 months, a year) without the on-site maintenance and costs. This contrasts with commercial cloud computing, where resources are typically allocated for a task and released after hours of calculations.

GWDG also provides a cloud service that interfaces with major European and international providers.

These resources are actively growing; feel free to approach the MPIA IT and Data Science teams for more information.

The rest of this chapter focuses on the MPCDF resources, but many of the practices apply to other platforms.

Remote connection to the entry nodes#

The first step on these platforms is to connect to the entry nodes, or interactive nodes. These are the machines that allow interactive use and the submission of jobs to the system.

HPC platforms implement strict security rules to prevent misuse of these powerful systems.

MPCDF implements a two-factor authentication (2FA) mechanism and filters HPC connections through a couple of gateways. The gateway machines gatezero.mpcdf.mpg.de and gateafs.mpcdf.mpg.de provide SSH access to MPCDF computing resources. Note that gatezero has no access to AFS, and the home directory $HOME is local to that machine and very limited in size. You can find details in the gateways' online documentation.

SSH connection#

SSH, or Secure Shell, is a network communication protocol that is standard on most computing systems. You can log in to the MPCDF gateway using the SSH protocol with

> ssh -XY <your_username>@gate.mpcdf.mpg.de

This should prompt you for your 2FA code as well.

The options -XY ensure that any Unix graphical output (e.g., if you run a GUI, look at an image, or plot a graph) is sent back to your own laptop. You may need to specify these options explicitly.

From the gateway, you need to reach your final destination, the cluster of your choice, e.g., RAVEN:

> ssh raven
[...]
fmorg@raven01:~>

Setup your ssh_config#

This multi-hop connection to an HPC system can often become a limitation, for example when you have to transfer files in or out of the platform. In addition, the 2FA authentication can rapidly become an overhead.

Instead, you can set up your computer to connect directly to the machines by adding a few lines to your ~/.ssh/config file. In particular, you can make it hop from the gateway to the final computer automatically.

For example,

Host raven
        User <your_username>
        Hostname raven.mpcdf.mpg.de
        ProxyCommand ssh -W %h:%p <username>@gate.mpcdf.mpg.de

Host cobra
        User <your_username>
        Hostname cobra.mpcdf.mpg.de
        ProxyCommand ssh -W %h:%p <username>@gate.mpcdf.mpg.de

Host vera01
        User <your_username>
        Hostname vera01.bc.rzg.mpg.de
        ProxyCommand ssh -W %h:%p <username>@gate.mpcdf.mpg.de
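On OpenSSH 7.3 and later, the ProxyJump directive is a simpler equivalent to the ProxyCommand lines above. A minimal sketch for one host (same placeholders as above):

```
Host raven
        User <your_username>
        Hostname raven.mpcdf.mpg.de
        ProxyJump <your_username>@gate.mpcdf.mpg.de
```

ProxyJump handles the gateway hop internally, so there is no need to spell out the ssh -W command.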

If you want to avoid entering your passwords and 2FA codes for every connection, you can also make your computer reuse an authenticated connection for a given time period with the Control* options.

For example,

Host *
    ControlMaster auto
    ControlPersist 10h
    ControlPath ~/.ssh/master-%C

With these instructions, we set SSH to reuse the connection to a remote server and share multiple sessions over a single network connection (ControlMaster). When set, your SSH client listens for connections on a control socket specified by the ControlPath argument. These sessions reuse the master instance's network connection rather than initiating new ones, but fall back to connecting normally if the control socket does not exist or is not listening.

The ControlPersist option sets how long the master connection stays open in the background after the last session closes. Setting 10 hours gives you a full day of work without needing to enter your passwords again.
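With a master connection active, the -O flag lets you inspect or close the shared connection explicitly (using the raven host alias from the configuration above):

```
> ssh -O check raven    # is a master connection running for this host?
> ssh -O exit raven     # close the shared connection and its control socket
```

This is useful if a stale control socket prevents a new connection from being established.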

SSH keys#

Warning

MPCDF does not support SSH keys, in order to increase security. This means you must enter your passwords and 2FA codes to log into the remote servers. ControlPersist reduces this overhead.

SSH keys are a set of cryptographic keys that can be used for authentication without the need for a password. Each set of keys contains a public key and a private key, and they work together to provide a more secure way of logging into remote servers. The private key is kept on the local computer, while the public key is stored in the server's authorized_keys file. To log into the server, the user only needs to have possession of the matching private key. SSH keys are more secure than passwords because they are much more difficult to guess or crack, and they can be easily rotated if compromised.

To generate and use SSH keys, you need first to create your own private and public key pair.

Open your terminal or command prompt on your local computer and run the ssh-keygen command to generate a new SSH key pair. It will prompt you to specify a filename for the key pair and an optional passphrase for added security. The command generates two files in your ~/.ssh/ directory: a public key file (usually named id_rsa.pub) and a private key file (usually named id_rsa). The public key can be shared freely; it can only be used for authentication when paired with your private key.
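As a minimal sketch (the directory, filename, and comment are illustrative), you can generate a key pair non-interactively. Here we use the modern Ed25519 key type and, for demonstration only, an empty passphrase:

```shell
# Create the key pair in a scratch directory for this demo; in practice, use ~/.ssh/
keydir=$(mktemp -d)
# -t ed25519: key type; -N '': empty passphrase (use a real one in practice); -C: a comment
ssh-keygen -q -t ed25519 -f "$keydir/id_ed25519" -N '' -C "my-laptop"
# Two files are produced: the private key and the matching .pub public key
ls "$keydir"
```

Ed25519 keys are short and fast; if a service requires RSA, replace -t ed25519 with -t rsa.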

Never share your private key; it is like giving away your credit card PIN code.

Services like GitHub may request your public key. You can copy the contents of the public key file to your clipboard (e.g., using the cat command). You can manually append your public key to the remote server's ~/.ssh/authorized_keys file, or you can use the ssh-copy-id command.

That’s a basic explanation of how to use SSH keys for authentication. Keep in mind that the exact steps might vary depending on your operating system and software versions, so be sure to check the documentation specific to your system.

Using the resources and software environment (modules)#

The MPCDF uses a module system to adapt the user environment for working with software installed at various locations in the file system, or for switching between different software versions. Users need to explicitly specify the full version of the compiler and MPI modules during compilation and in batch scripts to ensure the compatibility of the MPI library.

Due to the hierarchical module environment, many libraries and applications only appear after loading a compiler, and subsequently also an MPI module (the order is important here: first, load the compiler, then load the MPI). These modules provide libraries and software consistently compiled with/for the user-selected combination of compiler and MPI library.

To search the full hierarchy, use the find-module command. All fftw-mpi modules, for example, can be found with find-module fftw-mpi.

  • module avail to list the available software packages on the HPC system. Note that you can search for a certain module by using the find-module tool (see above).

  • module load package_name/version to load a software package at a specific version.
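A typical session on the hierarchical module system then looks like the following (the compiler and MPI versions are taken from the batch examples later in this chapter; check module avail for the versions on your cluster):

```
> module purge
> module load intel/21.2.0     # first, the compiler...
> module load impi/2021.2      # ...then the MPI library
> module avail fftw-mpi        # MPI-dependent modules are now visible
> module load fftw-mpi
```

Loading in this order ensures the libraries you pick up were compiled for that exact compiler/MPI combination.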

Information on the software packages provided by the MPCDF is available here.

Note

The module system is also available on the MPIA clusters (but the list of available libraries differs from the MPCDF's).

Scheduler and batch systems#

What is a batch system?#

A batch system is the piece of software that deals with collecting job instructions and allocating or scheduling resources to run the tasks. It allows users to submit workloads in the form of jobs, which are then executed automatically and asynchronously on available computing resources. Batch systems generally work by managing a queue of pending jobs and allocating resources based on policies and priorities set by administrators. They also provide features for monitoring, job accounting, and reporting. Batch systems help maximize the utilization of computing resources and improve the efficiency of HPC environments by allowing multiple jobs to run concurrently and by efficiently allocating resources to jobs based on their needs.

Common batch systems#

There are several commonly used batch systems on HPCs, each with its own strengths and weaknesses. Some of the most popular batch systems include:

  • Slurm: This is a widely used open-source batch system that provides excellent scalability and job management capabilities.

  • PBS/Torque: PBS (Portable Batch System) and its open-source variant Torque are widely used in HPC and scientific computing environments.

  • LSF: IBM’s Load Sharing Facility is a commercial batch system that offers good scalability and scheduling policies.

  • SGE/OpenGridScheduler: Sun Grid Engine (SGE) is an open-source batch system that is widely used in the academic and research community.

  • HTCondor: HTCondor is an open-source job scheduling system that is used primarily in high-throughput computing environments.

There are many other batch systems available as well, and the choice of which batch system to use will typically depend on the specific needs and requirements of the organization or research group.

SLURM#

The batch system on the MPCDF & MPIA clusters is the open-source workload manager Slurm (Simple Linux Utility for Resource Management). To run test or production jobs, submit a job script (see below) to Slurm, which will find and allocate the resources required for your job (e.g. the compute nodes to run your job on).

By default, the number of jobs you can run at once is limited. For instance, on Raven it is set to 8, and the default job submission limit is 300. If your batch jobs cannot run independently from each other, you need to use job steps.
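A sketch of job steps: several srun invocations inside a single batch job run as separate, sequential steps that share the job's allocation (the program names are placeholders):

```
#!/bin/bash -l
#SBATCH -o ./job.out.%j
#SBATCH -J test_steps
#SBATCH --nodes=1
#SBATCH --ntasks=72
#SBATCH --time=02:00:00

# Each srun call is a job step, accounted for separately by Slurm.
srun ./preprocess     # step 0
srun ./solve          # step 1: starts after step 0 completes
srun ./postprocess    # step 2
```

Because the steps live inside one job, they count as a single entry against your running-job limit.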

There are mainly two types of batch jobs:

  • Exclusive, where all resources on the nodes are allocated to the job

  • Shared, where several jobs share the resources of one node. In this case, the number of CPUs and the amount of memory must be specified for each job.

You can find the MPCDF introduction on their documentation page.

Important

  • Job script structure: scripts generally follow the same pattern: a bash script with a series of #SBATCH option declarations, followed by loading the necessary modules and libraries, and finally the job commands.

  • Bash shebang option -l (#!/bin/bash -l): The -l option makes “bash act as if it had been invoked as a login shell”. Login shells read certain initialization files from your home directory, such as .bash_profile, so your usual environment is set up before the job runs.

  • Wall clock limit (--time): Set a realistic estimate of how long your code will run instead of requesting the maximum. This not only ensures you have an idea of the runtime, but can also help the scheduler move your job up the queue!

  • srun for parallel processing: as soon as your code uses multiple cores, run it with srun so the system uses the resources optimally.

Examples of SLURM scripts#

  • MPI batch job without hyperthreading

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job.out.%j
#SBATCH -e ./job.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=72
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

# Run the program:
srun ./myprog > prog.out
  • Using SLURM variables to set OpenMP

#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./job_hybrid.out.%j
#SBATCH -e ./job_hybrid.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J test_slurm
#
# Number of nodes and MPI tasks per node:
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
# for OpenMP:
#SBATCH --cpus-per-task=18
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit (max. is 24 hours):
#SBATCH --time=12:00:00

# Load compiler and MPI modules (must be the same as used for compiling the code)
module purge
module load intel/21.2.0 impi/2021.2

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# For pinning threads correctly:
export OMP_PLACES=cores

# Run the program:
srun ./myprog > prog.out
  • An example from my work

#!/bin/bash -l
#
# This script submits a job array to a SLURM queue
#
# Example Usage:
#
# 	> sbatch --array=1-10:1 slurm.script.array
#
# This should submit (in principle) <CMD> 1 then <CMD> 2 ...
#
#------------------- SLURM --------------------
#
#SBATCH --job-name=bob_template
#SBATCH --chdir=./
#SBATCH --export=ALL
# output: {jobname}_{jobid}_{arraytask}.stdout
#SBATCH --output=logs/%x_%A_%a.stdout
#SBATCH --partition=p.48h.share
#SBATCH -t 48:0:0
#SBATCH --get-user-env
#SBATCH --exclusive=user
#SBATCH --mem=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mail-user do-not-exist-user@mpia.de
#SBATCH --mail-type=ALL
#
# -------------- Configuration ---------------

module purge
module load anaconda/3_2019.10
module load git gcc/7.2 hdf5-serial/gcc/1.8.18 mkl/2019

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MKL_HOME/lib/intel64/:${HDF5_HOME}/lib

# Set the number of cores available per process if $SLURM_CPUS_PER_TASK is
# set
if [ -n "${SLURM_CPUS_PER_TASK}" ] ; then
	export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
else
	export OMP_NUM_THREADS=1
fi

# -------------- Commands ---------------

# Checking python distribution
function python_info(){
        printf "\033[1;32m * Python Distribution:\033[0m "
	python -c "import sys; print('Python %s' % sys.version.split('\n')[0])"
	}
# python_info

# get the configuration slice
n_cpus=${SLURM_CPUS_PER_TASK}
n_per_process=1000
slice_index=$((${SLURM_ARRAY_TASK_ID} -1))
processing_start=$((${slice_index} * ${n_per_process}))
processing_end=$((${slice_index} * ${n_per_process} + ${n_per_process}))

CMD="./main_mars"
RUN="testsample"
# RUN="orionsample"
MOD="emulators/oldgrid_with_r0_w_adjacent.emulator.hdf5"
INF="data/testsample.vaex.hdf5"
# INF="data/orionsample.vaex.hdf5"
OUT="./results/${RUN}"
min=$((processing_start))
max=$((processing_end + 1))
NBURN=4000
NKEEP=500
NWALKERS=40
SAMPLES="samples/${RUN}"

CMDLINE="${CMD} -r ${RUN} -m ${MOD} -i ${INF} -o ${OUT} --from ${min} --to ${max} --nburn ${NBURN} --nkeep ${NKEEP} --nwalkers ${NWALKERS}"
echo ${CMDLINE}
${CMDLINE}