HPC introduction#

Introduction#

Contemporary astronomical research would be impossible without computational resources. From designing our instruments to creating the final figures of research manuscripts, we use a broad range of computing power and storage.

Two general trends have emerged over the last one to two decades: (1) astronomical problems require a wide range of different resources to match the widely varying needs of projects, and (2) the data volumes produced by astronomical projects are growing rapidly. These two trends are related. Let’s look at specific examples in the context of work at MPIA.

Researchers are used to tackling computational problems on dedicated desktops. As laptops have become more powerful, the need for dedicated desktops has decreased: a wide range of small-scale computing tasks can be handled perfectly well on a laptop, for instance exploring and visualizing large datasets (e.g., Gaia) or running simple to complex inference problems (e.g., calibration and data reduction).

But many research problems these days outgrow the capabilities of a desktop or laptop computer: for instance, exploring the TNG simulations to extract the merger history of a sample of galaxies, or inferring the properties of millions of stars. If you want to cross-validate a model, you need to run a training/testing workflow multiple times (e.g., K-fold or leave-one-out), and each run could take hours. Since each run is independent of all others, given enough computers it is theoretically possible to run them all simultaneously (in parallel). The limitation is often the timescale rather than feasibility: a standard laptop could cross-validate any model, but the task could take months.

If you want to implement state-of-the-art physical models, the calculations become more complex and expensive, such as going from 2D-LTE stellar atmospheres to 3D-NLTE simulations, or running extensive N-body simulations (e.g., TNG). In these research problems, the calculations in each sub-region of the simulation are largely independent of the other regions (apart from their edges). It is possible to run each region’s calculations simultaneously (in parallel) and communicate selected results to the adjacent regions as needed. Both the amount of data and the number of calculations increase significantly, and it is theoretically possible to distribute them across multiple computers communicating over a shared network.

In all these cases, having bigger computers would help.

HPCs: High-Performance Computing systems#

HPCs (High-Performance Computing systems) are powerful computing systems designed to tackle complex and large-scale computations: for instance, intensive data processing, simulation, modeling, and analysis. HPC systems comprise several interconnected high-capacity computing nodes, each with multiple processors (CPUs, GPUs, etc.), high-speed networks, and large amounts of memory and storage. Their primary goal is to process large amounts of data in parallel, sharing calculations across multiple machines using programming models such as MPI (Message Passing Interface) and OpenMP (Open Multi-Processing).
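To make the MPI programming model concrete, here is a minimal sketch using the mpi4py package (an assumption: the text above does not prescribe any particular library, and the script name is illustrative). Each process computes a partial sum and the results are combined with a collective reduction.

# Minimal sketch of the MPI model with mpi4py (assumed installed);
# run with e.g. `mpirun -n 4 python mpi_sum.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# each rank sums a strided slice of the numbers 0..999,999
partial = sum(range(rank, 1_000_000, size))
# combine the partial sums on rank 0
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(f"total = {total}")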

What is parallelism?#

In the context of scientific analysis, parallelism refers to the ability to perform multiple calculations at the same time. This can be achieved by using multiple processors or cores, or by dividing the calculations into smaller independent tasks and assigning them to different processors or cores. Parallelism can significantly speed up the time it takes to perform scientific analyses, such as simulations, data analysis, and machine learning.

Parallelism is not easy: not only does it take substantial developer time (the time it takes you to implement it), but there are also computer-time costs to breaking down problems, distributing them, and recombining the results, which often limit the returns you will see from parallelism.

Tip

Parallelism is the process of:

  1. taking a single problem,

  2. breaking it into lots of smaller problems,

  3. assigning those smaller problems to a number of processing cores that are able to operate independently, and

  4. recombining the results.
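As an illustration, here is a minimal sketch of these four steps using Python’s built-in multiprocessing module (the problem and function names are purely illustrative):

from multiprocessing import Pool

def work_on_chunk(chunk):
    # steps 2-3: each smaller problem is summed independently on its own core
    return sum(chunk)

if __name__ == "__main__":
    numbers = range(1_000_000)                      # step 1: a single problem
    chunks = [numbers[i::4] for i in range(4)]      # step 2: break it into 4 pieces
    with Pool(processes=4) as pool:
        partial_sums = pool.map(work_on_chunk, chunks)  # step 3: assign chunks to cores
    total = sum(partial_sums)                       # step 4: recombine the results
    print(total)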

Limitations of parallelisation#

There are a number of challenges associated with parallelism, such as ensuring that the different processors or cores can communicate with each other efficiently, and avoiding race conditions, where the result depends on the unpredictable order in which concurrent operations happen to execute. However, the benefits of parallelism can outweigh the challenges, and it is an increasingly important tool for scientific analysis.

Amdahl’s law#

Amdahl’s Law describes the theoretical limits of parallelization. If \(P\) is the proportion of your algorithm that can be parallelized, and \(N\) is the number of cores available, the fundamental limit to the speed-up you can get from parallelization is given by

$$
\begin{aligned}
\textrm{total speed up} &= \frac{T_1}{T_N} \\
&= \frac{1}{\frac{P}{N} + (1 - P)} \\
&\leq \frac{1}{1-P},
\end{aligned}
$$

where \(T_1\) and \(T_N\) are the execution times on one and \(N\) cores, respectively.

This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth. If these resources do not scale with the number of processors, then merely adding processors provides even lower returns.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axisartist.axislines import AxesZero

P = np.array([0.2, 0.4, 0.6, 0.8, 0.95])
N = np.array([2 ** k for k in range(10)])

S_N = 1 / ((1 - P[:, None]) + P[:, None] / N[None, :])

plt.figure()
plt.subplot(axes_class=AxesZero)

for pk, snk in zip(P, S_N):
    color = plt.plot(N, snk, 'o-')[0].get_color()
    plt.text(N[-1] * 1.1, snk[-1], f'P={pk:0.1f}', va='center', color=color)

ax = plt.gca()
for direction in ["bottom", "left"]:
    # add arrows at the ends of the bottom and left axes
    ax.axis[direction].set_axisline_style("->")
for direction in ["top", "right"]:
    # hide the top and right axes
    ax.axis[direction].set_visible(False)
plt.title("Amdahl's Law".upper(), fontsize='large', fontweight='bold')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('N cores')
plt.ylabel('Speedup');
[Figure: Amdahl's Law, speed-up vs. number of cores N for parallel fractions P = 0.2, 0.4, 0.6, 0.8, and 0.95.]

For example, if 80% of a task can be parallelized (the P=0.8 curve above) and there are 4 processors, the speed-up is 1/(0.2 + 0.8/4) = 2.5, i.e., the task completes in 40% of the single-processor time. No matter how many processors are added, the speed-up can never exceed 1/(1 - 0.8) = 5. Amdahl’s law thus shows that there is a limit to the speed-up that can be achieved, no matter how many processors are used, because some part of the task must always be executed sequentially.
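A quick numerical check of these numbers (the helper function below is not part of the text above, just an illustration of the formula):

def amdahl_speedup(P, N):
    """Amdahl's law: speed-up for a parallel fraction P on N cores."""
    return 1.0 / ((1.0 - P) + P / N)

print(amdahl_speedup(0.8, 4))    # 2.5
print(amdahl_speedup(0.8, 1e9))  # approaches 1 / (1 - 0.8) = 5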

Amdahl’s law is a fundamental result of parallel computing, and it is important to understand it when designing or optimizing parallel algorithms.

Strategies: what to parallelize?#

There are many different strategies for parallelizing code. The best strategy will depend on the specific code and the hardware it is running on. However, some general strategies that can be used to parallelize code include:

It is important to identify the bottleneck of your analysis, but also which steps can run independently from one another.

  • Identifying independent tasks. The first step in parallelizing code is to identify independent tasks. Independent tasks are tasks that can be executed in parallel without affecting each other. Once independent tasks have been identified, they can be assigned to different processors or cores.

  • Using a parallel programming model. There are many different parallel programming models available; the choice will depend on the specific code and the hardware it is running on. Popular models include OpenMP and MPI. Each has pros and cons, and adopting one may require significant changes to the primary code.

  • Optimizing the code. Once the code is parallel, performance can still suffer from sub-optimal choices. Common points to keep in mind include: reducing communication between processors, reducing synchronization between processors, and using efficient data structures (see the sketch below).
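As one example of reducing communication, multiprocessing.Pool.map accepts a chunksize argument that controls how many items are shipped to a worker per message (the function below is purely illustrative):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # larger chunks mean fewer (but bigger) messages between the
        # main process and the workers, reducing communication overhead
        results = pool.map(square, range(100_000), chunksize=1_000)
    print(sum(results))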

Parallelizing code can be a challenging task, but it can also be a rewarding one.

Tip

A general rule of thumb:

you’ll achieve a larger parallel fraction if you distribute the outermost loops (see the sketch below).
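For instance, in a nested loop, distributing the outer loop gives each worker a large chunk of work and keeps the inner loop sequential inside each task (a hypothetical example; the function name is illustrative):

from multiprocessing import Pool

def process_row(i, ncols=1_000):
    # the inner loop over columns stays sequential inside each task
    return sum(i * j for j in range(ncols))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # distribute the outer loop over the rows
        results = pool.map(process_row, range(10_000))
    print(len(results))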

Backends: Multi-Threading vs. Multi-Processing#

There are fundamentally only two approaches to parallelizing code:

  • Multi-Threading: running tasks on a set of cores that share information by accessing the same memory (multi-core computation on a single machine).

  • Multi-Processing: running tasks across multiple processes, which have to communicate data explicitly to share information, e.g., across multiple physical CPUs or computers.

Note

Most modern CPUs support hyperthreading (simultaneous multi-threading). Confusingly, the way hyperthreading manifests is by telling your operating system that you have two logical cores for every physical core on your computer.

import os
import psutil

print("Number of logical CPUs: ", os.cpu_count())
print("Number of physical CPUs: ", psutil.cpu_count(logical=False))
Number of logical CPUs:  2
Number of physical CPUs:  1

So, how should you think about this?

  • In most contexts (if you’re not shopping for hardware), expect the terms “cores” and “processors” to both be used to refer to independent processing cores.

  • Don’t worry about whether they’re distributed across two different physical chips or not; all that matters is the number of cores on your computer. In most of our science projects, we do not optimize to this level.

  • On most computers, expect the operating system to think that you have twice as many cores (sometimes referred to as “logical cores”) as you have actual “physical cores”, but expect your performance to reflect the number of physical cores you actually have.

  • The main distinction that matters today is the boundary of shared memory: whether or not your workers can access the same memory.

Multi-processing versus Multi-threading#

Multiprocessing is nice because (a) you can use it on big server clusters where different processes may run on different computers, and (b) it’s quite safe. As a result, most parallelism you’re likely to encounter will be multi-processing.

In multi-threading, all the code being run (each sequence execution of which is called a “thread”) exists within a single process, meaning that all the different threads have access to the same objects in memory. This massively reduces duplication because you don’t have to copy your code and data and pass it around – all the threads can see the same parts of memory.
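For example, threads can all read the same in-memory array without copying it, whereas a process pool would have to transfer (or otherwise share) the data with each worker. A minimal sketch, with an illustrative array and function:

import numpy as np
from multiprocessing.pool import ThreadPool

data = np.random.rand(4, 1_000_000)   # one large array held once in memory

def block_sum(i):
    # every thread reads the same array directly: no copies, no communication
    return data[i].sum()

with ThreadPool(4) as pool:
    print(pool.map(block_sum, range(4)))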

But multi-threading has three major shortcomings:

  • It is very easy to run into subtle but profound problems that lead to corrupted results (the biggest of which are race conditions).

  • Multi-threading can only distribute a job over the cores on a single computer, meaning it can’t be used to distribute a job over hundreds of cores in a large computing cluster.

  • You generally can’t use multi-threaded parallelism in Python because of a fundamental component of its architecture (the Global Interpreter Lock, or GIL).

Note

Python 3.12 and later versions should provide improvements to multi-threading and gradually reduce the GIL limitations.

Warning

Parallel execution requires transferring code between different processes. This often means having your source code in a file instead of in memory. Some parallel libraries can handle in-memory code and variables transparently, but not all of them.

%%file paracode.py

import random
import time
from multiprocessing import Pool as mp_Pool
from multiprocessing.pool import ThreadPool as mt_Pool

random.seed()

def doWork(N: int) -> int:
    """ Sum N random integers from 0 to 10  """
    finalSum = 0
    for _ in range(N):
        finalSum += random.randint(0, 10)
    return finalSum


def main_mp(N, Ncpus):
    # split the work evenly over the requested number of workers
    vals = (N // Ncpus for _ in range(Ncpus))
    # create a process Pool with Ncpus processes
    pool = mp_Pool(processes=Ncpus)
    # mark starting time
    startTime = time.time()
    # map doWork to the available Pool processes
    results = pool.map(doWork, vals)
    # sum the partial results to get the final result
    finalSum = sum(results)
    # mark the end time
    endTime = time.time()
    # close the pool
    pool.close()
    # calculate the total time it took to complete the work
    workTime = endTime - startTime
    return workTime, finalSum

def main_mt(N, Ncpus):
    # split the work evenly over the requested number of workers
    vals = (N // Ncpus for _ in range(Ncpus))
    # create a thread Pool with Ncpus threads
    pool = mt_Pool(processes=Ncpus)
    # mark starting time
    startTime = time.time()
    # map doWork to the available Pool threads
    results = pool.map(doWork, vals)
    # sum the partial results to get the final result
    finalSum = sum(results)
    # mark the end time
    endTime = time.time()
    # close the pool
    pool.close()
    # calculate the total time it took to complete the work
    workTime = endTime - startTime
    return workTime, finalSum
Writing paracode.py
import paracode
import os
N = 1_000_000
Ncpus = os.cpu_count()
workTime_mp, finalSum_mp = paracode.main_mp(N, Ncpus)
workTime_mt, finalSum_mt = paracode.main_mt(N, Ncpus)

print(f"""
MultiProcessing Running N={N:,d} over Ncpus={Ncpus:,d}
    The job took {workTime_mp:0.5f} seconds to complete
Multithreading Running N={N:,d} over Ncpus={Ncpus:d}
    The job took {workTime_mt:0.5f} seconds to complete""")
MultiProcessing Running N=1,000,000 over Ncpus=2
    The job took 0.63148 seconds to complete
Multithreading Running N=1,000,000 over Ncpus=2
    The job took 0.65806 seconds to complete
ncpus = psutil.cpu_count(logical=False)
ncores = os.cpu_count()
n = 300_000

ncpus_ = range(1, ncores + 4)
ntimes = range(3)
mp_times = [[paracode.main_mp(n, int(k))[0] for k in ncpus_] for _ in ntimes]
mt_times = [[paracode.main_mt(n, int(k))[0] for k in ncpus_] for _ in ntimes]
_n = np.array(ncpus_, dtype=float)
plt.plot(_n, np.mean(mp_times, axis=0)[0] / _n, 'k:', lw=2, label='ideal')
plt.text(max(_n) + 0.1, mp_times[0][-1] / 9. , 'ideal', color='k')
for _mp_times in mp_times:
    plt.plot(_n, _mp_times, '-', lw=2, label='MP', color='C0', alpha=0.5)
for _mt_times in mt_times:
    plt.plot(_n, _mt_times, '-', lw=2, label='MT', color='C1', alpha=0.5)
plt.text(max(_n + 0.1), np.mean(mp_times, axis=0)[-1], 'MP', color='C0')
plt.text(max(_n + 0.1), np.mean(mt_times, axis=0)[-1], 'MT', color='C1')
plt.xlabel('Number of logical cores')
plt.ylabel('Time in seconds')
plt.vlines([ncpus, ncores], 0, plt.ylim()[1], color='0.5', linestyles='dashed')
plt.text(ncores * 1.02, plt.ylim()[1] * 0.8,
         "system's logical cores", color='0.5', rotation=90,
         va='top', ha='left')
plt.text(ncpus * 1.02, plt.ylim()[1] * 0.8,
         "system's physical cores", color='0.5', rotation=90,
         va='top', ha='left')
plt.xlim(0.5, max(_n + 1));
[Figure: wall-clock time vs. number of workers for the multiprocessing (MP) and multithreading (MT) runs, compared to the ideal 1/N scaling; dashed vertical lines mark the system's physical and logical core counts.]

Explore what happens if you change n or the number of repetitions. In particular, what happens if n < 10_000?

# clean up
os.remove('paracode.py')