Introduction to HPC Systems
- High Performance Computing (HPC) combines many powerful computers (nodes) into clusters that work together to solve large or complex computational problems faster than a personal computer.
- HPC is essential when problems are too big, data too large, or computations too slow for a single machine.
- HPC facilities in the UK are organised into tiers, with the largest systems sitting in the higher tiers of the hierarchy. The University of Southampton’s HPC system is a local Tier 3 facility, and you can apply for access to use it.
Accessing and Using HPC Resources
Access to Iridis 6 and Iridis X requires applying for an account via a short online form, including project details and computing needs.
Iridis On Demand provides a web-based interface for accessing the HPC system, managing files, submitting jobs, and running interactive applications like Jupyter notebooks.
Connections to HPC systems are made securely using SSH (Secure Shell), typically through a public-private key pair for secure authentication.
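A typical key-pair setup looks like the sketch below. The hostname and username are placeholders, not the actual Iridis addresses; check the local documentation for the real login node names.

```shell
# Generate a key pair: the private key stays on your machine,
# the public key is what you share with the HPC system.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_hpc

# Copy the public key to the cluster (hostname/username are illustrative).
ssh-copy-id -i ~/.ssh/id_ed25519_hpc.pub username@hpc.example.ac.uk

# Log in using the key instead of a password.
ssh -i ~/.ssh/id_ed25519_hpc username@hpc.example.ac.uk
```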
Data transfer to and from HPC systems can be done with SSH-based tools such as sftp or rsync, or through GUI tools like FileZilla or Iridis On Demand.
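For example, rsync can push and pull whole directories over SSH, skipping files that are already up to date. Paths and the hostname below are placeholders.

```shell
# Push a local data directory to the cluster.
rsync -avz ./data/ username@hpc.example.ac.uk:/home/username/data/

# Pull results back; --partial lets an interrupted transfer resume.
rsync -avz --partial username@hpc.example.ac.uk:/home/username/results/ ./results/
```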
Software on HPC systems is managed using Environment Modules, which allow users to load, unload, and switch between software packages and versions.
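A typical Environment Modules session looks like this (the module name and version are illustrative; use `module avail` to see what your system actually provides):

```shell
module avail             # list software available on the system
module load gcc/12.1.0   # load a specific compiler version
module list              # show currently loaded modules
module unload gcc/12.1.0 # unload one module
module purge             # unload everything
```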
Help and documentation are available through the HPC Community Wiki and the HPC Community on Teams.
There is a team of Research Software Engineers in the Research Software Group whose role is to help researchers port and optimise code for use on HPC systems.
Introduction to Job Scheduling
- The job scheduler (like Slurm) manages all user jobs to ensure fair and efficient use of the cluster.
- Login nodes are for light tasks (editing, compiling); compute nodes are for running scheduled, intensive jobs.
- Use `sinfo` and `scontrol` to query the status of partitions (queues) and nodes.
- A job script is a Bash script containing `#SBATCH` directives (resource requests) and the commands to be run.
- Use `sbatch` to submit a job, `squeue` to monitor its status, and `scancel` to cancel it.
- Use `sinteractive` to request a live terminal session on a compute node for debugging or interactive work.
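A minimal job script might look like the sketch below; the resource values and module name are illustrative, and the partitions and defaults on your system will differ.

```shell
#!/bin/bash
#SBATCH --job-name=my_job   # name shown in the queue
#SBATCH --ntasks=4          # number of tasks (e.g. MPI processes)
#SBATCH --time=00:10:00     # walltime limit (HH:MM:SS)
#SBATCH --mem=4G            # memory request

module load gcc             # load the software the job needs
./my_program                # the actual work
```

Submit it with `sbatch job.sh`, watch it with `squeue -u $USER`, and cancel it with `scancel <jobid>`.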
Introduction to Programmatic Parallelism
- Parallelisation speeds up computation by dividing work across multiple processing units.
- Processes have private memory and must exchange information explicitly (distributed memory, e.g. MPI).
- Threads share memory within a process and require synchronisation to prevent race conditions.
- Shared memory parallelisation is simpler but limited in scale. Distributed memory scales better, but is more complex.
Landscape of HPC Technologies
- Any language can be used for HPC; however, compiled languages like C, C++ and Fortran are typically used for performance-critical code. Python is often used on HPC via libraries that are built on compiled languages.
- OpenMP is a shared-memory model, using compiler directives (`#pragma omp`) to easily parallelise code (often loops) to run on the CPU cores of a single node.
- MPI is a distributed-memory model, using a library of functions (`MPI_Init`, `MPI_Send`, etc.) to manage explicit communication between processes. It is complex but can scale across many nodes.
- GPUs are “many-core” processors ideal for massive, simple, parallel tasks (like matrix maths). Using them requires copying data between the CPU (host) and GPU (device).
- OpenACC uses compiler directives (`#pragma acc`) to offload work to a GPU, automating parallelisation and data transfers.
- CUDA is a complex, explicit programming model for NVIDIA GPUs. It requires you to write “kernels” and manually manage memory (`cudaMalloc`, `cudaMemcpy`) but offers the highest control and performance.
Measuring and Improving Parallel Performance
- Understanding scalability is crucial for using HPC resources efficiently and avoiding waste.
- Scalability measures how efficiently code uses additional resources for a fixed problem (strong scaling) or a proportionally growing problem (weak scaling).
- A program’s scalability is limited by bottlenecks, such as serial code (Amdahl’s Law), communication overhead, I/O, and load imbalance.
- Premature optimisation adds complexity and risks introducing bugs.
- Always profile code first to identify the actual performance bottlenecks before attempting optimisation.
- Establish a robust test plan to verify that any optimisations do not alter the correctness of the results.