Introduction to HPC Systems


  • High Performance Computing (HPC) combines many powerful computers (nodes) into clusters that work together to solve large or complex computational problems faster than a personal computer.
  • HPC is essential when problems are too big, data too large, or computations too slow for a single machine.
  • HPC facilities in the UK are organised into tiers, from the largest national systems (Tier 1) down to local institutional systems (Tier 3). The University of Southampton’s HPC system is a local Tier 3 facility, and you can apply for access to use it.

Accessing and Using HPC Resources


  • Access to Iridis 6 and Iridis X requires applying for an account via a short online form, including project details and computing needs.

  • Iridis On Demand provides a web-based interface for accessing the HPC system, managing files, submitting jobs, and running interactive applications like Jupyter notebooks.

  • Connections to HPC systems are made securely using SSH (Secure Shell), typically through a public-private key pair for secure authentication.

  • Data transfer to and from HPC systems can be done using SSH-based command-line tools such as sftp or rsync, or through GUI tools such as FileZilla or Iridis On Demand.

  • Software on HPC systems is managed using Environment Modules, which allow users to load, unload, and switch between software packages and versions.

  • Help and documentation are available through the HPC Community Wiki and the HPC Community on Teams.

  • There is a team of Research Software Engineers in the Research Software Group whose role is to help researchers port and optimise code for use on HPC systems.

Introduction to Job Scheduling


  • The job scheduler (like Slurm) manages all user jobs to ensure fair and efficient use of the cluster.
  • Login nodes are for light tasks (editing, compiling); compute nodes are for running scheduled, intensive jobs.
  • Use sinfo and scontrol to query the status of partitions (queues) and nodes.
  • A job script is a Bash script containing #SBATCH directives (resource requests) and the commands to be run.
  • Use sbatch to submit a job, squeue to monitor its status, and scancel to cancel it.
  • Use sinteractive to request a live terminal session on a compute node for debugging or interactive work.
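
As a concrete sketch of the job script described above, the snippet below assembles a minimal Slurm batch script from Python. The partition name, resource numbers, and program name are hypothetical placeholders; check sinfo on your cluster for the real queue names.

```python
# Sketch: build a minimal Slurm job script as text. The partition name
# ("batch"), resource numbers, and "./my_program" are hypothetical.

def make_job_script(job_name, ntasks, time_limit, command):
    """Return the text of a Bash job script with #SBATCH directives."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",      # number of tasks requested
        f"#SBATCH --time={time_limit}",    # walltime limit (HH:MM:SS)
        "#SBATCH --partition=batch",       # hypothetical queue name
        "",
        command,                           # the work the job actually runs
        "",
    ])

script = make_job_script("my_sim", 4, "00:10:00", "srun ./my_program")
print(script)
```

On the cluster you would save this text to a file and submit it with sbatch, then monitor it with squeue.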

Introduction to Programmatic Parallelism


  • Parallelisation speeds up computation by dividing work across multiple processing units.
  • Processes use private memory and communicate information explicitly between them (distributed memory, e.g. MPI).
  • Threads share memory within a process and require synchronisation to prevent race conditions.
  • Shared memory parallelisation is simpler but limited in scale. Distributed memory scales better, but is more complex.
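
The race-condition point above can be shown with a short, self-contained threading example: several threads update one shared counter, and a lock serialises the read-modify-write step so no increments are lost.

```python
import threading

# Threads share the process's memory, so unsynchronised updates to a
# shared variable can race. The lock makes each increment atomic.
counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:          # without this, increments can be lost
            counter += 1

threads = [threading.Thread(target=add_many, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 with the lock; often less without it
```

By contrast, processes (e.g. under MPI) each have their own copy of `counter` and would have to exchange values explicitly.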

Landscape of HPC Technologies


  • Any language can be used for HPC; however, compiled languages like C, C++ and Fortran are typically used for performance-critical code. Python is often used on HPC via libraries that are themselves built on compiled languages.
  • OpenMP is a shared-memory model, using compiler directives (#pragma omp) to easily parallelise code (often loops) to run on the CPU cores of a single node.
  • MPI is a distributed-memory model, using a library of functions (MPI_Init, MPI_Send, etc.) to manage explicit communication between processes. It is complex but can scale across many nodes.
  • GPUs are “many-core” processors ideal for massive, simple, parallel tasks (like matrix maths). Using them requires copying data between the CPU (host) and GPU (device).
  • OpenACC uses compiler directives (#pragma acc) to offload work to a GPU, automating parallelisation and data transfers.
  • CUDA is a complex, explicit programming model for NVIDIA GPUs. It requires you to write “kernels” and manually manage memory (cudaMalloc, cudaMemcpy) but offers the highest control and performance.
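
The first bullet above, Python delegating heavy work to compiled libraries, is easy to illustrate with NumPy: the matrix product below runs in an optimised compiled (often multi-threaded) BLAS routine, not in the Python interpreter.

```python
import numpy as np

# The @ operator dispatches to compiled BLAS code; the Python layer
# only sets up the arrays and collects the result.
a = np.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
b = np.arange(6.0).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

c = a @ b                          # matrix product in compiled code
print(c)                           # [[10. 13.] [28. 40.]]
```

The same pattern underlies most scientific Python on HPC: interpreted glue code around compiled numerical kernels.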

Measuring and improving parallel performance


  • Understanding scalability is crucial for using HPC resources efficiently and avoiding waste.
  • Scalability measures how efficiently code uses additional resources for a fixed problem (strong scaling) or a proportionally growing problem (weak scaling).
  • A program’s scalability is limited by bottlenecks, such as serial code (Amdahl’s Law), communication overhead, I/O, and load imbalance.
  • Premature optimisation adds complexity and risks introducing bugs.
  • Always profile code first to identify the actual performance bottlenecks before attempting optimisation.
  • Establish a robust test plan to verify that any optimisations do not alter the correctness of the results.
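
Amdahl's Law, mentioned above as a bottleneck on strong scaling, can be made concrete: if a fraction s of the runtime is serial, the best possible speedup on p processors is 1 / (s + (1 - s) / p), which saturates at 1/s no matter how many processors are added.

```python
# Amdahl's Law: predicted strong-scaling speedup for a program whose
# serial (non-parallelisable) runtime fraction is s, run on p processors.

def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# Even a 5% serial fraction caps the speedup well below p:
for p in (1, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
# As p grows, the speedup approaches 1/s = 20 and no further.
```

This is why profiling matters: shrinking the serial fraction often pays off more than adding nodes.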