Introduction to HPC Job Scheduling: Glossary

Key Points

Introduction to HPC Job Scheduling
  • A job scheduler ensures jobs are given the resources they need, and manages when and where jobs will run on an HPC resource.

  • Obtain your DiRAC account details to use for job submission from DiRAC’s SAFE website.

  • Use sinfo -s to see information on all queues on a Slurm system.

  • Use scontrol show partition to see more detail on particular queues.

  • Use sbatch and squeue to submit jobs and query their status.
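
    Taken together, these commands form a typical workflow. A minimal sketch (the partition name `cosma7` and script name `myjob.sh` are illustrative placeholders, not values from this lesson):

    ```shell
    # Summarise all partitions (queues): availability, time limits, node counts
    sinfo -s

    # Show full details (limits, node list, state) for one partition;
    # "cosma7" is a hypothetical partition name
    scontrol show partition cosma7

    # Submit a batch script; Slurm replies with the new job's ID
    sbatch myjob.sh

    # Check the status of our own pending and running jobs
    squeue -u $USER
    ```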

  • Access to DiRAC’s resources is managed through the STFC’s independent Resource Allocation Committee (RAC).

  • Facility time may be requested through a number of mechanisms, namely in response to a Call for Proposals, the Director’s Discretionary Award, and Seedcorn Time.

  • The DiRAC website has further information on the methods to request access, as well as application forms.

Job Submission and Management
  • Use --nodes and --ntasks in a Slurm batch script to request the number of compute nodes and the total number of tasks (processes) you’d like for your job.
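
    As an illustrative batch script (the account, partition, and program names are placeholders to adapt for your own system):

    ```shell
    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --nodes=2              # number of compute nodes
    #SBATCH --ntasks=32            # total number of tasks (processes) across all nodes
    #SBATCH --time=00:10:00        # wall-clock time limit
    #SBATCH --account=myproject    # hypothetical project account
    #SBATCH --partition=compute    # hypothetical partition name

    srun ./my_program              # launch the requested tasks
    ```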

  • A typical job passes through pending, running, completing, and completed job states.

  • Reasons for a job’s failure include running out of memory, being suspended or cancelled, or exiting with a non-zero exit code.

  • Determining backfill windows allows us to work out when we may make use of otherwise idle resources.
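
    One way to probe this in Slurm is to ask for estimated start times; for example (`myjob.sh` is a placeholder script name, and the estimates vary with system load):

    ```shell
    # Estimate, without actually submitting, when a script's job would start
    sbatch --test-only myjob.sh

    # Show Slurm's estimated start times for our pending jobs
    squeue -u $USER --start
    ```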

  • Use scontrol to present detailed information concerning a submitted job.

  • Use sstat to query the used resources for an actively running job.

  • Use sacct to query the resource accounting information for a completed job.
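
    For example, for a hypothetical job ID of 12345 (the format fields shown are a common selection, not an exhaustive list):

    ```shell
    # Detailed scheduler information for a submitted job
    scontrol show job 12345

    # Live resource usage (e.g. memory high-water mark) for a running job
    sstat -j 12345 --format=JobID,MaxRSS,AveCPU

    # Accounting record after completion: state, exit code, elapsed time, memory
    sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS
    ```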

Accessing System Resources using Modules
  • HPC systems use a module loading/unloading system to provide access to software.

  • To see the available modules on a system, we use module avail.

  • The software installed across the DiRAC sites can be different in terms of what’s installed and the versions that are available.

  • module list will show us which modules we currently have loaded.

  • We use module load and module unload to grant and remove access to modules on the system.

  • We should only keep loaded those modules we actively wish to use, and try to avoid loading multiple versions of the same software.
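
    A typical module session might look like the following (the `gcc/10.2.0` version string is a hypothetical example; available versions differ between DiRAC sites):

    ```shell
    module avail            # list the software available on this system
    module list             # show which modules we currently have loaded
    module load gcc/10.2.0  # load a specific version of a package
    module unload gcc       # remove access again when no longer needed
    ```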

Using Different Job Types
  • Login nodes are shared with all other users, so be careful not to run computationally intensive programs on them.

  • OpenMP allows programmers to use compiler directives to mark sections of code for execution in parallel threads on a single compute node.

  • MPI programs make use of message-passing between multiple processes to coordinate communication.

  • OpenMP uses a shared memory model, whilst MPI uses a distributed memory model.
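
    The two models are also launched differently. A hedged sketch (the executable names are placeholders):

    ```shell
    # OpenMP: one process with many threads on one node;
    # the thread count is typically set via an environment variable
    export OMP_NUM_THREADS=4
    ./omp_program                    # hypothetical OpenMP executable

    # MPI: many communicating processes, possibly spread across nodes;
    # under Slurm, srun launches one process per requested task
    srun --ntasks=8 ./mpi_program    # hypothetical MPI executable
    ```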

  • An array job is a self-contained set of multiple jobs - known as tasks - managed as a single job.
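
    A minimal array job sketch, assuming one input file per task (the program and file names are hypothetical):

    ```shell
    #!/bin/bash
    #SBATCH --array=1-10               # run tasks with indices 1..10
    #SBATCH --output=task_%A_%a.out    # %A = job ID, %a = array task index

    # Each task sees its own index in SLURM_ARRAY_TASK_ID,
    # here used to select a different input file per task
    ./process_input input_${SLURM_ARRAY_TASK_ID}.dat
    ```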

  • Interactive jobs allow us to interact directly with a compute node, so are very useful for debugging and exploring the running of code in real-time.

  • Interactive jobs consume resources while an interactive session is active, so we must be careful to use them efficiently.
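
    In Slurm, an interactive session can be requested with `srun`, for example (the partition name is a placeholder; exit the shell promptly to free the resources):

    ```shell
    # Request an interactive shell on a compute node for 30 minutes
    srun --partition=compute --time=00:30:00 --pty /bin/bash
    ```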

  • Run and develop applications at a small scale first, before making use of powerful scaling techniques, to avoid potentially expensive consumption of resources.

Survey

Glossary

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item to allow other pages to hyperlink to specific glossary entries.) This renders as:

key word 1
explanation 1
key word 2
explanation 2