Introduction to HPC Job Scheduling: Glossary

Key Points

Introduction to HPC Job Scheduling
  • A job scheduler ensures jobs are given the resources they need, and manages when and where jobs will run on an HPC resource.

  • Obtain your DiRAC account details to use for job submission from DiRAC’s SAFE website.

  • Use sinfo -s to see information on all queues on a Slurm system.

  • Use scontrol show partition to see more detail on particular queues.

  • Use sbatch and squeue to submit jobs and query their status.
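
    Taken together, these commands form a typical workflow. A minimal sketch (the partition name `cosma7` and script name `myjob.sh` are illustrative placeholders, not values from this lesson):

    ```shell
    # Summarise all partitions (queues): availability, time limits, node counts
    sinfo -s

    # Show full details (limits, node list, state) for one partition;
    # "cosma7" is a hypothetical partition name
    scontrol show partition cosma7

    # Submit a batch script; Slurm replies with the new job's ID
    sbatch myjob.sh

    # Check the status of our own pending and running jobs
    squeue -u $USER
    ```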

  • Access to DiRAC’s resources is managed through the STFC’s independent Resource Allocation Committee (RAC).

  • Facility time may be requested through a number of mechanisms, namely in response to a Call for Proposals, the Director’s Discretionary Award, and Seedcorn Time.

  • The DiRAC website has further information on the methods to request access, as well as application forms.

Job Submission and Management
  • Use --nodes and --ntasks in a Slurm batch script to request the number of compute nodes and the total number of tasks (processes) you’d like for your job.
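
    As an illustrative batch script (the account, partition, and program names are placeholders to adapt for your own system):

    ```shell
    #!/bin/bash
    #SBATCH --job-name=example
    #SBATCH --nodes=2              # number of compute nodes
    #SBATCH --ntasks=32            # total number of tasks (processes) across all nodes
    #SBATCH --time=00:10:00        # wall-clock time limit
    #SBATCH --account=myproject    # hypothetical project account
    #SBATCH --partition=compute    # hypothetical partition name

    srun ./my_program              # launch the requested tasks
    ```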

  • A typical job passes through pending, running, completing, and completed job states.

  • Reasons for a job’s failure include running out of memory, being suspended or cancelled, or exiting with a non-zero exit code.

  • Determining backfill windows allows us to work out when we may make use of otherwise idle resources.
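
    One way to probe this in Slurm is to ask for estimated start times; for example (`myjob.sh` is a placeholder script name, and the estimates vary with system load):

    ```shell
    # Estimate, without actually submitting, when a script's job would start
    sbatch --test-only myjob.sh

    # Show Slurm's estimated start times for our pending jobs
    squeue -u $USER --start
    ```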

  • Use scontrol to present detailed information concerning a submitted job.

  • Use sstat to query the used resources for an actively running job.

  • Use sacct to query the resource accounting information for a completed job.
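
    For example, for a hypothetical job ID of 12345 (the format fields shown are a common selection, not an exhaustive list):

    ```shell
    # Detailed scheduler information for a submitted job
    scontrol show job 12345

    # Live resource usage (e.g. memory high-water mark) for a running job
    sstat -j 12345 --format=JobID,MaxRSS,AveCPU

    # Accounting record after completion: state, exit code, elapsed time, memory
    sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS
    ```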

Accessing System Resources using Modules
  • HPC systems use a module loading/unloading system to provide access to software.

  • To see the available modules on a system, we use module avail.

  • The software installed across the DiRAC sites can be different in terms of what’s installed and the versions that are available.

  • module list will show us which modules we currently have loaded.

  • We use module load and module unload to grant and remove access to modules on the system.

  • We should only keep loaded those modules we actively wish to use, and try to avoid loading multiple versions of the same software.
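
    A typical module session might look like the following (the `gcc/10.2.0` version string is a hypothetical example; available versions differ between DiRAC sites):

    ```shell
    module avail            # list the software available on this system
    module list             # show which modules we currently have loaded
    module load gcc/10.2.0  # load a specific version of a package
    module unload gcc       # remove access again when no longer needed
    ```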

Using Different Job Types
  • Login nodes are shared with all other users, so be careful not to run computationally intensive programs on them.

  • OpenMP allows programmers to use compiler directives to mark sections of code for execution in parallel threads on a single compute node.

  • MPI programs make use of message-passing between multiple processes to coordinate communication.

  • OpenMP uses a shared memory model, whilst MPI uses a distributed memory model.
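
    The two models are also launched differently. A hedged sketch (the executable names are placeholders):

    ```shell
    # OpenMP: one process with many threads on one node;
    # the thread count is typically set via an environment variable
    export OMP_NUM_THREADS=4
    ./omp_program                    # hypothetical OpenMP executable

    # MPI: many communicating processes, possibly spread across nodes;
    # under Slurm, srun launches one process per requested task
    srun --ntasks=8 ./mpi_program    # hypothetical MPI executable
    ```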

  • An array job is a self-contained set of multiple jobs - known as tasks - managed as a single job.
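
    A minimal array job sketch, assuming one input file per task (the program and file names are hypothetical):

    ```shell
    #!/bin/bash
    #SBATCH --array=1-10               # run tasks with indices 1..10
    #SBATCH --output=task_%A_%a.out    # %A = job ID, %a = array task index

    # Each task sees its own index in SLURM_ARRAY_TASK_ID,
    # here used to select a different input file per task
    ./process_input input_${SLURM_ARRAY_TASK_ID}.dat
    ```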

  • Interactive jobs allow us to interact directly with a compute node, so are very useful for debugging and exploring the running of code in real-time.

  • Interactive jobs consume resources while an interactive session is active, so we must be careful to use them efficiently.
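
    In Slurm, an interactive session can be requested with `srun`, for example (the partition name is a placeholder; exit the shell promptly to free the resources):

    ```shell
    # Request an interactive shell on a compute node for 30 minutes
    srun --partition=compute --time=00:30:00 --pty /bin/bash
    ```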

  • Run and develop applications at a small scale first, before making use of powerful scaling techniques, to avoid potentially expensive consumption of resources.

Survey

Glossary

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item to allow other pages to hyperlink to specific glossary entries.) This renders as:

key word 1
explanation 1
key word 2
explanation 2