Introduction to HPC Job Scheduling
A job scheduler ensures jobs are given the resources they need, and manages when and where jobs will run on an HPC resource.
Obtain the DiRAC account details you’ll use for job submission from DiRAC’s SAFE website.
Use sinfo -s to see information on all queues on a Slurm system.
Use scontrol show partition to see more detail on particular queues.
Use sbatch and squeue to submit jobs and query their status.
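As a concrete sketch of how these commands fit together (the partition name and script name are hypothetical):

```bash
# Summarise all partitions (queues) on the system
sinfo -s

# Show full details of one partition; "cosma7" is a placeholder name
scontrol show partition cosma7

# Submit a batch script, then check the job's status in the queue
sbatch myjob.sh
squeue -u $USER
```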
Access to DiRAC’s resources is managed through the STFC’s independent Resource Allocation Committee (RAC).
Facility time may be requested through a number of mechanisms, namely in response to a Call for Proposals, the Director’s Discretionary Award, and Seedcorn Time.
The DiRAC website has further information on the methods to request access, as well as application forms.
Job Submission and Management
Use --nodes and --ntasks in a Slurm batch script to request the number of machines and the total number of tasks (typically one per CPU core) you’d like for your job.
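A minimal batch script along these lines might look as follows; the partition and account names are placeholders that will differ on your system:

```bash
#!/bin/bash
#SBATCH --job-name=example     # name shown for the job in the queue
#SBATCH --nodes=2              # number of machines to allocate
#SBATCH --ntasks=8             # total tasks (CPU cores) across those nodes
#SBATCH --time=00:10:00        # wall-clock limit (HH:MM:SS)
#SBATCH --partition=cosma7     # hypothetical partition name
#SBATCH --account=myproject    # hypothetical project account

echo "Running on $SLURM_NNODES nodes with $SLURM_NTASKS tasks"
```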
A typical job passes through pending, running, completing, and completed job states.
Reasons for a job’s failure include running out of memory, being suspended or cancelled, or exiting with a non-zero exit code.
Determining backfill windows lets us identify when we can make use of otherwise idle resources.
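For example, on a Slurm system we can ask for estimated start times, which hint at where backfill windows exist:

```bash
# Report the scheduler's estimated start times for our pending jobs
squeue --start -u $USER

# Validate a script and estimate when it would start, without submitting it
sbatch --test-only myjob.sh
```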
Use scontrol to present detailed information concerning a submitted job.
Use sstat to query the used resources for an actively running job.
Use sacct to query the resource accounting information for a completed job.
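For example, with 12345 standing in for a real job ID:

```bash
# Detailed information on a submitted job (pending or running)
scontrol show job 12345

# Resource usage of the batch step of a job that is still running
sstat -j 12345.batch

# Accounting information once the job has finished
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS
```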
Accessing System Resources using Modules
HPC systems use a module loading/unloading system to provide access to software.
To see the available modules on a system, we use module avail.
The software installed across the DiRAC sites can be different in terms of what’s installed and the versions that are available.
module list will show us which modules we currently have loaded.
We use module load and module unload to grant and remove access to modules on the system.
We should keep loaded only those modules we actively need, and try to avoid loading multiple versions of the same software.
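A typical session might look like this; the module names and versions are illustrative and vary between DiRAC sites:

```bash
module avail                 # list the modules available on this system
module load gcc/12.2.0       # hypothetical compiler module and version
module list                  # confirm which modules are currently loaded
module unload gcc/12.2.0     # remove access when no longer needed
```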
Using Different Job Types
Login nodes are shared with all other users, so be careful not to run computationally intensive programs on them.
OpenMP allows programmers to specify sections of code to execute as parallel threads across the cores of a single node, using compiler directives.
MPI programs make use of message-passing between multiple processes to coordinate communication.
OpenMP uses a shared memory model, whilst MPI uses a distributed memory model.
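This difference shows up in how jobs are requested and launched. As a hedged sketch (executable names are hypothetical), the relevant lines of an OpenMP job script might be:

```bash
#!/bin/bash
#SBATCH --nodes=1           # shared memory: all threads on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8   # one core per thread
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program               # hypothetical OpenMP executable
```

whereas an MPI job script requests many tasks, potentially spread across nodes:

```bash
#!/bin/bash
#SBATCH --nodes=2           # distributed memory: tasks may span nodes
#SBATCH --ntasks=16         # one process per task
srun ./mpi_program          # hypothetical MPI executable, launched via srun
```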
An array job is a self-contained set of multiple jobs, known as tasks, managed as a single job.
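For instance, a hypothetical array of ten tasks, each handling a different input file:

```bash
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=1-10            # ten tasks, numbered 1 to 10
#SBATCH --time=00:05:00

# Each task receives its own index via SLURM_ARRAY_TASK_ID
./process_data input_${SLURM_ARRAY_TASK_ID}.dat   # hypothetical program and files
```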
Interactive jobs allow us to interact directly with a compute node, so are very useful for debugging and exploring the running of code in real-time.
Interactive jobs consume resources while an interactive session is active, so we must be careful to use them efficiently.
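Assuming a Slurm system, an interactive session can be requested like this:

```bash
# Allocate one node and open a shell on it; resources are consumed
# (and the project charged) until we exit the shell
srun --nodes=1 --ntasks=1 --time=00:30:00 --pty /bin/bash

# Alternatively, salloc grants an allocation to run commands within
salloc --nodes=1 --time=00:30:00
```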
Run and develop applications at a small scale first, before making use of powerful scaling techniques, to avoid potentially expensive consumption of resources.