Introduction to HPC Job Scheduling
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is a job scheduler and why does a cluster need one?
How do I find out what parameters to use for my Slurm job?
How do I submit a Slurm job?
What are DiRAC project allocations and how do they work?
Objectives
Describe briefly what a job scheduler does.
Recap the fundamentals of Slurm job submission and monitoring.
Use Slurm and SAFE to determine the submission parameters to use to submit a job.
Summarise the main ways researchers may request time on a DiRAC facility.
Job Scheduling: A Recap
An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
The tasks of a job scheduler can be compared to those of a waiter in a restaurant. If you can relate to an instance where you had to wait for a while in a queue to get into a popular restaurant, then you may now understand why sometimes your job does not start instantly, as it would on your laptop.
The scheduler used here is Slurm, and although Slurm is not used everywhere, running jobs is quite similar regardless of what software is being used. The exact syntax might change, but the concepts remain the same.
Fundamentals of Job Submission and Monitoring: A Recap
You may recall from a previous course, or your own experience, two basic Slurm commands: sbatch, to submit a job to an HPC resource, and squeue, to query the state of a submitted job.
When we submit a job, we typically write a Slurm script which embodies the commands we wish to run on a compute node. A job script (e.g. basic-script.sh) is typically written using the Bash shell language, and a very minimal example looks something like this (save this in a file called basic-script.sh, but don't try to submit it just yet!):
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --time=00:00:30
date
The #SBATCH lines are special comments that provide additional information about our job to Slurm;
for example the account and partition, the maximum time we expect the job to take when running,
and the number of nodes we’d like to request (in this case, just one).
We’ll look at other parameters in more detail later,
but let’s focus on specifying a correct set of minimal parameters first.
--account
It’s important to note that what you specify for the --account
parameter is not your
machine login username or SAFE login; it’s the project account to which you have access.
Project accounts are assigned an allocation of resources (such as CPU or disk space),
and to use them, you specify the project account code in the --account
parameter.
You can find the projects to which you have access in your DiRAC SAFE account. To see them, after you log in to SAFE, select Projects from the top navigation bar and select one of your projects to see further details, where you'll find the project's account code to use.
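As a hypothetical illustration, if your SAFE project page listed the account code dp004 (yours will differ), the corresponding line in the job script would be:

#SBATCH --account=dp004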
--partition
The underlying mechanism that enables the scheduling of jobs in schedulers like Slurm is the queue (or partition). A queue represents a list of (to some extent) ordered jobs to be executed on HPC compute resources, and sites often have many different queues which represent different aspects, such as the level of prioritisation for jobs or the capabilities of compute nodes. So when you submit a job, it will enter one of these queues to be executed. How these queues are set up across multi-site HPC systems such as DiRAC can differ, depending on local institutional infrastructure configurations, user needs, and site policies.
You can find out the queues available on a DiRAC site, and their current state, using:
sinfo -s
The -s
flag curtails the output to only a summary,
whereas omitting this flag provides a full listing of nodes in each queue and their current state.
You should see something like the following. This is an example from COSMA:
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cosma7 up 3-00:00:00 80/137/7/224 m[7229-7452]
cosma7-pauper up 1-00:00:00 80/137/7/224 m[7229-7452]
cosma7-shm up 3-00:00:00 0/1/0/1 mad02
cosma7-shm-pauper up 1-00:00:00 0/1/0/1 mad02
cosma7-bench up 1-00:00:00 0/0/1/1 m7452
cosma7-rp up 3-00:00:00 82/140/2/224 m[7001-7224]
cosma7-rp-pauper* up 1-00:00:00 82/140/2/224 m[7001-7224]
Here we can see the general availability of each of these queues (also known as partitions), as well as the maximum
time limit for jobs on each of these queues (in days-hours:minutes:seconds
format).
For example, on the cosma7
queue there is a 3-day limit, whilst the cosma7-pauper
queue has a 1-day limit. The *
beside the partition name indicates it is the default queue (although it’s always
good practice to specify this explicitly).
Queues on Other Sites
On other DiRAC sites, the queues displayed with sinfo will look different, for example:
- Edinburgh's Tursa: there are cpu and gpu queues, as well as other gpu-specific queues.
- Cambridge's CSD3: there are a number of queues for different CPU and GPU architectures, such as cclake, icelake, sapphire, and ampere.
- Leicester's DiAL3: there are two queues named slurm and devel, with slurm referring to the main HPC resource and devel for development or testing use (i.e. short running jobs).
Note that the queues to which you have access will depend on your allocation setup, and this may not include the default queue (for example, if you only have access to GPU resources which are accessible on their own queue, like on Tursa, you’ll need to use one of these queues instead).
To find out more information on queues, you can use the scontrol show command, which allows you to view the configuration of Slurm and its current state. So to see a breakdown of a particular queue, you can do the following (replacing <partition_name>):
scontrol show partition=<partition_name>
So on COSMA, for example:
scontrol show partition=cosma7
An example of the output, truncated for clarity:
PartitionName=cosma7
AllowGroups=cosma7 AllowAccounts=do009,dp004,dp012,dp050,dp058,dp060,dp092,dp121,dp128,dp203,dp260,dp276,dp278,dp314,ds007 AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=m[7229-7449]
...
In particular, we can see the accounts (under AllowAccounts
) that have access to this queue
(which may display ALL
depending on the queue and the system).
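Since the full output can be lengthy, it can be handy to filter it. For example, to check just which accounts are allowed on a queue, we could pipe the output through grep (a quick sketch using the COSMA queue above):

scontrol show partition=cosma7 | grep AllowAccounts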
To see a complete breakdown of all queues you can use:
scontrol show partitions
Other Parameters
Depending on your site and how your allocation is configured you may need to specify other parameters in your script.
On Tursa, for example, you may need to specify a --qos parameter in the script, which stands for quality of service and is used to constrain or modify the characteristics of a job. On other sites, a default --qos is already applied and doesn't need to be explicitly supplied.
So for example on Tursa, we can use scontrol show partition to display the allowed QoS values for a particular queue, e.g.:
scontrol show partition=cpu
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=sysadm,standard,high,short,debug,low,reservation
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=tu-c0r0n[66-71]
...
We can see under AllowQos
those that are permitted. So for example using the cpu
queue on Tursa, in the batch script you may need to add a line containing #SBATCH --qos=standard
(e.g. below the other #SBATCH
directives) for jobs to work - make a note of this!
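As a sketch, a minimal job script for Tursa's cpu queue might then look like the following (yourAccount is a placeholder for your own project account code):

#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=cpu
#SBATCH --qos=standard
#SBATCH --nodes=1
#SBATCH --time=00:00:30

date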
Submitting a Job!
Let's first check that we can submit a job to Slurm so we can verify that we have a working set of Slurm parameters. As mentioned, these will vary depending on your circumstances, such as the projects to which you have access on DiRAC, and the DiRAC site to which you are submitting jobs.
Once you've determined these, edit basic-script.sh (as shown above), substitute the correct --account and --partition values, add any additional parameters needed for your site (e.g. --qos), and then save the file. Next, submit that script using sbatch basic-script.sh. It may take some trial and error to find the correct parameters! Once you've successfully submitted the job, you should have a job identifier returned in the terminal (something like 309001).
Lastly, you can use that job identifier to query the status of your job until it's completed using squeue -j <jobid> (e.g. squeue -j 309001). Once the job is complete, you can read the job's log file, typically held in a file named slurm-<jobid>.out, which shows any printed output from the job and, depending on the HPC system, perhaps other information regarding how and where the job ran.
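For reference, the overall sequence of commands might look something like this (with <jobid> standing in for whatever job identifier sbatch returns):

sbatch basic-script.sh
squeue -j <jobid>
cat slurm-<jobid>.out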
DiRAC Project Allocations
Access to DiRAC’s resources is managed through the STFC’s independent Resource Allocation Committee (RAC), which provides access through allocations to researchers who request time on the facility. There are a number of mechanisms through which facility time can be requested:
- Call for full Proposals: the RAC puts out an annual invitation to the UK theory and modelling communities to apply for computational resources on the DiRAC HPC Facility, with applications taking the form of requests for scientific, technical, or Research Software Engineer (RSE) support time.
- Director’s Discretionary Award: from time to time the DiRAC Director invites the UK theory and modelling communities to apply for discretionary allocations of computational resources. Discretionary time can also be applied for if you find you will be using your code at a larger scale than was previously requested in a full call for proposals. Applications can be made at any time.
- Seedcorn Time: for researchers who would like to get a feel for HPC, test and benchmark codes, or see what DiRAC resources can do for you before making a full application, an application can be made for seedcorn time. Existing users may also apply for seedcorn allocations to enable code development/testing on a service that is not currently part of their project allocation. You can apply for Seedcorn Time at any time.
For more information regarding these options, and for online application forms and contact details for enquiries, see the DiRAC website.
Once the submission and its technical case have been approved, allocations are managed in 3-month chunks over the duration of the project, which may be over a period of years. Allocation usage is based primarily on core CPU hours or GPU hours. Following each 3-month allocation, project usage is reviewed and the allocation extended based on that review (for instance, any non-use of the allocation during that 3-month window is queried, with support provided to overcome any barriers to use). In addition, Research Software Engineering (RSE) support time may also be requested; RSEs can provide help with code optimisation, porting, re-factoring and performance analysis.
Key Points
A job scheduler ensures jobs are given the resources they need, and manages when and where jobs will run on an HPC resource.
Obtain your DiRAC account details to use for job submission from DiRAC’s SAFE website.
Use sinfo -s to see information on all queues on a Slurm system.
Use scontrol show partition to see more detail on particular queues.
Use sbatch and squeue to submit jobs and query their status.
Access to DiRAC's resources is managed through the STFC's independent Resource Allocation Committee (RAC).
Facility time may be requested through a number of mechanisms, namely in response to a Call for Proposals, the Director’s Discretionary Award, and Seedcorn Time.
The DiRAC website has further information on the methods to request access, as well as application forms.
Job Submission and Management
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How do I request specific resources to use for a job?
What is the life cycle of a job?
What can I do to specify how my job will run?
How can I find out the status of my running or completed jobs?
Objectives
Specify a job’s resource requirements in a Slurm batch script.
Describe the main states a job passes through from start to completion.
Cancel multiple jobs using a single command.
Describe how backfilling works and why it’s useful.
Obtain greater detail regarding the status of a completed job and its use of resources.
We’ve recapped the fundamentals of how we are able to use Slurm to submit a simple job and monitor it to completion, but in practice we’ll need more flexibility in specifying what resources we need for our jobs, how they should run, and how we manage them, so let’s look at that in more detail now.
Selecting Resources for Jobs
Selecting Available Resources
We saw earlier how to use sinfo -s
as a means to obtain a list of queues we are able to access,
but there’s also some additional useful information.
Of particular interest is the NODES
column, which gives us an overview of the state of these resources, and hence
allows us to select a queue with those resources that are sufficiently available for jobs we’d like to submit. For example, on Tursa you may see something like:
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
slurm* up 4-00:00:00 198/0/0/198 dnode[001-198]
devel up 2:00:00 198/2/0/200 dnode[001-200]
Looking at the NODES column, it indicates how many nodes are:
- A (active) - these are running jobs
- I (idle) - no jobs are running
- O (other) - these nodes are down, or otherwise unavailable
- T (total) - the total number of nodes
The NODELIST
is a summary of those nodes in a particular queue.
In this particular instance, we can see that 2 nodes are idle in the devel
queue,
so if that queue fits our needs (and we have access to it) we may decide to submit to that.
Specifying Job Resource Requirements
One thing that is absolutely critical when working on an HPC system is specifying the resources required to run a job, which allows the scheduler to find the right time and place to schedule our job. If you do not specify requirements (such as the amount of time you need), you will likely be stuck with your site’s default resources, which is probably not what you want.
When launching a job, we can specify the resources our job needs in our job script,
using #SBATCH <parameter>
. We’ve used this format before when indicating the account and queue to use.
You may have seen some of these parameters before, but let’s take a look at the most important
ones and how they relate to each other.
For these parameters, by default a task refers to a single CPU unless otherwise indicated.
- --nodes - the total number of machines or nodes to request
- --ntasks - the total number of CPU cores (across requested nodes) your job needs. Generally, this will be 1 unless you're running MPI jobs which are able to use multiple CPU cores, in which case it essentially specifies the number of MPI ranks to start across the nodes. For example, if --nodes=4 and --ntasks=8, we're requesting 4 nodes with each node having 2 CPU cores, since to satisfy the requirement of a total of 8 CPU cores over 4 nodes, each node would need 2 cores. So here, Slurm does this calculation for us. We'll see an example of using --ntasks with MPI in a later episode.
Being Specific
Write and submit a job script - perhaps adapted from our previous one - called multi-node-job.sh that requests a total of 2 nodes with 2 CPUs per node. Remember to include any site-specific #SBATCH parameters.
Solution
#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:00:30

echo "This script is running on ... "
hostname
sbatch multi-node-job.sh
...
squeue -u yourUsername
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6485195 cosma7-pa mul-job. yourUse+ R 0:05 2 m[7400-7401]
Here we can see that the job is using a total of two nodes as we'd like - m7400 and m7401.
Our example earlier with --nodes=4
and --ntasks=8
meant Slurm calculated that 2 cores would be needed on each of the four nodes (i.e. ntasks/nodes = 2 cores per node
). So the number of cores needed was implicit.
However, we can also specify the number of CPU cores per node explicitly using --ntasks-per-node
.
In this case, we use this with --nodes
and we don’t need to specify the total number of tasks with --ntasks
at all.
So using our above example with --nodes=4
, to get our desired total 8 CPU cores we’d specify --ntasks-per-node=2
.
Being Even More Specific
Write and submit a job script that uses --nodes and --ntasks-per-node to request a total of 2 nodes with 2 CPUs per node.
Solution
#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:00:30

echo "This script is running on ... "
hostname
Once submitted, using
squeue
we should see the same results as before, again using a total of two nodes.
We’ll be looking at how we can make full use of these parameters with MPI jobs later in this lesson.
Up until now, the number of tasks has been synonymous with the number of CPUs. However, we can also specify multiple CPUs per task by using --cpus-per-task - so if your job is multithreaded, for example it makes use of OpenMP, you can specify how many threads it needs using this parameter.
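For example, a minimal sketch of the resource request for a job running a single process with 4 OpenMP threads might look like this (assuming the rest of the script is as before; Slurm sets the SLURM_CPUS_PER_TASK environment variable to match the request):

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Tell OpenMP to use as many threads as the CPUs we requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK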
What about Parameters Named -A, -N, -J and others?
You may also see some other shorter parameter names. In Slurm, some
SBATCH
options also have shortened forms, but mean the same thing. For example:
Long form        Short form
--job-name       -J
--nodes          -N
--ntasks         -n
--cpus-per-task  -c
Not all parameters have a short form, for example --ntasks-per-node and --exclusive.
In practice you can use either, although using the longer form is more verbose and helps remove ambiguity, particularly for newer users.
Only Request What you Require
When specifying the resources your job will need, it's important not to ask for too much. Firstly, because any resources you request but don't use (e.g. CPUs, memory, GPUs) will be wasted and potentially cost more in terms of your account usage, but also because requesting larger resources will take longer to queue. It also means that other users' jobs that would have been a better fit for these resources may take longer to run. It's good to remember that we are part of a wider community of HPC users, and as with any shared resource, we should act responsibly when using it.
As we’ll see later, using the
sacct
command we’re able to find out what resources previous jobs actually used, and then optimise the resources we request in future job runs.
Managing Job Submissions
Monitoring a Job
As we’ve seen, we can check on our job’s status by using the command squeue
. Let’s take a look in more detail.
squeue -u yourUsername
You may find it looks like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5791510 cosma7-pa example- yourUser PD 0:00 1 (Priority)
So -u yourUsername
shows us all jobs associated with our machine account. We can also use -j
to query specific
job IDs, e.g.: squeue -j 5791510
which will, in this case, yield the same information since we only have that job
in the queue (if it hasn’t already completed!).
In either case, we can see all the details of our job, including the partition, user, and also the state of the job (in the ST
column).
In this case, we can see it is in the PD or PENDING state.
Typically, a successful job run will go through the following states:
- PD - pending: sometimes our jobs might need to wait in a queue first before they can be allocated to a node to run
- R - running: the job has an allocation and is currently running
- CG - completing: the job is in the process of completing
- CD - completed: the job is completed
For pending jobs, helpfully, you may see a reason for this in the NODELIST(REASON)
column;
for example, that the nodes required for the job are down or reserved for other jobs,
or that the job is queued behind a higher priority job that is making use of the requested resources.
Once it’s able to run, the nodes that have been allocated will be displayed in this column instead.
However, in terms of job states, there are a number of reasons why jobs may end due to a failure or other condition, including:
- OOM - out of memory: the job attempted to use more memory during execution than was available
- S - suspended: the job has an allocation, but it has been suspended to allow other jobs to use the resources
- CA - cancelled: the job was explicitly cancelled, either by the user or a system administrator, and may or may not have been started
- F - failed: the job has terminated with a non-zero exit code or has failed for another reason
You can get a full list of job status codes via the SLURM documentation.
Cancelling a Job
Sometimes we’ll make a mistake and need to cancel a job. This can be done with
the scancel
command. Let’s submit a job and then cancel it using
its job number (remember to change the walltime so that it runs long enough for
you to cancel it before it is killed!).
sbatch example-job.sh
squeue -u yourUsername
Submitted batch job 5791551
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5791551 cosma7-pa hello-wo yourUser PD 0:00 1 (Priority)
Now cancel the job with its job number (printed in your terminal). A clean return of your command prompt indicates that the request to cancel the job was successful.
scancel 5791551
# It might take a minute for the job to disappear from the queue...
squeue -u yourUsername
...(no output when there are no jobs to display)...
Cancelling multiple jobs
We can also cancel all of our jobs at once using the -u yourUsername option. This will delete all jobs for a specific user (in this case, yourself). Note that you can only delete your own jobs.
Try submitting multiple jobs and then cancelling them all.
Solution
First, submit a trio of jobs:
sbatch example-job.sh
sbatch example-job.sh
sbatch example-job.sh
Then, cancel them all:
scancel -u yourUsername
Backfilling
When developing code or testing configurations it is usually the case that you don’t need a lot of time. When that is true and the queues are busy, backfilling is a scheduling feature that proves really useful.
If there are idle nodes, that means they are either available to run jobs now, or they are being kept free so that a larger job can run on them in the future. The time between now and when those nodes will be needed is the backfill window, and jobs that need less than that time may be scheduled to run on those resources. To take best advantage of backfilling, specify a shorter (but still realistic) time period for jobs so they can fill these windows.
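For instance, if a test job genuinely only needs ten minutes, requesting just that rather than, say, a whole day makes it far more likely to fit into a backfill window - a sketch:

#SBATCH --nodes=1
#SBATCH --time=00:10:00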
You can check if a particular Slurm scheduler is configured to use the backfill technique (which is the default) by doing the following:
scontrol show config | grep SchedulerType
This use of the scontrol
command allows us to see the configuration of Slurm, and by piping its output (using |
) to
the grep
command, we can search for a particular configuration field called SchedulerType
which holds the method used
for scheduling. If a backfill scheduler is being used, you should see the following:
SchedulerType = sched/backfill
If the builtin scheduler is used instead, then jobs will be executed strictly in order of submission, regardless of whether there are free resources to run your job immediately.
DiRAC COSMA: Handy Backfill Tools
On DiRAC's COSMA you are able to see backfill availability windows for various queues (e.g. for COSMA5, COSMA7, and COSMA8) by using a particular backfill command. For example, for COSMA5:
c5backfill
# Backfill windows available for cosma queues:

No. hosts/slots: 1 / 16
Runtime: 31 hours 49 minutes 26 seconds
Hostnames: m5336

No. hosts/slots: 5 / 80
Runtime: 72 hours 00 minutes 00 seconds
Hostnames: m[5185,5315,5337,5366,5377]
So here, we can see that on COSMA5 we have only one node available for 31 hours and 49 minutes, and 5 nodes available for 72 hours.
Inspecting and Controlling the State of Jobs
So far we’ve used the squeue
command to check the general status of pending and running jobs. But we can also obtain
even more information using the scontrol
command regarding the job’s status, configuration and use of resources.
Let's create a new job script called specific-job.sh to test this, which uses a single node and a single CPU on that node to run a single basic task:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=00:00:30
#SBATCH --job-name=specific-job
sleep 25
echo "This script is running on ... "
hostname
Next, launch our new job:
sbatch specific-job.sh
…then make a note of the job ID returned, e.g.:
...
Submitted batch job 309281
We can then use this job ID to ask Slurm for more information about it:
scontrol show jobid=309281
On DiAL3, it would look something like this (although of course, on other sites it will differ in parts):
JobId=309281 JobName=specific-job
UserId=dc-crou1(210462) GroupId=dirac(59019) MCS_label=N/A
Priority=7603 Nice=0 Account=ds007 QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:25 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2024-08-01T11:57:18 EligibleTime=2024-08-01T11:57:18
AccrueTime=2024-08-01T11:57:18
StartTime=2024-08-01T11:57:19 EndTime=2024-08-01T11:57:44 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-01T11:57:19 Scheduler=Main
Partition=devel AllocNode:Sid=d3-login01:2159102
ReqNodeList=(null) ExcNodeList=(null)
NodeList=dnode036
BatchHost=dnode036
NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=128,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/dirac3/home/dc-crou1/job-sched-testing/specific-job.sh
WorkDir=/lustre/dirac3/home/dc-crou1/job-sched-testing
StdErr=/lustre/dirac3/home/dc-crou1/job-sched-testing/slurm-309281.out
StdIn=/dev/null
StdOut=/lustre/dirac3/home/dc-crou1/job-sched-testing/slurm-309281.out
Power=
In particular, we can see:
- The JobState, which here is COMPLETED since we weren't fast enough running the scontrol command before the job finished; whilst the job is running it shows RUNNING, or PENDING if it's still waiting to be assigned to a node
- How long the job ran for (RunTime), and the job's maximum specified duration (TimeLimit)
- The job's SubmitTime, as well as the job's StartTime for execution: this may be the actual start time, or the expected start time if set in the future. The expected EndTime is also specified, although if it isn't specified directly in the job script this isn't always exactly StartTime + specified duration; it's often rounded up, perhaps to the nearest minute
- The queue assigned to the job is the devel queue, and the job ran on the dnode036 node
- The resources assigned to the job are a single node (NumNodes=1) with 128 CPU cores, for a single task with 1 CPU core per task. Note that in this case we got more resources in terms of CPUs than we asked for: we actually obtained a node with 128 CPUs (although we won't use them)
- We didn't specify a working directory within which to execute the job, so the WorkDir defaults to the directory from which the job was submitted
- The error and output file locations, as specified by StdErr and StdOut
You can find more information on scontrol
and the various fields in the Slurm documentation.
Note that typing scontrol on its own will enter an interactive state with an scontrol prompt which allows you to enter subcommands (like show or update) directly to the scontrol command. You can exit by typing quit.
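As a rough sketch (the prompt and job ID shown are illustrative), such an interactive session might look like:

scontrol
scontrol: show job 309281
scontrol: quit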
Let’s Wait a Bit…
It may be useful to defer the scheduling of a job until some point in the future. We can specify a delayed start time (or even a delayed start date) by using the --begin parameter in our job script. For example, adding #SBATCH --begin=16:00 will delay starting the job until 16:00 (assuming there is an appropriate node available at that time). We can also specify a relative delay time, for example passing now+1hour instead.
Launch the job again, but this time specify a start time of 1 minute in the future in the job script. How does the scontrol output change for a delayed job?
Solution
First add #SBATCH --begin=now+1minute to the parameters in the job script, and relaunch using sbatch.
You should see in the job's scontrol output:
- JobState=PENDING, RunTime=00:00:00, and AccrueTime is Unknown, since it's waiting to run and no runtime has yet elapsed
- EligibleTime will specify the calculated start time one minute into the future
Querying Job Resources
For more detail about jobs and the resources, there are two key Slurm commands we can use to give us more information about jobs
that are either actively running (using sstat
) or completed (using sacct
).
Querying Running Jobs
First, submit our previous job script specific-job.sh
using sbatch
,
then check when it’s actually running using squeue
(since sstat
will only work on actively running jobs),
and then when we know it’s running we can use sstat
to query which resources it’s using.
Note that it’s probably a good idea to increase the 25
value for sleep
in the job script
(along with the #SBATCH --time
parameter) so you have more time to query the job before it completes!
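As a sketch, the relevant lines in specific-job.sh could be adjusted to something like the following (the values are arbitrary; they simply give you a few minutes in which to run sstat):

#SBATCH --time=00:05:00

sleep 240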
For example, following submission via sbatch
:
squeue -j 6803898
Wait until it reaches the running state designated by R
:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6803898 cosma7-pa hello-wo yourUser R 0:00 1 (Priority)
Since we now know it’s running, we may query its active resource usage to date.
sstat -j 6803898
The first line of output is presented as a series of columns representing the various properties - or fields - associated with the job, but unfortunately you’ll notice that with so many columns it renders the output quite unreadable! Fortunately we can select which columns we wish to see, e.g.:
sstat -j 6803898 --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
So here, we’ve selected a number of field columns we’re interested in:
- JobID - the ID of the job, useful if we happen to specify multiple jobs to query, e.g. -j 6803898,6803899,6803900
- AveCPU - the average CPU time of the job
- NTasks - the total number of tasks in the job
- MinCPU - the minimum CPU time used in the job
- MaxDiskWrite - the maximum number of bytes written by the job
But looking at the output, you'll notice that no information is returned other than the field headers:
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------ ---------- -------- ---------- ------------
This is because, by default, all jobs are actually made up of a number of job steps, and batch scripts running by themselves aren’t accounted for.
A job that consists of only a batch script, like this one, typically only has two steps:
- the batch step, which is created for all jobs submitted with sbatch
- the extern step, which accounts for resources used by a job outside of aspects managed by Slurm (which includes special cases such as SSH-ing into the compute node, which may be allowed on some systems and would be accounted for here)
A job may also have additional steps,
for example with MPI jobs there will be extra processes launched which are accounted for as separate steps,
and will be listed separately.
In our case, all processing is accounted for in the batch
step, so we need to query that directly,
since by default, sstat
doesn’t include this.
We can reference the batch step directly using:
sstat -j 6803898.batch --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
And now we should see something like the following (you may need to submit it again):
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------ ---------- -------- ---------- ------------
6803898.batch 00:00:00 1 00:00:00 66
In this trivial case, so far we can see that we’ve used virtually no actual CPU, and the batch submission script has written a total of 66 bytes.
There’s also the --allsteps
flag to see a summary of all
steps, although note that this isn’t always supported on all Slurm systems:
sstat --allsteps -j 6803898 --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
So in our batch case, we’d see two steps:
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------- ---------- -------- ---------- ------------
6803898.exte+ 00:00:00 1 00:00:00 1.00M
6803898.batch 00:00:00 1 00:00:00 66
What About other Format Fields?
We’ve looked at some example fields to include in the output format, but there are many others that you may wish to consider, which you can find in the Slurm documentation.
Another way to get help on the field formats is to use
sstat --helpformat
, which handily gives you a list of all fields you can use.
Querying Completed Jobs
Once a job is complete, you'll notice that sstat no longer presents resource usage, since it only operates on active jobs. For jobs that have completed, we use sacct, which displays accounting data for Slurm jobs.
We can first use sacct on its own to list the jobs we've run today:

sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6803854 bash cosma7-pa+ yourAccou+ 28 TIMEOUT 0:0
6803854.ext+ extern yourAccou+ 28 COMPLETED 0:0
6803854.0 bash yourAccou+ 28 COMPLETED 0:0
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6803898.bat+ batch yourAccou+ 28 COMPLETED 0:0
6803898.ext+ extern yourAccou+ 28 COMPLETED 0:0
6804261_1 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_1.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_1.e+ extern yourAccou+ 28 COMPLETED 0:0
6804261_2 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_2.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_2.e+ extern yourAccou+ 28 COMPLETED 0:0
6804261_3 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_3.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_3.e+ extern yourAccou+ 28 COMPLETED 0:0
Or, for a specific job:
sacct -j 6803898
Which will display accounting information for a given job, including its constituent steps
(and we don’t need to use --allsteps
to collate this information).
By default, we’ll see something like the following:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6803898.bat+ batch yourAccou+ 28 COMPLETED 0:0
6803898.ext+ extern yourAccou+ 28 COMPLETED 0:0
As with sstat
, we are also able to customise the fields we wish to see, e.g.:
sacct -j 6803898 --format=JobID,JobName,Partition,Account,AllocCPUS,State,Elapsed,CPUTime
JobID JobName Partition Account AllocCPUS State Elapsed CPUTime
------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 00:00:25 00:11:40
6803898.bat+ batch yourAccou+ 28 COMPLETED 00:00:25 00:11:40
6803898.ext+ extern yourAccou+ 28 COMPLETED 00:00:25 00:11:40
As with sstat
, you can add many other fields too,
although note the accounting data presented for these will be different depending on the HPC system’s configuration and the job’s type.
CPUTime here is equal to AllocCPUS * Elapsed, i.e. the number of CPUs allocated to the job multiplied by the total elapsed time; in this example, 28 CPUs multiplied by 25 seconds of elapsed time gives 700 seconds, which is the 00:11:40 shown.
So Much to Remember!
Slurm has a great many commands - as well as command arguments - and it can prove difficult to remember them all. It's often helpful to make a note of those you commonly use, or make use of a pre-made reference/cheat sheet such as this one, which has a comprehensive list of commands and their arguments with helpful short descriptions.
Key Points
Use --nodes and --ntasks in a Slurm batch script to request the total number of machines and CPU cores you'd like for your job.
A typical job passes through pending, running, completing, and completed job states.
Reasons for a job's failure include running out of memory, being suspended, being cancelled, or exiting with a non-zero exit code.
Determining backfill windows allows us to work out when we may make use of idle resources.
Use scontrol to present detailed information concerning a submitted job.
Use sstat to query the used resources for an actively running job.
Use sacct to query the resource accounting information for a completed job.
Accessing System Resources using Modules
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I make use of HPC system resources such as compilers, libraries, and other tools?
What are HPC modules, and how do I use them?
Objectives
Load and use a software package.
Explain how the shell environment changes when the module mechanism loads or unloads packages.
Use modules in a job script.
On a high-performance computing system, it is seldom the case that the software we want to use - things like compilers and libraries - is available when we log in. It is installed, but we will need to “load” it before it can run.
Before we start using individual software packages, however, we should understand the reasoning behind this approach. The three biggest factors are software incompatibilities, versioning, and dependencies.
Software incompatibility is a major headache for programmers. Sometimes the
presence (or absence) of a software package will break others that depend on
it. Two of the most famous examples are Python 2 and 3 and C compiler versions.
Python 3 famously provides a python
command that conflicts with that provided
by Python 2. Software compiled against a newer version of the C libraries and
then used when they are not present will result in a nasty 'GLIBCXX_3.4.20'
not found
error, for instance.
Software versioning is another common issue. A team might depend on a certain package version for their research project - if the software version were to change (for instance, if a package was updated), it might affect their results. Having access to multiple software versions allows a set of researchers to prevent software versioning issues from affecting their results.
Dependencies are where a particular software package (or even a particular version) depends on having access to another software package (or even a particular version of another software package). For example, the VASP materials science software may depend on having a particular version of the FFTW (Fastest Fourier Transform in the West) software library available for it to work.
Environment Modules
Environment modules are the solution to these problems. A module is a self-contained description of a software package – it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
There are a number of different environment module implementations commonly
used on HPC systems: the two most common are TCL modules and Lmod. Both of
these use similar syntax and the concepts are the same so learning to use one
will allow you to use whichever is installed on the system you are using. In
both implementations the module
command is used to interact with environment
modules. An additional subcommand is usually added to the command to specify
what you want to do. For a list of subcommands you can use module -h
or
module help
. As for all commands, you can access the full help on the man
pages with man module
.
On login you may start out with a default set of modules loaded or you may start out with an empty environment; this depends on the setup of the system you are using.
Listing Available Modules
To see available software modules, use module avail
.
module avail
On COSMA, it looks something like the following, although your site output will differ:
--------------------- /cosma/local/Modules/modulefiles/mpi ---------------------
hpcx-mt/2.2 intel_mpi/2020 openmpi/4.0.3
intel_mpi/2017 intel_mpi/2020-update1 openmpi/4.0.5
intel_mpi/2018 intel_mpi/2020-update2 openmpi/4.1.1
intel_mpi/2019 mvapich2_mpi/2.3.6 openmpi/4.1.1.no-ucx
intel_mpi/2019-update1 mvapich2_mpi/2.3.6-debug openmpi/4.1.4
intel_mpi/2019-update2 mvapich2_mpi/2.3.7-1 openmpi/4.1.4-romio-lustre
intel_mpi/2019-update3 openmpi/3.0.1(default) openmpi/20190429
intel_mpi/2019-update4 openmpi/4.0.1 rockport-settings
------------------ /cosma/local/Modules/modulefiles/compilers ------------------
aocc/1.3.0 intel_comp/2019-update2
aocc/2.0.0 intel_comp/2019-update3
aocc/2.2.0 intel_comp/2019-update4
...
What About Partial Matches?
A useful feature of module avail is that it also works on partial matches that begin with a given argument. For example, module avail x would display a shortened list of any modules beginning with x. This is handy if you need to search for a particular module but can't remember the full name, or would like a succinct list of all versions of a particular module.
Using module avail, how many versions of openmpi are on your HPC system?
Solution
Typing module avail openmpi on DiRAC's COSMA HPC resource, at the time of writing we get:

--------------------- /cosma/local/Modules/modulefiles/mpi ---------------------
openmpi/3.0.1(default)  openmpi/4.0.5         openmpi/4.1.4
openmpi/4.0.1           openmpi/4.1.1         openmpi/4.1.4-romio-lustre
openmpi/4.0.3           openmpi/4.1.1.no-ucx  openmpi/20190429
So, a total of 9 module versions of openmpi. On Tursa:

--------------------------------------- /home/y07/shared/tursa-modules ---------------------------------------
openmpi/4.1.5

------------------------------ /mnt/lustre/tursafs1/apps/cuda-12.3-modulefiles -------------------------------
openmpi/4.1.5-cuda12.3

----------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles ------------------------------
openmpi/4.1.1-cuda11.4.1

----------------------------------- /mnt/lustre/tursafs1/apps/modulefiles ------------------------------------
openmpi/4.0.4 openmpi/4.1.1
On DiAL3:

------------------------------------------- /cm/shared/modulefiles -------------------------------------------
openmpi4/intel/4.0.5
And on CSD3, we have something like:

-------------------------------------- /usr/local/software/modulefiles ---------------------------------------
openmpi/3.1.4-gcc-7.2.0

----------------------------------- /usr/local/Cluster-Config/modulefiles ------------------------------------
openmpi-GDR/gnu/1.10.7_cuda-8.0 openmpi/gcc/9.2/4.0.1 openmpi/gcc/9.3/4.0.4
openmpi-GDR/gnu/2.1.1_cuda-8.0  openmpi/gcc/9.2/4.0.2 openmpi/pgi/3.0.0
Listing Currently Loaded Modules
You can use the module list
command to see which modules you currently have
loaded in your environment. If you have no modules loaded, you will see a
message telling you so.
module list
Depending on your system, you may find something like the following:
Currently Loaded Modulefiles:
1) cosma/2018 3) armforge/22.0.2 5) gadgetviewer/1.1.3
2) python/2.7.15(default) 4) hdfview/3.1.4 6) utils/201805
Depending on your site, you may find it returns a much shorter list, or perhaps No Modulefiles Currently Loaded.
More or Less Information?
Using the -l switch with module list will give you more information about the modules loaded; namely, any additional version information for each module loaded and the last date/time the module was modified on the system. Conversely, using the -t switch will give you the output in a terse format, as a simple list of modules one per line.
These switches also work with avail. Using the -l switch with this command, determine the date a particular version of a module (such as openmpi or Python) was modified.
Solution
For example, using module avail -l openmpi/4.1.4 on COSMA at the time of writing, we get:

- Package/Alias -----------------------.- Versions --------.- Last mod. -------
/cosma/local/Modules/modulefiles/mpi:
openmpi/4.1.4                                               2022/11/28 11:11:31
openmpi/4.1.4-romio-lustre                                  2022/09/14 10:48:34
Using module avail -l openmpi/pgi/3.0.0 on CSD3, we get:

- Package/Alias -----------------------.- Versions --------.- Last mod. -------
/usr/local/Cluster-Config/modulefiles:
openmpi/pgi/3.0.0                                           2018/05/17 14:25:11
Loading and Unloading Software
To gain or remove access to the typically numerous software modules we have available to us on an HPC system, we load or unload them.
Loading Software
To load a software module, we use module load
.
Whilst the DiRAC sites have some modules in common, there are many differences in what software modules are available and not all modules are available on all sites. So in this example, for simplicity whilst investigating module loading, we’ll load a different module depending on your site (so make a note of it!):
- Durham COSMA: julia
- Edinburgh Tursa: cmake
- Leicester DiAL3: ffmpeg
- Cambridge CSD3: bison
We won’t use or investigate any of the packages in any detail, but merely use them to demonstrate the use of modules. They’re handy for training purposes, since the module names equate to the commands used to run them. Note that some of these commands are actually available on multiple sites across DiRAC.
Initially, our module is not loaded. We can test this by using the which
command.
which
looks for programs the same way that Bash does, so we can use
it to tell us where a particular piece of software is stored.
So on DiAL3, we could do the following:
which ffmpeg
On your own site, substitute ffmpeg
with the module above for your site.
You’ll likely get something like the following, complaining that it can’t find the command within our environment:
/usr/bin/which: no ffmpeg in (/cm/local/apps/lua/5.4.0/bin:/home/dc-crou1/.local/bin:/home/dc-crou1/bin:/cm/shared/apps/hwloc/1.11.11/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
So we can now try to load our module with module load
,
so for DiAL3, for example:
module load ffmpeg
which ffmpeg
Which now shows us, in the case of DiAL3 and ffmpeg
:
/cm/shared/apps/ffmpeg/5.0.1/bin/ffmpeg
Why Not Specify the Version of the Module?
Note that for simplicity we aren't specifying the precise version of the module we want here. However, feel free to use module avail <module_name> to determine the versions available on your HPC system and then load a specific version if you wish, e.g. module load julia/1.9.1.
At some point or other, you will run into issues where only one particular version of some software will be suitable. Perhaps a key bugfix only happened in a certain version, or version X broke compatibility with a file format you use. In either of these example cases, it helps to be very specific about what software is loaded.
So, what just happened?
To understand the output, first we need to understand the nature of the $PATH
environment variable. $PATH
is a special environment variable that controls
where a UNIX system looks for software. Specifically $PATH
is a list of
directories (separated by :
) that the OS searches through for a command
before giving up and telling us it can’t find it. As with all environment
variables we can print it out using echo
.
echo $PATH
On COSMA (with the Julia module loaded) this looks like:
/cosma/local/julia/1.9.1:/cosma/local/matlab/R2020b/bin:/cosma/local/gadgetviewer/1.1.4/bin:/cosma/local/hdfview/HDFView/3.1.4/bin:/cosma/local/arm/forge/22.0.2/bin:/cosma/local/Python/2.7.15/bin:/cosma/local/bin:/usr/lib64/qt-3.3/bin:/cosma/local/Modules/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
You’ll notice a similarity to the output of the which
command. In this case,
there’s only one difference: the different directory at the beginning. When we
ran the module load
command, it added a directory to the beginning of our
$PATH
.
Let’s examine what’s there (your particular path will differ depending on your site and the command):
ls /cosma/local/julia/1.9.1
So for Julia’s directory location on COSMA, it looks like:
base contrib etc LICENSE.md README.md test VERSION
CITATION.bib CONTRIBUTING.md HISTORY.md Makefile src THIRDPARTY.md
CITATION.cff deps julia Make.inc stdlib usr
cli doc julia.spdx.json NEWS.md sysimage.mk usr-staging
Taking this to its conclusion, module load
will therefore add software to your $PATH
,
which is what is meant by loading software: we are essentially changing our command line environment
so we are able to make use of the software.
What About Loading Dependencies?
A special note on this - depending on which version of the module program is installed at your site, module load may also load required software dependencies as well, or make specific mention that other modules need to be loaded beforehand.
To demonstrate, on DiRAC's COSMA resource, let's assume we want to load a particular version of OpenMPI:
module load openmpi/4.1.4
In this case, at the time of writing we get the following:
A compiler must be chosen before loading the openmpi module.
Please load one of the following compiler modules:

aocc_comp/4.0.0
gnu_comp/11.1.0
gnu_comp/13.1.0
gnu_comp/9.3.0
intel_comp/2022.1.2
intel_comp/2022.3.0
So here, we need to explicitly load one of these compiler options before we are able to load OpenMPI, e.g. module load gnu_comp/13.1.0. Depending on your system and how it's configured, your mileage will differ!
How Loading Affects the Environment
Note that this module loading process happens principally through
the manipulation of environment variables like $PATH
. There
is usually little or no data transfer involved.
The module loading process manipulates other special environment variables as well, including variables that influence where the system looks for software libraries, and sometimes variables which tell commercial software packages where to find license servers.
The module command also restores these shell environment variables to their previous state when a module is unloaded.
If we need such detail, we are able to see the changes that would be made to our environment using module display
.
For example, on Tursa with cmake
:
module display cmake
-------------------------------------------------------------------
/home/y07/shared/tursa-modules/cmake/3.27.4:
conflict cmake
prepend-path PATH /home/y07/shared/utils/core/cmake/3.27.4/bin
prepend-path CPATH /home/y07/shared/utils/core/cmake/3.27.4/include
prepend-path LD_LIBRARY_PATH /home/y07/shared/utils/core/cmake/3.27.4/lib
prepend-path MANPATH /home/y07/shared/utils/core/cmake/3.27.4/man
-------------------------------------------------------------------
So here, we can see that loading version 3.27.4 of cmake
will add /home/y07/shared/utils/core/cmake/3.27.4/bin
to the start of our path.
We can also see that it adds /home/y07/shared/utils/core/cmake/3.27.4/man
to a variable called $MANPATH
,
which is a specific path that contains locations of additional software manual pages we can access.
Once cmake
is loaded, we are thus able to then use man cmake
to access its manual page,
which is really useful in general for seeing information about commands, their parameters, and how to use them.
Loading Multiple Versions of the Same Module?
You may ask: what if we load multiple versions of the same module? Depending on how your system is configured, this may be possible, e.g. on COSMA:

module load julia/1.9.1
module load julia/1.5.3

In some cases, you may encounter incompatibilities or dependency conflicts, particularly with underlying libraries. However, you may not see any error at all, which could give rise to confusion. One way around this would be to exit your current terminal session and reconnect to the HPC resource, which will reset your environment. But what about within the same login session? To remedy this, see the next section for how to unload modules.
Unloading Software
Conversely, we may wish to unload modules we have previously loaded. This is useful if we no longer need to use a module, or require another version of the module. In general, it’s always good practice to unload modules you aren’t currently using.
For example, assuming we already have Julia loaded, we can unload it using, e.g. on COSMA, with julia
:
module unload julia
Depending on your site, use module unload
with the module you loaded earlier.
Note we don’t have to specify the version number. Once unloaded, our environment no longer allows us to make use of the software until we load it again.
If we want to unload all modules in our environment, we can use the module purge command. But be aware that this will also remove any modules that are loaded automatically by default upon login.
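A quick way to see its effect (assuming nothing is re-loaded automatically in your session) is:

module purge
module list

After this, module list should report that no modules are currently loaded.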
Using Software Modules in Scripts
We’ve so far explored how to load modules within an interactive command line session, but if we want to make use of modules in our jobs we also need to load them in our job scripts so they are loaded on compute nodes when the job runs.
Create a job that shows the version of the software whose module you loaded earlier, e.g.:
- CSD3: bison --version
- DiAL3: ffmpeg -version (note it's only using one hyphen!)
- Tursa: cmake --version
- COSMA: julia --version
Remember, no software is loaded by default! Running a job is very similar to logging on to the system, therefore you should not assume a module loaded on the login node is loaded on a compute node.
Solution
In version-module.sh (again, replacing yourAccount and aPartition, but also replacing cmake with the command for your site if it isn't cmake):

#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:00:30

module load cmake

cmake --version
sbatch version-module.sh
Key Points
HPC systems use a module loading/unloading system to provide access to software.
To see the available modules on a system, we use module avail.
The software installed across the DiRAC sites can be different in terms of what's installed and the versions that are available.
module list will show us which modules we currently have loaded.
We use module load and module unload to grant and remove access to modules on the system.
We should only keep loaded those modules we actively wish to use, and try to avoid loading multiple versions of the same software.
Using Different Job Types
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What types of job can I run on HPC systems?
How do I run a job that uses a node exclusively?
How can I submit an OpenMP job that makes use of multiple threads within a single CPU?
How do I submit a job that uses Message Passing Interface (MPI) parallelisation?
How can I submit the same job many times with different inputs?
How can I interactively debug a running job?
Objectives
Use a popular compiler to compile a C program before testing and submitting it.
Use a compute node for a job exclusively.
Describe how an OpenMP job parallelises its computation.
Compile and submit an OpenMP job.
Describe how an MPI job parallelises its computation.
Compile and submit an MPI job.
Highlight the key differences between OpenMP and MPI jobs.
Define and submit an array job to execute multiple tasks within a single job.
Use an interactive job to run commands remotely on a compute node.
So far we’ve learned about the overall process and the necessary “scaffolding” around job submission; using various parameters to configure a job to use resources, making use of software installed on compute nodes by loading and unloading modules, submitting a job, and monitoring it until (hopefully successful) completion. Using what we’ve seen so far, let’s take this further and look at some key types of job that we can run on HPC systems to take advantage of various types of parallelisation, using examples written in the C programming language. We’ll begin with a simple serial hello world example, and briefly explore various ways that code is parallelised and run to make best use of such systems.
Serial
With a serial job, we run a single job on one node within a single process. Essentially, this is very similar to running some code via the command line on your local machine. Let's take a look at a simple example written in C (the full code can also be found in hello_world_serial.c).
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
  printf("Hello world!\n");
}
After copying this into a file called hello_world_serial.c
, we can then compile and run it, e.g.:
gcc hello_world_serial.c -o hello_world_serial
./hello_world_serial
Depending on your system, you may need to preload a module to compile C (perhaps either using cc
or gcc
).
You should then see Hello world!
printed to the terminal.
Be Kind to the Login Nodes
It’s worth remembering that the login node is often very busy managing lots of users logged in, creating and editing files and compiling software, and submitting jobs. As such, although running quick jobs directly on a login node is ok, for example to compile and quickly test some code, it’s not intended for running computationally intensive jobs and these should always be submitted for execution on a compute node.
The login node is shared with all other users and your actions could cause issues for other people, so think carefully about the potential implications of issuing commands that may use large amounts of resource.
Now, given this is a very simple serial job, we might write the following hw_serial.sh
Slurm script to execute it:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_serial
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1M
./hello_world_serial
Note here we are careful to specify only what resources we think we need. In this case, a single node running a single process and very little memory (in fact, very likely a great deal less memory than that).
As before, we can submit this using sbatch hw_serial.sh, monitor it with squeue -u yourUsername until completion, and then look at the slurm-<job_number>.out file to see the Hello world! output.
Making Exclusive use of a Node
We can use #SBATCH --exclusive to indicate we'd like exclusive access to the nodes we request, such that they are shared with no other jobs, regardless of how many CPUs we actually need. If we are running jobs that require particularly large amounts of memory, CPUs, or disk access, this may be important. However, as you might suspect, requesting exclusive use of a node may mean it takes some time to be allocated a whole node in which to run. Plus, as a responsible user, be careful to ensure you only request exclusive access to a node when your job needs it!
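For example, a minimal sketch of how our earlier serial script might look with exclusive access requested (the account and partition values are placeholders, as before):
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_exclusive
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Request the whole node for this job, so no other jobs are scheduled alongside it
#SBATCH --exclusive
./hello_world_serial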
Multi-threaded via OpenMP
OpenMP allows programmers to identify and parallelize sections of code, enabling multiple threads to execute them concurrently. This concurrency is achieved using a shared memory model, where all threads can access a common memory space and communicate through shared variables.
So with OpenMP, think of your program as a team with a leader (the master thread) and workers (the worker threads). When your program starts, the leader thread takes charge. It identifies parts of the code that can be done at the same time and marks them; these marked parts are like tasks to be completed by the workers. The leader then gathers a group of worker threads, and each worker tackles one of these marked tasks, working independently. Then, once all the workers are done, they come back to the leader, and the leader continues with the rest of the program.
Let’s look at a parallel version of hello world, which launches a number of threads. You can find the code below in hello_world_omp.c.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
int num_threads, t;
int *results;
// Use an OpenMP library function to get the number of threads
num_threads = omp_get_max_threads();
// Create a buffer to hold the integer results from each thread
results = malloc(sizeof(*results) * num_threads);
// In parallel, within each thread available, store the thread's
// number in our shared results buffer
#pragma omp parallel shared(results)
{
int my_thread = omp_get_thread_num();
results[my_thread] = my_thread;
}
for (t = 0; t < num_threads; t++)
{
printf("Hello world thread number received from thread %d\n", t);
}
}
OpenMP makes use of compiler directives to indicate which sections we wish to run across parallel worker threads within a single node. Compiler directives are special comments that are picked up by the C compiler and tell the compiler to behave a certain way with the code being compiled.
How Does it Work?
In this example we use the #pragma omp parallel OpenMP compiler directive around a portion of the code, so each worker thread will run this in parallel. The number of threads that will run is set by the system and obtained using omp_get_max_threads().
We also need to be clear about how variables behave in parallel sections, in particular to what extent they are shared between threads or private to each thread. Here, we indicate that the results array is shared and accessible across all threads within this parallel code portion, since in this case we want each worker thread to add its thread number to our shared array.
Once this parallelised section's worker threads are complete, the program resumes serial, single-threaded execution within the master thread, and outputs the results array containing all the worker thread numbers.
Now before we compile and test it, we need to indicate how many threads we wish to run. This is specified in a special environment variable, OMP_NUM_THREADS, which is picked up by the program at runtime, so we'll set that first:
export OMP_NUM_THREADS=3
gcc hello_world_omp.c -o hello_world_omp -fopenmp
./hello_world_omp
And we should see the following:
Hello world thread number received from thread 0
Hello world thread number received from thread 1
Hello world thread number received from thread 2
If we wish to submit this as a job to Slurm, we also need to write a submission script that reflects the fact that this is an OpenMP job, so let's put the following in a file called hw_omp.sh:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_omp
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem=50K
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./hello_world_omp
So here we’re requesting a single node (--nodes=1
) running a single process (--ntasks=1
),
and that we’ll need three CPU cores (--cpus-per-task=3
) - each of which will run a single thread.
Next, we need to set OMP_NUM_THREADS
as before,
but here we set it to a special Slurm environment variable (SLURM_CPUS_PER_TASK
) that is set by Slurm to hold
the --cpus-per-task
we originally requested (3
).
We could have simply set this to three, but this method ensures that the number of threads that will run will match whatever we requested.
So if we change this request value in the future, we only need to change it in one place.
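As an aside, if you also want to run the same export line outside of a Slurm job (where SLURM_CPUS_PER_TASK is not set), one possible approach is to fall back to a default value using Bash's default-expansion syntax:
# Use the Slurm-provided value if present, otherwise default to a single thread
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}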
If we submit this script using sbatch hw_omp.sh
, you should see something like the following in the Slurm output file:
Hello world thread number received from thread 0
Hello world thread number received from thread 1
Hello world thread number received from thread 2
Multi-process via Message Passing Interface (MPI)
Our previous example used multiple threads (i.e. parallel execution within a single process). Let’s take this parallelisation one step further to the level of a process, where we run separate processes in parallel as opposed to threads.
At this level, things become more complicated! With OpenMP we had the option to maintain access to variables across our threads, but between processes, memory isn’t shared, so if we want to share information between these processes we need another way to do it. MPI uses a distributed memory model, so communication is done via sending and receiving messages between processes.
Now despite this inter-process communication being a greater overhead, in general our master/worker model still holds. In MPI, from the outset, when an MPI-enabled program is run, we have a number of processes executing in parallel. Each of these processes is referred to as a rank, and one of these ranks (typically rank zero) is a coordinating, or master, rank.
So how does this look in a program? You can find the code below in hello_world_mpi.c.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char** argv) {
int my_rank, n_ranks;
int *resultbuf;
int r;
MPI_Init(&argc, &argv);
// Obtain the rank identifier for this process, and the total number of ranks
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
// Create buffer to hold rank numbers received from all ranks
// This will include the coordinating rank (typically rank 0),
// which also does the receiving
resultbuf = malloc(sizeof(*resultbuf) * n_ranks);
// All ranks send their rank identifier to rank 0
MPI_Gather(&my_rank, 1, MPI_INT, resultbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
// If we're the coordinating rank (typically designated as rank 0),
// then print out the rank numbers received
if (my_rank == 0) {
for (r = 0; r < n_ranks; r++) {
printf("Hello world rank number received from %d\n", resultbuf[r]);
}
}
MPI_Finalize();
}
This program is a fair bit more complex than the OpenMP one, since here we need to explicitly coordinate the sending and receiving of messages and do some housekeeping for MPI itself, such as setting up MPI and shutting it down.
How Does it Work?
After initialising MPI, in a similar vein to how we obtained the number of threads and each thread's identity with OpenMP, we obtain the total number of ranks (processes) and our own rank number. In this example, for simplicity, we use a single MPI function, MPI_Gather, to send the rank numbers from each separate process to the coordinating rank. Essentially, each rank sends my_rank (as an MPI_INT, basically an integer) to rank 0, which receives all responses, including its own, within resultbuf. Finally, if the rank is the coordinating rank, the results are output. The if (my_rank == 0) condition is important, since without it all ranks would attempt to print the results, because with MPI, typically all processes run the entire program.
Let's compile this now. First, we may need to load some modules to provide us with the correct compiler and an implementation of MPI, so we can compile our MPI code.
On DiRAC’s COSMA, this looks like:
module load gnu_comp
module load openmpi
We can also load specific versions if we wish:
module load gnu_comp/13.1.0
module load openmpi/4.1.4
Note that on many sites there are often a number of compiler and MPI implementation options,
but for the purposes of this training we’ll use openmpi
with a readily available C compiler
(or what is provided by the system by default).
Other DiRAC Sites?
On Cambridge’s CSD3 and Edinburgh’s Tursa:
module load openmpi
On Leicester’s DiAL3 we need to do something more specific, such as:
module load gcc/10.3.0/picedk
module load openmpi/4.1.6/ol2kfe
Once we’ve loaded these modules we can compile this code:
mpicc hello_world_mpi.c -o hello_world_mpi
So note we need to use a specialised compiler, mpicc
, to compile this MPI code.
Now we’re able to run it, and specify how many separate processes we wish to run in parallel.
However, since this is a multi-processing job we should submit it via Slurm.
Let’s create a new submission script named hw_mpi.sh
that executes this MPI job:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_mpi
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --mem=1M
module load openmpi
mpiexec -n ${SLURM_NTASKS} ./hello_world_mpi
On this particular HPC setup on DiRAC’s COSMA, we need to load the OpenMPI module so we can run MPI jobs.
For your particular site, substitute the MPI module load
command you used previously.
Note also that we specify 3
to --ntasks
this time to reflect the number of processes we wish to run.
You’ll also see that with MPI programs we use mpiexec
to run them,
and specifically state the number of MPI processes we specified in --ntasks
by using the Slurm environment variable SLURM_NTASKS
,
so mpiexec
will use 3 processes in this case.
Efficient use of Resources
Take care to use --ntasks correctly when submitting non-MPI jobs. Setting it to a number greater than 1 will have the effect of running the job that many times, regardless of whether it uses MPI or not. The multiple runs will also overwrite, rather than append to, the job's output log file, so the fact that this has happened may not be obvious.
In the Slurm output file you should see something like:
Hello world rank number received from rank 0
Hello world rank number received from rank 1
Hello world rank number received from rank 2
Array
So we’ve seen how parallelisation can be achieved using threads and processes, but using a sophisticated job scheduler like Slurm, we are able to go a level higher using job arrays, where we specify how many separate jobs (as tasks) we want running in parallel instead.
One way we might do this is using a simple for loop within a Bash script to submit multiple jobs. For example, to submit three jobs, we could do:
for JOB_NUMBER in {1..3}; do
sbatch a-script.sh $JOB_NUMBER
done
In certain circumstances this may be a suitable approach, particularly if each task differs substantially, or you need explicit control over each subtask, such as changing the resource request parameters for each job.
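Note that a-script.sh itself isn't shown in this lesson; purely as a hypothetical sketch, such a script could pick up the job number passed on the sbatch command line via its first positional argument:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Arguments given after the script name to sbatch are passed to the script,
# so $1 holds the job number supplied by the loop
JOB_NUMBER=$1
echo "Running job number ${JOB_NUMBER}"
Each job could then use ${JOB_NUMBER} to select its own input file or parameters, much as the array job below uses its task identifier.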
However, job arrays have some unique advantages over this approach;
namely that job arrays are self-contained, so each array task is linked to the same job submission ID,
which means we can use sacct
and squeue
to query them as a whole submission.
Plus, we need no additional code to make it work, so it’s generally a simpler way to do it.
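For example, assuming the whole submission was given the job ID 6803105 (the ID used in the example further below), we could query all of its tasks together; the exact columns returned will depend on your site's configuration:
# Show any pending or running tasks belonging to the array job
squeue -j 6803105
# After completion, summarise the state and elapsed time of each task
sacct -j 6803105 --format=JobID,JobName,State,Elapsed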
To make use of a job array approach in Slurm, we add an additional --array
parameter to our submission script.
So let’s create a new hello_job_array.sh
script that uses it:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_job_array
#SBATCH --array=1-3
#SBATCH --output=output/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1M
echo "Task $SLURM_ARRAY_TASK_ID"
grep hello input/input_file_$SLURM_ARRAY_TASK_ID.txt
sleep 30
We’ve introduced a few new things in this script:
#SBATCH --array=1-3 - this job will create three array tasks, with task identifiers 1, 2, and 3.
#SBATCH --output=output/array_%A_%a.out - this explicitly specifies what we want our output file to be called and where it should be located, and will collect any output printed to the console from the job. %A will be replaced with the overall job submission id, and %a will be replaced by the array task id. So, assuming a job id of 123456, the first array task will have an output file called array_123456_1.out which will be stored in the output directory. Naming the output files in this way separates them by job id as well as array task id.
#SBATCH --error=err/array_%A_%a.err - similarly to --output, this will store any error output for this array task (i.e. messages output to the standard error) in the specified error file in the err directory.
$SLURM_ARRAY_TASK_ID is a shell environment variable that holds the number of the individual array task running.
grep hello input/input_file_$SLURM_ARRAY_TASK_ID.txt - here we use the grep command to search for the word "hello" within an input file with the filename input_file_$SLURM_ARRAY_TASK_ID.txt in the input directory, where $SLURM_ARRAY_TASK_ID will be replaced with the array task id. For example, for the first array task, the input file will be called input_file_1.txt. We've used grep as an example command, but this technique can be applied to any program that accepts inputs in this way.
Given the jobs are trivial and finish very quickly,
we’ve added a sleep 30
command at the end so each task takes an additional 30 seconds to run,
so that you should be able to see the array job in the queue before it disappears from the list.
Before we submit this job, we need to prepare some input and output directories and some input files for it to use.
mkdir input
mkdir output
mkdir err
In the input
directory, make some text files with the filenames input_file_1.txt
, input_file_2.txt
, and input_file_3.txt
with some text that somewhere includes hello
in it, e.g.
echo "hello there my friend" > input/input_file_1.txt
echo "hello world" > input/input_file_2.txt
echo "well hello, can you hear me?" > input/input_file_3.txt
Separating Input and Output Using Directories
A common technique for structuring where input and output should be located is to have them in separate directories. This is an established practice that ensures that input and output are kept separate for a computational process, and therefore cannot easily be confused. It can be tempting to just have all the files in a single directory, but when running multiple jobs with potentially multiple inputs and outputs, things can quickly become unmanageable!
If we now submit this job with sbatch
and then use squeue -j jobID
we should see something like the following,
a single entry but with [1-3]
in the JOBID
indicating the three subtasks as part of this job:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6803105_[1-3] cosma7-pa hello_jo yourUser PD 0:00 1 (Priority)
Once complete, we'll find three separate job output log files in the output directory: array_6803105_1.out, array_6803105_2.out, and array_6803105_3.out, each corresponding to a specific array task.
For example, for array_6803105_1.out we should see something like:
Task 1
hello there my friend
Canceling Array Jobs
We are able to completely cancel an array job and all its tasks using scancel jobID, as with a normal job. With the job above, scancel 6803105 would do this, as would scancel 6803105_[1-3], where we indicate explicitly all the tasks in the range 1-3. But we're able to cancel individual tasks as well using, for example, scancel 6803105_1 for the first task.
Submit the array job again and use scancel to cancel only the second and third tasks.
Solution
scancel 6803105_2
scancel 6803105_3
Or:
scancel 6803105_[2-3]
As with cancelling a normal job, the output log files for the tasks will still be produced, containing any output up until the point the tasks are cancelled.
Interactive
We’ve seen that we can use the login nodes on an HPC resource to test our code (in a small way) before we submit it. But sometimes when developing more complex code it would be useful to have access in some way to the compute nodes themselves, particularly to explore or debug an issue. For this we can use interactive jobs.
By reserving a compute node explicitly for this purpose, an interactive job will grant us an interactive session on a compute node that meets our job requirements, although of course, as with any job, this may not be granted immediately! Then, once the interactive session is running, we are able to enter commands and have their output visible on our terminal as if we had direct access to the compute node.
To submit a request for an interactive job where we wish to reserve a single node and two cores,
we can use Slurm’s srun
command:
srun --account=yourAccount --partition=aPartition --nodes=1 --ntasks-per-node=2 --time=00:10:00 --pty /bin/bash
So as well as the account/partition and the number of nodes and cores, we are requesting 10 minutes of interactive time (after which the interactive job will exit), and that the job will run a Bash shell on the node which we’ll use to interact with the job.
You should see the following, indicating the job is waiting, and then hopefully soon afterwards, that the job has been allocated a suitable node:
srun: job 5608962 queued and waiting for resources
srun: job 5608962 has been allocated resources
[yourUsername@m7443 ~]$
At this point our interactive job is running our Bash shell remotely on compute node m7443.
We can also verify that we are on a compute node by entering hostname
,
which will return the host name of the compute node on which the job is running.
At this point, we are able to use the module
, srun
and other commands
as they might appear within our job submission scripts:
[yourUsername@m7443 ~]$ srun --ntasks=2 ./hello_world_mpi
Hence, if this MPI code were faulty, as we encounter issues we have the opportunity to diagnose them in real time, fix them, and re-run our code to test it again.
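For instance, on a system like COSMA the fix, compile, and test cycle might look something like the following within the session (substituting the module load commands appropriate to your own site, as before):
[yourUsername@m7443 ~]$ module load gnu_comp openmpi
[yourUsername@m7443 ~]$ mpicc hello_world_mpi.c -o hello_world_mpi
[yourUsername@m7443 ~]$ srun --ntasks=2 ./hello_world_mpi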
When you wish to exit your session use exit
, or Ctrl-D
.
You can check that your session is completed using squeue -j
with the job ID as normal.
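For example, using the job ID allocated above:
squeue -j 5608962
If the interactive session has ended, the job will no longer appear in the queue listing.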
Interactive Sessions: Watch your Budget!
Importantly, note that whilst the interactive session is active your allocation is consuming budget, just as with a normal job, so be careful not to leave an interactive session idle!
Combined Power (with a Note of Caution)
For many job types it may also make sense to combine them together. As we've seen, we're able to run MPI jobs on a compute node over an interactive session, and one very powerful approach to parallelism involves using both multi-threading (OpenMP) and multi-processing (MPI) at the same time (known as hybrid OpenMP/MPI). It's also very possible to configure jobs to make use of job arrays with these approaches too.
As you may imagine, using multiple approaches offers tremendous flexibility and power to vastly scale what you are able to accomplish, although it’s worth remembering that these also have the tradeoff of consuming more resources, and thus more budget, at a commensurate rate. When running applications (or developing applications to run) on HPC resources it’s therefore strongly recommended to first start with small, simple jobs until you have high confidence their behaviour is correct.
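As a rough sketch only (the hello_world_hybrid program is hypothetical and would need to be written and compiled with both MPI and OpenMP support, and the account, partition, and resource numbers are placeholders), a hybrid submission script might combine the MPI and OpenMP parameters we've already seen:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_hybrid
#SBATCH --time=00:05:00
#SBATCH --nodes=1
# Two MPI ranks, each allocated three CPU cores for its OpenMP threads
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
module load openmpi
# Each MPI rank spawns as many OpenMP threads as the CPUs allocated per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
mpiexec -n ${SLURM_NTASKS} ./hello_world_hybrid
Exactly how ranks and threads should be distributed and pinned to cores varies between systems, so check your site's documentation before running hybrid jobs at scale.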
Key Points
The login node is shared with all other users, so be careful not to run computationally intensive programs on it.
OpenMP allows programmers to specify sections of code to execute in parallel threads within a single node, using compiler directives.
MPI programs make use of message-passing between multiple processes to coordinate communication.
OpenMP uses a shared memory model, whilst MPI uses a distributed memory model.
An array job is a self-contained set of multiple jobs - known as tasks - managed as a single job.
Interactive jobs allow us to interact directly with a compute node, so are very useful for debugging and exploring the running of code in real-time.
Interactive jobs consume resources while an interactive session is active, so we must be careful to use them efficiently.
Run and develop applications at a small scale first, before making use of powerful scaling techniques, to avoid potentially expensive consumption of resources.
Survey