Introduction to HPC Job Scheduling
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is a job scheduler and why does a cluster need one?
How do I find out what parameters to use for my Slurm job?
How do I submit a Slurm job?
What are DiRAC project allocations and how do they work?
Objectives
Describe briefly what a job scheduler does.
Recap the fundamentals of Slurm job submission and monitoring.
Use Slurm and SAFE to determine the submission parameters to use to submit a job.
Summarise the main ways researchers may request time on a DiRAC facility.
Job Scheduling: A Recap
An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
The tasks of a job scheduler can be compared to those of a waiter in a restaurant. If you can relate to an instance where you had to wait for a while in a queue to get into a popular restaurant, then you may now understand why sometimes your job does not start instantly, as it would on your laptop.
The scheduler used here is Slurm, and although Slurm is not used everywhere, running jobs is quite similar regardless of what software is being used. The exact syntax might change, but the concepts remain the same.
Fundamentals of Job Submission and Monitoring: A Recap
You may recall from a previous course, or your own experience, two basic Slurm commands: sbatch, to submit a job to an HPC resource, and squeue, to query the state of a submitted job.
When we submit a job, we typically write a Slurm script which embodies the commands we wish to run on a compute node. A job script (e.g. basic-script.sh) is typically written using the Bash shell language, and a very minimal example looks something like this (save this in a file called basic-script.sh, but don't try to submit it just yet!):
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --time=00:00:30
date
The #SBATCH lines are special comments that provide additional information about our job to Slurm;
for example the account and partition, the maximum time we expect the job to take when running,
and the number of nodes we’d like to request (in this case, just one).
We’ll look at other parameters in more detail later,
but let’s focus on specifying a correct set of minimal parameters first.
--account
It’s important to note that what you specify for the --account
parameter is not your
machine login username or SAFE login; it’s the project account to which you have access.
Project accounts are assigned an allocation of resources (such as CPU or disk space),
and to use them, you specify the project account code in the --account
parameter.
You can find the projects to which you have access in your DiRAC SAFE account. To see them, after you log in to SAFE, select Projects from the top navigation bar and select one of your projects to see further details, where you'll find the project's account code to use.
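As a hypothetical illustration, if your SAFE project page listed the account code dp004 (yours will differ), the corresponding line in the job script would be:

#SBATCH --account=dp004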
--partition
The underlying mechanism that enables the scheduling of jobs in schedulers like Slurm is the queue (or partition). A queue represents a list of (to some extent) ordered jobs to be executed on HPC compute resources, and sites often have many different queues which represent different aspects, such as the level of prioritisation for jobs or the capabilities of compute nodes. So when you submit a job, it will enter one of these queues to be executed. How these queues are set up across multi-site HPC systems such as DiRAC can differ, depending on local institutional infrastructure configurations, user needs, and site policies.
You can find out the queues available on a DiRAC site, and their current state, using:
sinfo -s
The -s
flag curtails the output to only a summary,
whereas omitting this flag provides a full listing of nodes in each queue and their current state.
You should see something like the following. This is an example from COSMA:
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cosma7 up 3-00:00:00 80/137/7/224 m[7229-7452]
cosma7-pauper up 1-00:00:00 80/137/7/224 m[7229-7452]
cosma7-shm up 3-00:00:00 0/1/0/1 mad02
cosma7-shm-pauper up 1-00:00:00 0/1/0/1 mad02
cosma7-bench up 1-00:00:00 0/0/1/1 m7452
cosma7-rp up 3-00:00:00 82/140/2/224 m[7001-7224]
cosma7-rp-pauper* up 1-00:00:00 82/140/2/224 m[7001-7224]
Here we can see the general availability of each of these queues (also known as partitions), as well as the maximum
time limit for jobs on each of these queues (in days-hours:minutes:seconds
format).
For example, on the cosma7
queue there is a 3-day limit, whilst the cosma7-pauper
queue has a 1-day limit. The *
beside the partition name indicates it is the default queue (although it’s always
good practice to specify this explicitly).
Queues on Other Sites
On other DiRAC sites, the queues displayed with sinfo will look different, for example:
- Edinburgh's Tursa: there are cpu and gpu queues, as well as other gpu-specific queues.
- Cambridge's CSD3: there are a number of queues for different CPU and GPU architectures, such as cclake, icelake, sapphire, and ampere.
- Leicester's DiAL3: there are two queues named slurm and devel, with slurm referring to the main HPC resource and devel for development or testing use (i.e. short running jobs).
Note that the queues to which you have access will depend on your allocation setup, and this may not include the default queue (for example, if you only have access to GPU resources which are accessible on their own queue, like on Tursa, you’ll need to use one of these queues instead).
To find out more information on queues, you can use the scontrol show command, which allows you to view the configuration of Slurm and its current state. So to see a breakdown of a particular queue, you can do the following (replacing <partition_name>):
scontrol show partition=<partition_name>
So on COSMA, for example:
scontrol show partition=cosma7
An example of the output, truncated for clarity:
PartitionName=cosma7
AllowGroups=cosma7 AllowAccounts=do009,dp004,dp012,dp050,dp058,dp060,dp092,dp121,dp128,dp203,dp260,dp276,dp278,dp314,ds007 AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=3-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=m[7229-7449]
...
In particular, we can see the accounts (under AllowAccounts
) that have access to this queue
(which may display ALL
depending on the queue and the system).
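Since the full output can be lengthy, it can be handy to filter it. For example, to check just which accounts are allowed on a queue, we could pipe the output through grep (a quick sketch using the COSMA queue above):

scontrol show partition=cosma7 | grep AllowAccounts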
To see a complete breakdown of all queues you can use:
scontrol show partitions
Other Parameters
Depending on your site and how your allocation is configured you may need to specify other parameters in your script.
On Tursa, for example, you may need to specify a --qos parameter in the script, which stands for quality of service and is used to constrain or modify the characteristics of a job. On other sites, a default --qos is already applied and doesn't need to be explicitly supplied.
So for example on Tursa, we can use scontrol show partition to display the allowed QoS values for a particular queue, e.g.:
scontrol show partition=cpu
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=sysadm,standard,high,short,debug,low,reservation
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
Nodes=tu-c0r0n[66-71]
...
We can see under AllowQos
those that are permitted. So for example using the cpu
queue on Tursa, in the batch script you may need to add a line containing #SBATCH --qos=standard
(e.g. below the other #SBATCH
directives) for jobs to work - make a note of this!
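As a sketch, a minimal job script for Tursa's cpu queue might then look like the following (yourAccount is a placeholder for your own project account code):

#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=cpu
#SBATCH --qos=standard
#SBATCH --nodes=1
#SBATCH --time=00:00:30

date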
Submitting a Job!
Let's first check that we can submit a job to Slurm so we can verify that we have a working set of Slurm parameters. As mentioned, these will vary depending on your circumstances, such as the projects to which you have access on DiRAC, and the DiRAC site to which you are submitting jobs.
Once you've determined these, edit basic-script.sh (as shown above), substitute the correct --account and --partition values, add any additional parameters needed for your site (e.g. --qos), and then save the file. Next, submit that script using sbatch basic-script.sh. It may take some trial and error to find the correct parameters! Once you've successfully submitted the job, you should have a job identifier returned in the terminal (something like 309001).
Lastly, you can use that job identifier to query the status of your job until it's completed using squeue -j <jobid> (e.g. squeue -j 309001). Once the job is complete, you can read the job's log file, typically held in a file named slurm-<jobid>.out, which shows any printed output from the job and, depending on the HPC system, perhaps other information regarding how and where the job ran.
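For reference, the overall sequence of commands might look something like this (with <jobid> standing in for whatever job identifier sbatch returns):

sbatch basic-script.sh
squeue -j <jobid>
cat slurm-<jobid>.out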
DiRAC Project Allocations
Access to DiRAC’s resources is managed through the STFC’s independent Resource Allocation Committee (RAC), which provides access through allocations to researchers who request time on the facility. There are a number of mechanisms through which facility time can be requested:
- Call for full Proposals: the RAC puts out an annual invitation to the UK theory and modelling communities to apply for computational resources on the DiRAC HPC Facility, with applications taking the form of requests for scientific, technical, or Research Software Engineer (RSE) support time.
- Director’s Discretionary Award: from time to time the DiRAC Director invites the UK theory and modelling communities to apply for discretionary allocations of computational resources. Discretionary time can also be applied for if you find you will be using your code at a larger scale than was previously requested in a full call for proposals. Applications can be made at any time.
- Seedcorn Time: for researchers who would like to get a feel for HPC, test and benchmark codes, or see what DiRAC resources can do for you before making a full application, an application can be made for seedcorn time. Existing users may also apply for seedcorn allocations to enable code development/testing on a service that is not currently part of their project allocation. You can apply for Seedcorn Time at any time.
For more information regarding these options, and for online application forms and contact details for enquiries, see the DiRAC website.
Once the submission and its technical case have been approved, allocations are managed in 3-month chunks over the duration of the project, which may be over a period of years. Allocation usage is based primarily on core CPU hours or GPU hours. Following each 3-month allocation, project usage is reviewed and the allocation extended based on that review (for instance, any non-use of the allocation during that 3-month window is queried, with support provided to overcome any barriers to use). In addition, Research Software Engineering (RSE) support time may also be requested; RSEs can provide help with code optimisation, porting, re-factoring and performance analysis.
Key Points
A job scheduler ensures jobs are given the resources they need, and manages when and where jobs will run on an HPC resource.
Obtain your DiRAC account details to use for job submission from DiRAC’s SAFE website.
Use sinfo -s to see information on all queues on a Slurm system.
Use scontrol show partition to see more detail on particular queues.
Use sbatch and squeue to submit jobs and query their status.
Access to DiRAC's resources is managed through the STFC's independent Resource Allocation Committee (RAC).
Facility time may be requested through a number of mechanisms, namely in response to a Call for Proposals, the Director’s Discretionary Award, and Seedcorn Time.
The DiRAC website has further information on the methods to request access, as well as application forms.
Job Submission and Management
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How do I request specific resources to use for a job?
What is the life cycle of a job?
What can I do to specify how my job will run?
How can I find out the status of my running or completed jobs?
Objectives
Specify a job’s resource requirements in a Slurm batch script.
Describe the main states a job passes through from start to completion.
Cancel multiple jobs using a single command.
Describe how backfilling works and why it’s useful.
Obtain greater detail regarding the status of a completed job and its use of resources.
We’ve recapped the fundamentals of how we are able to use Slurm to submit a simple job and monitor it to completion, but in practice we’ll need more flexibility in specifying what resources we need for our jobs, how they should run, and how we manage them, so let’s look at that in more detail now.
Selecting Resources for Jobs
Selecting Available Resources
We saw earlier how to use sinfo -s
as a means to obtain a list of queues we are able to access,
but there’s also some additional useful information.
Of particular interest is the NODES
column, which gives us an overview of the state of these resources, and hence
allows us to select a queue with those resources that are sufficiently available for jobs we’d like to submit. For example, on Tursa you may see something like:
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
slurm* up 4-00:00:00 198/0/0/198 dnode[001-198]
devel up 2:00:00 198/2/0/200 dnode[001-200]
Looking at the NODES column, it indicates how many nodes are:
- A (active) - these are running jobs
- I (idle) - no jobs are running
- O (other) - these nodes are down, or otherwise unavailable
- T (total) - the total number of nodes
The NODELIST
is a summary of those nodes in a particular queue.
In this particular instance, we can see that 2 nodes are idle in the devel
queue,
so if that queue fits our needs (and we have access to it) we may decide to submit to that.
Specifying Job Resource Requirements
One thing that is absolutely critical when working on an HPC system is specifying the resources required to run a job, which allows the scheduler to find the right time and place to schedule our job. If you do not specify requirements (such as the amount of time you need), you will likely be stuck with your site’s default resources, which is probably not what you want.
When launching a job, we can specify the resources our job needs in our job script,
using #SBATCH <parameter>
. We’ve used this format before when indicating the account and queue to use.
You may have seen some of these parameters before, but let’s take a look at the most important
ones and how they relate to each other.
For these parameters, by default a task refers to a single CPU unless otherwise indicated.
- --nodes - the total number of machines or nodes to request
- --ntasks - the total number of CPU cores (across requested nodes) your job needs. Generally, this will be 1 unless you're running MPI jobs which are able to use multiple CPU cores, in which case it essentially specifies the number of MPI ranks to start across the nodes. For example, if --nodes=4 and --ntasks=8, we're requesting 4 nodes with each node having 2 CPU cores, since to satisfy the requirement of a total of 8 CPU cores over 4 nodes, each node would need 2 cores. So here, Slurm does this calculation for us. We'll see an example of using --ntasks with MPI in a later episode.
Being Specific
Write and submit a job script - perhaps adapted from our previous one - called multi-node-job.sh that requests a total of 2 nodes with 2 CPUs per node. Remember to include any site-specific #SBATCH parameters.
Solution
#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:00:30

echo "This script is running on ... "
hostname
sbatch multi-node-job.sh
...
squeue -u yourUsername
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6485195 cosma7-pa mul-job. yourUse+ R 0:05 2 m[7400-7401]
Here we can see that the job is using a total of two nodes as we'd like - m7400 and m7401.
Our example earlier with --nodes=4
and --ntasks=8
meant Slurm calculated that 2 cores would be needed on each of the four nodes (i.e. ntasks/nodes = 2 cores per node
). So the number of cores needed was implicit.
However, we can also specify the number of CPU cores per node explicitly using --ntasks-per-node
.
In this case, we use this with --nodes
and we don’t need to specify the total number of tasks with --ntasks
at all.
So using our above example with --nodes=4
, to get our desired total 8 CPU cores we’d specify --ntasks-per-node=2
.
Being Even More Specific
Write and submit a job script that uses --nodes and --ntasks-per-node to request a total of 2 nodes with 2 CPUs per node.
Solution
#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=00:00:30

echo "This script is running on ... "
hostname
Once submitted, using
squeue
we should see the same results as before, again using a total of two nodes.
We’ll be looking at how we can make full use of these parameters with MPI jobs later in this lesson.
Up until now, the number of tasks has been synonymous with the number of CPUs. However, we can also specify multiple CPUs per task by using --cpus-per-task - so if your job is multithreaded, for example it makes use of OpenMP, you can specify how many threads it needs using this parameter.
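For example, a minimal sketch of the resource request for a job running a single process with 4 OpenMP threads might look like this (assuming the rest of the script is as before; Slurm sets the SLURM_CPUS_PER_TASK environment variable to match the request):

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Tell OpenMP to use as many threads as the CPUs we requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK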
What about Parameters Named -A, -N, -J and others?
You may also see some other shorter parameter names. In Slurm, some
SBATCH
options also have shortened forms, but mean the same thing. For example:
Long form        Short form
--job-name       -J
--nodes          -N
--ntasks         -n
--cpus-per-task  -c
Not all parameters have a short form, for example --ntasks-per-node and --exclusive.
In practice you can use either, although using the longer form is more verbose and helps remove ambiguity, particularly for newer users.
Only Request What you Require
When specifying the resources your job will need, it's important not to ask for too much. Firstly, because any resources you request but don't use (e.g. CPUs, memory, GPUs) will be wasted and potentially cost more in terms of your account usage, but also because requesting larger resources will take longer to queue. It also means that other users' jobs that would have been a better fit for these resources may take longer to run. It's good to remember that we are part of a wider community of HPC users, and as with any shared resource, we should act responsibly when using it.
As we’ll see later, using the
sacct
command we’re able to find out what resources previous jobs actually used, and then optimise the resources we request in future job runs.
Managing Job Submissions
Monitoring a Job
As we’ve seen, we can check on our job’s status by using the command squeue
. Let’s take a look in more detail.
squeue -u yourUsername
You may find it looks like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5791510 cosma7-pa example- yourUser PD 0:00 1 (Priority)
So -u yourUsername
shows us all jobs associated with our machine account. We can also use -j
to query specific
job IDs, e.g.: squeue -j 5791510
which will, in this case, yield the same information since we only have that job
in the queue (if it hasn’t already completed!).
In either case, we can see all the details of our job, including the partition, user, and also the state of the job (in the ST
column).
In this case, we can see it is in the PD or PENDING state.
Typically, a successful job run will go through the following states:
- PD - pending: sometimes our jobs might need to wait in a queue first before they can be allocated to a node to run
- R - running: the job has an allocation and is currently running
- CG - completing: the job is in the process of completing
- CD - completed: the job is completed
For pending jobs, helpfully, you may see a reason for this in the NODELIST(REASON)
column;
for example, that the nodes required for the job are down or reserved for other jobs,
or that the job is queued behind a higher priority job that is making use of the requested resources.
Once it’s able to run, the nodes that have been allocated will be displayed in this column instead.
However, in terms of job states, there are a number of reasons why jobs may end due to a failure or other condition, including:
- OOM - out of memory: the job attempted to use more memory during execution than was available
- S - suspended: the job has an allocation, but it has been suspended to allow other jobs to use the resources
- CA - cancelled: the job was explicitly cancelled, either by the user or a system administrator, and may or may not have been started
- F - failed: the job has terminated with a non-zero exit code or has failed for another reason
You can get a full list of job status codes via the SLURM documentation.
Cancelling a Job
Sometimes we’ll make a mistake and need to cancel a job. This can be done with
the scancel
command. Let’s submit a job and then cancel it using
its job number (remember to change the walltime so that it runs long enough for
you to cancel it before it is killed!).
sbatch example-job.sh
squeue -u yourUsername
Submitted batch job 5791551
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5791551 cosma7-pa hello-wo yourUser PD 0:00 1 (Priority)
Now cancel the job with its job number (printed in your terminal). A clean return of your command prompt indicates that the request to cancel the job was successful.
scancel 5791551
# It might take a minute for the job to disappear from the queue...
squeue -u yourUsername
...(no output when there are no jobs to display)...
Cancelling multiple jobs
We can also cancel all of our jobs at once using the -u yourUsername option. This will delete all jobs for a specific user (in this case, yourself). Note that you can only delete your own jobs.
Try submitting multiple jobs and then cancelling them all.
Solution
First, submit a trio of jobs:
sbatch example-job.sh
sbatch example-job.sh
sbatch example-job.sh
Then, cancel them all:
scancel -u yourUsername
Backfilling
When developing code or testing configurations it is usually the case that you don’t need a lot of time. When that is true and the queues are busy, backfilling is a scheduling feature that proves really useful.
If there are idle nodes, that means they are either available to run jobs now, or they are being kept free so that a larger job can run on them in the future. The time between now and when those nodes will be needed is the backfill window, and jobs that need less than that time may be scheduled to run on those resources. To take best advantage of backfilling, specify a shorter (but still realistic) time period for jobs so they can fill these windows.
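For instance, if a test job genuinely only needs ten minutes, requesting just that rather than, say, a whole day makes it far more likely to fit into a backfill window - a sketch:

#SBATCH --nodes=1
#SBATCH --time=00:10:00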
You can check if a particular Slurm scheduler is configured to use the backfill technique (which is the default) by doing the following:
scontrol show config | grep SchedulerType
This use of the scontrol
command allows us to see the configuration of Slurm, and by piping its output (using |
) to
the grep
command, we can search for a particular configuration field called SchedulerType
which holds the method used
for scheduling. If a backfill scheduler is being used, you should see the following:
SchedulerType = sched/backfill
If the builtin scheduler is used instead, then jobs will be executed strictly in order of submission, regardless of whether there are free resources to run your job immediately.
DiRAC COSMA: Handy Backfill Tools
On DiRAC's COSMA you are able to see backfill availability windows for various queues (e.g. for COSMA5, COSMA7, and COSMA8) by using a particular backfill command. For example, for COSMA5:
c5backfill
# Backfill windows available for cosma queues:

No. hosts/slots: 1 / 16
Runtime: 31 hours 49 minutes 26 seconds
Hostnames: m5336

No. hosts/slots: 5 / 80
Runtime: 72 hours 00 minutes 00 seconds
Hostnames: m[5185,5315,5337,5366,5377]
So here, we can see that on COSMA5 we have only one node available for 31 hours and 49 minutes, and 5 nodes available for 72 hours.
Inspecting and Controlling the State of Jobs
So far we’ve used the squeue
command to check the general status of pending and running jobs. But we can also obtain
even more information using the scontrol
command regarding the job’s status, configuration and use of resources.
Let's create a new job script called specific-job.sh to test this, which uses a single node and a single CPU on that node to run a single basic task:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --time=00:00:30
#SBATCH --job-name=specific-job
sleep 25
echo "This script is running on ... "
hostname
Next, launch our new job:
sbatch specific-job.sh
…then make a note of the job ID returned, e.g.:
...
Submitted batch job 309281
We can then use this job ID to ask Slurm for more information about it:
scontrol show jobid=309281
On DiAL3, it would look something like this (although of course, on other sites it will differ in parts):
JobId=309281 JobName=specific-job
UserId=dc-crou1(210462) GroupId=dirac(59019) MCS_label=N/A
Priority=7603 Nice=0 Account=ds007 QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:25 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2024-08-01T11:57:18 EligibleTime=2024-08-01T11:57:18
AccrueTime=2024-08-01T11:57:18
StartTime=2024-08-01T11:57:19 EndTime=2024-08-01T11:57:44 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-01T11:57:19 Scheduler=Main
Partition=devel AllocNode:Sid=d3-login01:2159102
ReqNodeList=(null) ExcNodeList=(null)
NodeList=dnode036
BatchHost=dnode036
NumNodes=1 NumCPUs=128 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=128,node=1,billing=128
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/dirac3/home/dc-crou1/job-sched-testing/specific-job.sh
WorkDir=/lustre/dirac3/home/dc-crou1/job-sched-testing
StdErr=/lustre/dirac3/home/dc-crou1/job-sched-testing/slurm-309281.out
StdIn=/dev/null
StdOut=/lustre/dirac3/home/dc-crou1/job-sched-testing/slurm-309281.out
Power=
In particular, we can see:
- The JobState, which here is COMPLETED since we weren't fast enough running the scontrol command before the job finished; whilst the job is running it shows RUNNING, or PENDING if it's still waiting to be assigned to a node
- How long the job ran for (RunTime), and the job's maximum specified duration (TimeLimit)
- The job's SubmitTime, as well as the job's StartTime for execution: this may be the actual start time, or the expected start time if set in the future. The expected EndTime is also specified, although if it isn't specified directly in the job script this isn't always exactly StartTime + specified duration; it's often rounded up, perhaps to the nearest minute
- The queue assigned to the job is the devel queue, and the job ran on the dnode036 node
- The resources assigned to the job are a single node (NumNodes=1) with 128 CPU cores, for a single task with 1 CPU core per task. Note that in this case we got more resources in terms of CPUs than we asked for: we actually obtained a node with 128 CPUs (although we won't use them)
- We didn't specify a working directory within which to execute the job, so the WorkDir defaults to the directory from which the job was submitted
- The error and output file locations, as specified by StdErr and StdOut
You can find more information on scontrol
and the various fields in the Slurm documentation.
Note that typing scontrol on its own will enter an interactive state with an scontrol prompt which allows you to enter subcommands (like show or update) directly to the scontrol command. You can exit by typing quit.
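As a rough sketch (the prompt and job ID shown are illustrative), such an interactive session might look like:

scontrol
scontrol: show job 309281
scontrol: quit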
Let’s Wait a Bit…
It may be useful to defer the scheduling of a job until some point in the future. We can specify a delayed start time (or even a delayed start date) by using the --begin parameter in our job script. For example, adding #SBATCH --begin=16:00 will delay starting the job until 16:00 (assuming there is an appropriate node available at that time). We can also specify a relative delay time, for example passing now+1hour instead.
Launch the job again, but this time specify a start time of 1 minute in the future in the job script. How does the scontrol output change for a delayed job?
Solution
First add #SBATCH --begin=now+1minute to the parameters in the job script, and relaunch using sbatch.
You should see in the job's scontrol output:
- JobState=PENDING, RunTime=00:00:00, and AccrueTime is Unknown, since it's waiting to run and no runtime has yet elapsed
- EligibleTime will specify the calculated start time one minute into the future
Querying Job Resources
For more detail about jobs and the resources, there are two key Slurm commands we can use to give us more information about jobs
that are either actively running (using sstat
) or completed (using sacct
).
Querying Running Jobs
First, submit our previous job script specific-job.sh
using sbatch
,
then check when it’s actually running using squeue
(since sstat
will only work on actively running jobs),
and then when we know it’s running we can use sstat
to query which resources it’s using.
Note that it’s probably a good idea to increase the 25
value for sleep
in the job script
(along with the #SBATCH --time
parameter) so you have more time to query the job before it completes!
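As a sketch, the relevant lines in specific-job.sh could be adjusted to something like the following (the values are arbitrary; they simply give you a few minutes in which to run sstat):

#SBATCH --time=00:05:00

sleep 240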
For example, following submission via sbatch
:
squeue -j 6803898
Wait until it reaches the running state designated by R
:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6803898 cosma7-pa hello-wo yourUser R 0:00 1 (Priority)
Since we now know it’s running, we may query its active resource usage to date.
sstat -j 6803898
The first line of output is presented as a series of columns representing the various properties - or fields - associated with the job, but unfortunately you’ll notice that with so many columns it renders the output quite unreadable! Fortunately we can select which columns we wish to see, e.g.:
sstat -j 6803898 --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
So here, we’ve selected a number of field columns we’re interested in:
- JobID - the ID of the job, useful if we happen to specify multiple jobs to query, e.g. -j 6803898,6803899,6803900
- AveCPU - the average CPU time of the job
- NTasks - the total number of tasks in the job
- MinCPU - the minimum CPU time used in the job
- MaxDiskWrite - the maximum number of bytes written by the job
But looking at the output, you'll notice that no information is returned other than the field headers:
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------ ---------- -------- ---------- ------------
This is because, by default, all jobs are actually made up of a number of job steps, and batch scripts running by themselves aren’t accounted for.
A job that consists of only a batch script, like this one, typically only has two steps:
- the batch step, which is created for all jobs submitted with sbatch
- the extern step, which accounts for resources used by a job outside of aspects managed by Slurm (which includes special cases such as SSH-ing into the compute node, which may be allowed on some systems and would be accounted for here)
A job may also have additional steps,
for example with MPI jobs there will be extra processes launched which are accounted for as separate steps,
and will be listed separately.
In our case, all processing is accounted for in the batch
step, so we need to query that directly,
since by default, sstat
doesn’t include this.
We can reference the batch step directly using:
sstat -j 6803898.batch --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
And now we should see something like the following (you may need to submit it again):
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------ ---------- -------- ---------- ------------
6803898.batch 00:00:00 1 00:00:00 66
In this trivial case, so far we can see that we’ve used virtually no actual CPU, and the batch submission script has written a total of 66 bytes.
There’s also the --allsteps
flag to see a summary of all
steps, although note that this isn’t always supported on all Slurm systems:
sstat --allsteps -j 6803898 --format=JobID,AveCPU,NTasks,MinCPU,MaxDiskWrite
So in our batch case, we’d see two steps:
JobID AveCPU NTasks MinCPU MaxDiskWrite
------------- ---------- -------- ---------- ------------
6803898.exte+ 00:00:00 1 00:00:00 1.00M
6803898.batch 00:00:00 1 00:00:00 66
What About other Format Fields?
We’ve looked at some example fields to include in the output format, but there are many others that you may wish to consider, which you can find in the Slurm documentation.
Another way to get help on the field formats is to use
sstat --helpformat
, which handily gives you a list of all fields you can use.
Querying Completed Jobs
Once a job is complete, you'll notice that sstat no longer presents resource usage, since it only operates on active jobs. For jobs that have completed, we use sacct, which displays accounting data for Slurm jobs.
We can first use sacct on its own to list the jobs we've run today:

sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6803854 bash cosma7-pa+ yourAccou+ 28 TIMEOUT 0:0
6803854.ext+ extern yourAccou+ 28 COMPLETED 0:0
6803854.0 bash yourAccou+ 28 COMPLETED 0:0
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6803898.bat+ batch yourAccou+ 28 COMPLETED 0:0
6803898.ext+ extern yourAccou+ 28 COMPLETED 0:0
6804261_1 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_1.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_1.e+ extern yourAccou+ 28 COMPLETED 0:0
6804261_2 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_2.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_2.e+ extern yourAccou+ 28 COMPLETED 0:0
6804261_3 hello_wor+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6804261_3.b+ batch yourAccou+ 28 COMPLETED 0:0
6804261_3.e+ extern yourAccou+ 28 COMPLETED 0:0
Or, for a specific job:
sacct -j 6803898
Which will display accounting information for a given job, including its constituent steps
(and we don’t need to use --allsteps
to collate this information).
By default, we’ll see something like the following:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 0:0
6803898.bat+ batch yourAccou+ 28 COMPLETED 0:0
6803898.ext+ extern yourAccou+ 28 COMPLETED 0:0
As with sstat
, we are also able to customise the fields we wish to see, e.g.:
sacct -j 6803898 --format=JobID,JobName,Partition,Account,AllocCPUS,State,Elapsed,CPUTime
JobID JobName Partition Account AllocCPUS State Elapsed CPUTime
------------ ---------- ---------- ---------- ---------- ---------- ---------- ----------
6803898 specific-+ cosma7-pa+ yourAccou+ 28 COMPLETED 00:00:25 00:11:40
6803898.bat+ batch yourAccou+ 28 COMPLETED 00:00:25 00:11:40
6803898.ext+ extern yourAccou+ 28 COMPLETED 00:00:25 00:11:40
As with sstat
, you can add many other fields too,
although note the accounting data presented for these will be different depending on the HPC system’s configuration and the job’s type.
CPUTime here is equal to AllocCPUS * Elapsed, i.e. the number of CPUs allocated to the job multiplied by the total elapsed time; in this example, 28 CPUs multiplied by 25 seconds of elapsed time gives 700 seconds, which is the 00:11:40 shown.
So Much to Remember!
Slurm has a great many commands - as well as command arguments - and it can prove difficult to remember them all. It's often helpful to make a note of those you commonly use, or make use of a pre-made reference/cheat sheet such as this one, which has a comprehensive list of commands and their arguments with helpful short descriptions.
Key Points
Use --nodes and --ntasks in a Slurm batch script to request the total number of machines and CPU cores you'd like for your job.
A typical job passes through pending, running, completing, and completed job states.
Reasons for a job's failure include running out of memory, being suspended, being cancelled, or exiting with a non-zero exit code.
Determining backfill windows allows us to work out when we may make use of idle resources.
Use scontrol to present detailed information concerning a submitted job.
Use sstat to query the used resources for an actively running job.
Use sacct to query the resource accounting information for a completed job.
Accessing System Resources using Modules
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I make use of HPC system resources such as compilers, libraries, and other tools?
What are HPC modules, and how do I use them?
Objectives
Load and use a software package.
Explain how the shell environment changes when the module mechanism loads or unloads packages.
Use modules in a job script.
On a high-performance computing system, it is seldom the case that the software we want to use - things like compilers and libraries - is available when we log in. It is installed, but we will need to “load” it before it can run.
Before we start using individual software packages, however, we should understand the reasoning behind this approach. The three biggest factors are software incompatibilities, versioning, and dependencies.
Software incompatibility is a major headache for programmers. Sometimes the
presence (or absence) of a software package will break others that depend on
it. Two of the most famous examples are Python 2 and 3 and C compiler versions.
Python 3 famously provides a python
command that conflicts with that provided
by Python 2. Software compiled against a newer version of the C libraries and
then used when they are not present will result in a nasty 'GLIBCXX_3.4.20'
not found
error, for instance.
Software versioning is another common issue. A team might depend on a certain package version for their research project - if the software version were to change (for instance, if a package was updated), it might affect their results. Having access to multiple software versions allows a set of researchers to prevent software versioning issues from affecting their results.
Dependencies are where a particular software package (or even a particular version) depends on having access to another software package (or even a particular version of another software package). For example, the VASP materials science software may depend on having a particular version of the FFTW (Fastest Fourier Transform in the West) software library available for it to work.
Environment Modules
Environment modules are the solution to these problems. A module is a self-contained description of a software package – it contains the settings required to run a software package and, usually, encodes required dependencies on other software packages.
There are a number of different environment module implementations commonly
used on HPC systems: the two most common are TCL modules and Lmod. Both of
these use similar syntax and the concepts are the same so learning to use one
will allow you to use whichever is installed on the system you are using. In
both implementations the module
command is used to interact with environment
modules. An additional subcommand is usually added to the command to specify
what you want to do. For a list of subcommands you can use module -h
or
module help
. As for all commands, you can access the full help on the man
pages with man module
.
On login you may start out with a default set of modules loaded or you may start out with an empty environment; this depends on the setup of the system you are using.
Listing Available Modules
To see available software modules, use module avail
.
module avail
On COSMA, it looks something like the following, although your site output will differ:
--------------------- /cosma/local/Modules/modulefiles/mpi ---------------------
hpcx-mt/2.2 intel_mpi/2020 openmpi/4.0.3
intel_mpi/2017 intel_mpi/2020-update1 openmpi/4.0.5
intel_mpi/2018 intel_mpi/2020-update2 openmpi/4.1.1
intel_mpi/2019 mvapich2_mpi/2.3.6 openmpi/4.1.1.no-ucx
intel_mpi/2019-update1 mvapich2_mpi/2.3.6-debug openmpi/4.1.4
intel_mpi/2019-update2 mvapich2_mpi/2.3.7-1 openmpi/4.1.4-romio-lustre
intel_mpi/2019-update3 openmpi/3.0.1(default) openmpi/20190429
intel_mpi/2019-update4 openmpi/4.0.1 rockport-settings
------------------ /cosma/local/Modules/modulefiles/compilers ------------------
aocc/1.3.0 intel_comp/2019-update2
aocc/2.0.0 intel_comp/2019-update3
aocc/2.2.0 intel_comp/2019-update4
...
What About Partial Matches?
A useful feature of module avail is that it also works on partial matches that begin with a given argument. For example, module avail x would display a shortened list of any modules beginning with x. This is handy if you need to search for a particular module but can't remember the full name, or would like a succinct list of all versions of a particular module.
Using module avail, how many versions of openmpi are on your HPC system?
Solution
Typing module avail openmpi on DiRAC's COSMA HPC resource, at the time of writing we get:

--------------------- /cosma/local/Modules/modulefiles/mpi ---------------------
openmpi/3.0.1(default)  openmpi/4.0.5         openmpi/4.1.4
openmpi/4.0.1           openmpi/4.1.1         openmpi/4.1.4-romio-lustre
openmpi/4.0.3           openmpi/4.1.1.no-ucx  openmpi/20190429
So, a total of 9 module versions of openmpi. On Tursa:

--------------------------------------- /home/y07/shared/tursa-modules ---------------------------------------
openmpi/4.1.5

------------------------------ /mnt/lustre/tursafs1/apps/cuda-12.3-modulefiles -------------------------------
openmpi/4.1.5-cuda12.3

----------------------------- /mnt/lustre/tursafs1/apps/cuda-11.4.1-modulefiles ------------------------------
openmpi/4.1.1-cuda11.4.1

----------------------------------- /mnt/lustre/tursafs1/apps/modulefiles ------------------------------------
openmpi/4.0.4 openmpi/4.1.1
On DiAL3:

------------------------------------------- /cm/shared/modulefiles -------------------------------------------
openmpi4/intel/4.0.5
And on CSD3, we have something like:

-------------------------------------- /usr/local/software/modulefiles ---------------------------------------
openmpi/3.1.4-gcc-7.2.0

----------------------------------- /usr/local/Cluster-Config/modulefiles ------------------------------------
openmpi-GDR/gnu/1.10.7_cuda-8.0 openmpi/gcc/9.2/4.0.1 openmpi/gcc/9.3/4.0.4
openmpi-GDR/gnu/2.1.1_cuda-8.0  openmpi/gcc/9.2/4.0.2 openmpi/pgi/3.0.0
Listing Currently Loaded Modules
You can use the module list
command to see which modules you currently have
loaded in your environment. If you have no modules loaded, you will see a
message telling you so.
module list
Depending on your system, you may find something like the following:
Currently Loaded Modulefiles:
1) cosma/2018 3) armforge/22.0.2 5) gadgetviewer/1.1.3
2) python/2.7.15(default) 4) hdfview/3.1.4 6) utils/201805
Depending on your site, you may find it returns a much shorter list, or perhaps No Modulefiles Currently Loaded.
More or Less Information?
Using the -l switch with module list will give you more information about the modules loaded; namely, any additional version information for each module loaded and the last date/time the module was modified on the system. Conversely, using the -t switch will give you the output in a terse format, as a simple list of modules one per line.
These switches also work with avail. Using the -l switch with this command, determine the date a particular version of a module (such as openmpi or Python) was modified.
Solution
For example, using module avail -l openmpi/4.1.4 on COSMA at the time of writing, we get:

- Package/Alias -----------------------.- Versions --------.- Last mod. -------
/cosma/local/Modules/modulefiles/mpi:
openmpi/4.1.4                                               2022/11/28 11:11:31
openmpi/4.1.4-romio-lustre                                  2022/09/14 10:48:34
Using module avail -l openmpi/pgi/3.0.0 on CSD3, we get:

- Package/Alias -----------------------.- Versions --------.- Last mod. -------
/usr/local/Cluster-Config/modulefiles:
openmpi/pgi/3.0.0                                           2018/05/17 14:25:11
Loading and Unloading Software
To gain or remove access to the typically numerous software modules we have available to us on an HPC system, we load or unload them.
Loading Software
To load a software module, we use module load
.
Whilst the DiRAC sites have some modules in common, there are many differences in what software modules are available and not all modules are available on all sites. So in this example, for simplicity whilst investigating module loading, we’ll load a different module depending on your site (so make a note of it!):
- Durham COSMA: julia
- Edinburgh Tursa: cmake
- Leicester DiAL3: ffmpeg
- Cambridge CSD3: bison
We won’t use or investigate any of the packages in any detail, but merely use them to demonstrate the use of modules. They’re handy for training purposes, since the module names equate to the commands used to run them. Note that some of these commands are actually available on multiple sites across DiRAC.
Initially, our module is not loaded. We can test this by using the which
command.
which
looks for programs the same way that Bash does, so we can use
it to tell us where a particular piece of software is stored.
So on DiAL3, we could do the following:
which ffmpeg
On your own site, substitute ffmpeg
with the module above for your site.
You’ll likely get something like the following, complaining that it can’t find the command within our environment:
/usr/bin/which: no ffmpeg in (/cm/local/apps/lua/5.4.0/bin:/home/dc-crou1/.local/bin:/home/dc-crou1/bin:/cm/shared/apps/hwloc/1.11.11/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
So we can now try to load our module with module load
,
so for DiAL3, for example:
module load ffmpeg
which ffmpeg
Which now shows us, in the case of DiAL3 and ffmpeg
:
/cm/shared/apps/ffmpeg/5.0.1/bin/ffmpeg
Why Not Specify the Version of the Module?
Note that for simplicity we aren't specifying the precise version of the module we want here. However, feel free to use module avail <module_name> to determine the versions available on your HPC system and then load a specific version if you wish, e.g. module load julia/1.9.1.
At some point or other, you will run into issues where only one particular version of some software will be suitable. Perhaps a key bugfix only happened in a certain version, or version X broke compatibility with a file format you use. In either of these example cases, it helps to be very specific about what software is loaded.
So, what just happened?
To understand the output, first we need to understand the nature of the $PATH
environment variable. $PATH
is a special environment variable that controls
where a UNIX system looks for software. Specifically $PATH
is a list of
directories (separated by :
) that the OS searches through for a command
before giving up and telling us it can’t find it. As with all environment
variables we can print it out using echo
.
echo $PATH
On COSMA (with the Julia module loaded) this looks like:
/cosma/local/julia/1.9.1:/cosma/local/matlab/R2020b/bin:/cosma/local/gadgetviewer/1.1.4/bin:/cosma/local/hdfview/HDFView/3.1.4/bin:/cosma/local/arm/forge/22.0.2/bin:/cosma/local/Python/2.7.15/bin:/cosma/local/bin:/usr/lib64/qt-3.3/bin:/cosma/local/Modules/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
You’ll notice a similarity to the output of the which
command. In this case,
there’s only one difference: the different directory at the beginning. When we
ran the module load
command, it added a directory to the beginning of our
$PATH
.
Let’s examine what’s there (your particular path will differ depending on your site and the command):
ls /cosma/local/julia/1.9.1
So for Julia’s directory location on COSMA, it looks like:
base contrib etc LICENSE.md README.md test VERSION
CITATION.bib CONTRIBUTING.md HISTORY.md Makefile src THIRDPARTY.md
CITATION.cff deps julia Make.inc stdlib usr
cli doc julia.spdx.json NEWS.md sysimage.mk usr-staging
Taking this to its conclusion, module load
will therefore add software to your $PATH
,
which is what is meant by loading software: we are essentially changing our command line environment
so we are able to make use of the software.
What About Loading Dependencies?
A special note on this - depending on which version of the module program is installed at your site, module load may also load required software dependencies as well, or make specific mention that other modules need to be loaded beforehand.
To demonstrate, on DiRAC's COSMA resource, let's assume we want to load a particular version of OpenMPI:
module load openmpi/4.1.4
In this case, at the time of writing we get the following:
A compiler must be chosen before loading the openmpi module.
Please load one of the following compiler modules:

aocc_comp/4.0.0
gnu_comp/11.1.0
gnu_comp/13.1.0
gnu_comp/9.3.0
intel_comp/2022.1.2
intel_comp/2022.3.0
So here, we need to explicitly load one of these compiler options before we are able to load OpenMPI, e.g. module load gnu_comp/13.1.0. Depending on your system and how it's configured, your mileage will differ!
How Loading Affects the Environment
Note that this module loading process happens principally through
the manipulation of environment variables like $PATH
. There
is usually little or no data transfer involved.
The module loading process manipulates other special environment variables as well, including variables that influence where the system looks for software libraries, and sometimes variables which tell commercial software packages where to find license servers.
The module command also restores these shell environment variables to their previous state when a module is unloaded.
If we need such detail, we are able to see the changes that would be made to our environment using module display
.
For example, on Tursa with cmake
:
module display cmake
-------------------------------------------------------------------
/home/y07/shared/tursa-modules/cmake/3.27.4:
conflict cmake
prepend-path PATH /home/y07/shared/utils/core/cmake/3.27.4/bin
prepend-path CPATH /home/y07/shared/utils/core/cmake/3.27.4/include
prepend-path LD_LIBRARY_PATH /home/y07/shared/utils/core/cmake/3.27.4/lib
prepend-path MANPATH /home/y07/shared/utils/core/cmake/3.27.4/man
-------------------------------------------------------------------
So here, we can see that loading version 3.27.4 of cmake
will add /home/y07/shared/utils/core/cmake/3.27.4/bin
to the start of our path.
We can also see that it adds /home/y07/shared/utils/core/cmake/3.27.4/man
to a variable called $MANPATH
,
which is a specific path that contains locations of additional software manual pages we can access.
Once cmake
is loaded, we are thus able to then use man cmake
to access its manual page,
which is really useful in general for seeing information about commands, their parameters, and how to use them.
Loading Multiple Versions of the Same Module?
You may ask: what if we load multiple versions of the same module? Depending on how your system is configured, this may be possible, e.g. on COSMA:

module load julia/1.9.1
module load julia/1.5.3

In some cases, you may encounter incompatibilities or dependency conflicts, particularly with underlying libraries. However, you may not see any error at all, which could give rise to confusion. One way around this would be to exit your current terminal session and reconnect to the HPC resource, which will reset your environment. But what about within the same login session? To remedy this, see the next section for how to unload modules.
Unloading Software
Conversely, we may wish to unload modules we have previously loaded. This is useful if we no longer need to use a module, or require another version of the module. In general, it’s always good practice to unload modules you aren’t currently using.
For example, assuming we already have Julia loaded, we can unload it using, e.g. on COSMA, with julia
:
module unload julia
Depending on your site, use module unload
with the module you loaded earlier.
Note we don’t have to specify the version number. Once unloaded, our environment no longer allows us to make use of the software until we load it again.
If we want to unload all modules in our environment, we can use the module purge command. But be aware that this will also remove any modules that are loaded automatically by default upon login.
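A quick way to see its effect (assuming nothing is re-loaded automatically in your session) is:

module purge
module list

After this, module list should report that no modules are currently loaded.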
Using Software Modules in Scripts
We’ve so far explored how to load modules within an interactive command line session, but if we want to make use of modules in our jobs we also need to load them in our job scripts so they are loaded on compute nodes when the job runs.
Create a job that shows the version of the software whose module you loaded earlier, e.g.:
- CSD3: bison --version
- DiAL3: ffmpeg -version (note it's only using one hyphen!)
- Tursa: cmake --version
- COSMA: julia --version
Remember, no software is loaded by default! Running a job is very similar to logging on to the system, therefore you should not assume a module loaded on the login node is loaded on a compute node.
Solution
In version-module.sh (again, replacing yourAccount and aPartition, but also replacing cmake with the command for your site if it isn't cmake):

#!/bin/bash -l
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:00:30

module load cmake

cmake --version
sbatch version-module.sh
Key Points
HPC systems use a module loading/unloading system to provide access to software.
To see the available modules on a system, we use module avail.
The software installed across the DiRAC sites can be different in terms of what's installed and the versions that are available.
module list will show us which modules we currently have loaded.
We use module load and module unload to grant and remove access to modules on the system.
We should only keep loaded those modules we actively wish to use, and try to avoid loading multiple versions of the same software.
Using Different Job Types
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What types of job can I run on HPC systems?
How do I run a job that uses a node exclusively?
How can I submit an OpenMP job that makes use of multiple threads within a single CPU?
How do I submit a job that uses Message Passing Interface (MPI) parallelisation?
How can I submit the same job many times with different inputs?
How can I interactively debug a running job?
Objectives
Use a popular compiler to compile a C program before testing and submitting it.
Use a compute node for a job exclusively.
Describe how an OpenMP job parallelises its computation.
Compile and submit an OpenMP job.
Describe how an MPI job parallelises its computation.
Compile and submit an MPI job.
Highlight the key differences between OpenMP and MPI jobs.
Define and submit an array job to execute multiple tasks within a single job.
Use an interactive job to run commands remotely on a compute node.
So far we’ve learned about the overall process and the necessary “scaffolding” around job submission; using various parameters to configure a job to use resources, making use of software installed on compute nodes by loading and unloading modules, submitting a job, and monitoring it until (hopefully successful) completion. Using what we’ve seen so far, let’s take this further and look at some key types of job that we can run on HPC systems to take advantage of various types of parallelisation, using examples written in the C programming language. We’ll begin with a simple serial hello world example, and briefly explore various ways that code is parallelised and run to make best use of such systems.
Serial
With a serial job, we run a single job on one node within a single process. Essentially, this is very similar to running some code via the command line on your local machine. Let's take a look at a simple example written in C (the full code can also be found in hello_world_serial.c).
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
  printf("Hello world!\n");
}
After copying this into a file called hello_world_serial.c
, we can then compile and run it, e.g.:
gcc hello_world_serial.c -o hello_world_serial
./hello_world_serial
Depending on your system, you may need to preload a module to compile C (perhaps either using cc
or gcc
).
You should then see Hello world!
printed to the terminal.
Be Kind to the Login Nodes
It’s worth remembering that the login node is often very busy managing lots of users logged in, creating and editing files and compiling software, and submitting jobs. As such, although running quick jobs directly on a login node is ok, for example to compile and quickly test some code, it’s not intended for running computationally intensive jobs and these should always be submitted for execution on a compute node.
The login node is shared with all other users and your actions could cause issues for other people, so think carefully about the potential implications of issuing commands that may use large amounts of resource.
Now, given this is a very simple serial job, we might write the following hw_serial.sh
Slurm script to execute it:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_serial
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1M
./hello_world_serial
Note here we are careful to specify only what resources we think we need. In this case, a single node running a single process and very little memory (in fact, very likely a great deal less memory than that).
As before, we can submit this using sbatch hw_serial.sh, monitor it with squeue -u yourUsername until completion, and then look at the slurm-<job_number>.out file to see the Hello world! output.
Making Exclusive use of a Node
We can use #SBATCH --exclusive to indicate we'd like exclusive access to the nodes we request, such that they are shared with no other jobs, regardless of how many CPUs we actually need. If we are running jobs that require particularly large amounts of memory, CPUs, or disk access, this may be important. However, as you might suspect, requesting exclusive use of a node may mean it takes some time to be allocated a whole node in which to run. Plus, as a responsible user, be careful to ensure you only request exclusive access to a node when your job needs it!
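For example, a minimal sketch of how our earlier serial script might look with exclusive access requested (the account and partition values are placeholders, as before):
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_exclusive
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Request the whole node for this job, so no other jobs are scheduled alongside it
#SBATCH --exclusive
./hello_world_serial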
Multi-threaded via OpenMP
OpenMP allows programmers to identify and parallelize sections of code, enabling multiple threads to execute them concurrently. This concurrency is achieved using a shared memory model, where all threads can access a common memory space and communicate through shared variables.
So with OpenMP, think of your program as a team with a leader (the master thread) and workers (the worker threads). When your program starts, the leader thread takes charge. It identifies parts of the code that can be done at the same time and marks them; these marked parts are like tasks to be completed by the workers. The leader then gathers a group of worker threads, and each worker tackles one of these marked tasks, working independently. Then, once all the workers are done, they come back to the leader, and the leader continues with the rest of the program.
Let’s look at a parallel version of hello world, which launches a number of threads. You can find the code below in hello_world_omp.c.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
int num_threads, t;
int *results;
// Use an OpenMP library function to get the number of threads
num_threads = omp_get_max_threads();
// Create a buffer to hold the integer results from each thread
results = malloc(sizeof(*results) * num_threads);
// In parallel, within each thread available, store the thread's
// number in our shared results buffer
#pragma omp parallel shared(results)
{
int my_thread = omp_get_thread_num();
results[my_thread] = my_thread;
}
for (t = 0; t < num_threads; t++)
{
printf("Hello world thread number received from thread %d\n", t);
}
}
OpenMP makes use of compiler directives to indicate which sections we wish to run across parallel worker threads within a single node. Compiler directives are special comments that are picked up by the C compiler and tell the compiler to behave a certain way with the code being compiled.
How Does it Work?
In this example we use the #pragma omp parallel OpenMP compiler directive around a portion of the code, so each worker thread will run this in parallel. The number of threads that will run is set by the system and obtained using omp_get_max_threads().
We also need to be clear about how variables behave in parallel sections, in particular to what extent they are shared between threads or private to each thread. Here, we indicate that the results array is shared and accessible across all threads within this parallel code portion, since in this case we want each worker thread to add its thread number to our shared array.
Once this parallelised section's worker threads are complete, the program resumes serial, single-threaded execution within the master thread, and outputs the results array containing all the worker thread numbers.
Now before we compile and test it, we need to indicate how many threads we wish to run. This is specified in a special environment variable, OMP_NUM_THREADS, which is picked up by the program at runtime, so we'll set that first:
export OMP_NUM_THREADS=3
gcc hello_world_omp.c -o hello_world_omp -fopenmp
./hello_world_omp
And we should see the following:
Hello world thread number received from thread 0
Hello world thread number received from thread 1
Hello world thread number received from thread 2
If we wish to submit this as a job to Slurm, we also need to write a submission script that reflects the fact that this is an OpenMP job, so let's put the following in a file called hw_omp.sh:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_omp
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem=50K
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./hello_world_omp
So here we’re requesting a single node (--nodes=1
) running a single process (--ntasks=1
),
and that we’ll need three CPU cores (--cpus-per-task=3
) - each of which will run a single thread.
Next, we need to set OMP_NUM_THREADS
as before,
but here we set it to a special Slurm environment variable (SLURM_CPUS_PER_TASK
) that is set by Slurm to hold
the --cpus-per-task
we originally requested (3
).
We could have simply set this to three, but this method ensures that the number of threads that will run will match whatever we requested.
So if we change this request value in the future, we only need to change it in one place.
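As an aside, if you also want to run the same export line outside of a Slurm job (where SLURM_CPUS_PER_TASK is not set), one possible approach is to fall back to a default value using Bash's default-expansion syntax:
# Use the Slurm-provided value if present, otherwise default to a single thread
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}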
If we submit this script using sbatch hw_omp.sh
, you should see something like the following in the Slurm output file:
Hello world thread number received from thread 0
Hello world thread number received from thread 1
Hello world thread number received from thread 2
Multi-process via Message Passing Interface (MPI)
Our previous example used multiple threads (i.e. parallel execution within a single process). Let’s take this parallelisation one step further to the level of a process, where we run separate processes in parallel as opposed to threads.
At this level, things become more complicated! With OpenMP we had the option to maintain access to variables across our threads, but between processes, memory isn’t shared, so if we want to share information between these processes we need another way to do it. MPI uses a distributed memory model, so communication is done via sending and receiving messages between processes.
Now despite this inter-process communication being a greater overhead, in general our master/worker model still holds. In MPI, from the outset, when an MPI-enabled program is run, we have a number of processes executing in parallel. Each of these processes is referred to as a rank, and one of these ranks (typically rank zero) is a coordinating, or master, rank.
So how does this look in a program? You can find the code below in hello_world_mpi.c.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char** argv) {
int my_rank, n_ranks;
int *resultbuf;
int r;
MPI_Init(&argc, &argv);
// Obtain the rank identifier for this process, and the total number of ranks
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
// Create buffer to hold rank numbers received from all ranks
// This will include the coordinating rank (typically rank 0),
// which also does the receiving
resultbuf = malloc(sizeof(*resultbuf) * n_ranks);
// All ranks send their rank identifier to rank 0
MPI_Gather(&my_rank, 1, MPI_INT, resultbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
// If we're the coordinating rank (typically designated as rank 0),
// then print out the rank numbers received
if (my_rank == 0) {
for (r = 0; r < n_ranks; r++) {
printf("Hello world rank number received from %d\n", resultbuf[r]);
}
}
MPI_Finalize();
}
This program is a fair bit more complex than the OpenMP one, since here we need to explicitly coordinate the sending and receiving of messages and do some housekeeping for MPI itself, such as setting up MPI and shutting it down.
How Does it Work?
After initialising MPI, in a similar vein to how we obtained the number of threads and each thread's identity with OpenMP, we obtain the total number of ranks (processes) and our own rank number. In this example, for simplicity, we use a single MPI function, MPI_Gather, to send the rank numbers from each separate process to the coordinating rank. Essentially, each rank sends my_rank (as an MPI_INT, basically an integer) to rank 0, which receives all responses, including its own, within resultbuf. Finally, if the rank is the coordinating rank, the results are output. The if (my_rank == 0) condition is important, since without it all ranks would attempt to print the results, because with MPI, typically all processes run the entire program.
Let's compile this now. First, we may need to load some modules to provide us with the correct compiler and an implementation of MPI, so we can compile our MPI code.
On DiRAC’s COSMA, this looks like:
module load gnu_comp
module load openmpi
We can also load specific versions if we wish:
module load gnu_comp/13.1.0
module load openmpi/4.1.4
Note that on many sites there are often a number of compiler and MPI implementation options,
but for the purposes of this training we’ll use openmpi
with a readily available C compiler
(or what is provided by the system by default).
Other DiRAC Sites?
On Cambridge’s CSD3 and Edinburgh’s Tursa:
module load openmpi
On Leicester’s DiAL3 we need to do something more specific, such as:
module load gcc/10.3.0/picedk
module load openmpi/4.1.6/ol2kfe
Once we’ve loaded these modules we can compile this code:
mpicc hello_world_mpi.c -o hello_world_mpi
So note we need to use a specialised compiler, mpicc
, to compile this MPI code.
Now we’re able to run it, and specify how many separate processes we wish to run in parallel.
However, since this is a multi-processing job we should submit it via Slurm.
Let’s create a new submission script named hw_mpi.sh
that executes this MPI job:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_mpi
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --mem=1M
module load openmpi
mpiexec -n ${SLURM_NTASKS} ./hello_world_mpi
On this particular HPC setup on DiRAC’s COSMA, we need to load the OpenMPI module so we can run MPI jobs.
For your particular site, substitute the MPI module load
command you used previously.
Note also that we specify 3
to --ntasks
this time to reflect the number of processes we wish to run.
You’ll also see that with MPI programs we use mpiexec
to run them,
and specifically state the number of MPI processes we specified in --ntasks
by using the Slurm environment variable SLURM_NTASKS
,
so mpiexec
will use 3 processes in this case.
Efficient use of Resources
Take care to use --ntasks correctly when submitting non-MPI jobs. Setting it to a number greater than 1 will have the effect of running the job that many times, regardless of whether it uses MPI or not. The multiple runs will also overwrite, rather than append to, the job's output log file, so the fact that this has happened may not be obvious.
In the Slurm output file you should see something like:
Hello world rank number received from rank 0
Hello world rank number received from rank 1
Hello world rank number received from rank 2
Array
So we’ve seen how parallelisation can be achieved using threads and processes, but using a sophisticated job scheduler like Slurm, we are able to go a level higher using job arrays, where we specify how many separate jobs (as tasks) we want running in parallel instead.
One way we might do this is using a simple for loop within a Bash script to submit multiple jobs. For example, to submit three jobs, we could do:
for JOB_NUMBER in {1..3}; do
sbatch a-script.sh $JOB_NUMBER
done
In certain circumstances this may be a suitable approach, particularly if each task differs substantially, or you need explicit control over each subtask, such as changing the resource request parameters for each job.
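Note that a-script.sh itself isn't shown in this lesson; purely as a hypothetical sketch, such a script could pick up the job number passed on the sbatch command line via its first positional argument:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Arguments given after the script name to sbatch are passed to the script,
# so $1 holds the job number supplied by the loop
JOB_NUMBER=$1
echo "Running job number ${JOB_NUMBER}"
Each job could then use ${JOB_NUMBER} to select its own input file or parameters, much as the array job below uses its task identifier.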
However, job arrays have some unique advantages over this approach;
namely that job arrays are self-contained, so each array task is linked to the same job submission ID,
which means we can use sacct
and squeue
to query them as a whole submission.
Plus, we need no additional code to make it work, so it’s generally a simpler way to do it.
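For example, assuming the whole submission was given the job ID 6803105 (the ID used in the example further below), we could query all of its tasks together; the exact columns returned will depend on your site's configuration:
# Show any pending or running tasks belonging to the array job
squeue -j 6803105
# After completion, summarise the state and elapsed time of each task
sacct -j 6803105 --format=JobID,JobName,State,Elapsed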
To make use of a job array approach in Slurm, we add an additional --array
parameter to our submission script.
So let’s create a new hello_job_array.sh
script that uses it:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_job_array
#SBATCH --array=1-3
#SBATCH --output=output/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1M
echo "Task $SLURM_ARRAY_TASK_ID"
grep hello input/input_file_$SLURM_ARRAY_TASK_ID.txt
sleep 30
We’ve introduced a few new things in this script:
#SBATCH --array=1-3 - this job will create three array tasks, with task identifiers 1, 2, and 3.
#SBATCH --output=output/array_%A_%a.out - this explicitly specifies what we want our output file to be called and where it should be located, and will collect any output printed to the console from the job. %A will be replaced with the overall job submission id, and %a will be replaced by the array task id. So, assuming a job id of 123456, the first array task will have an output file called array_123456_1.out which will be stored in the output directory. Naming the output files in this way separates them by job id as well as array task id.
#SBATCH --error=err/array_%A_%a.err - similarly to --output, this will store any error output for this array task (i.e. messages output to the standard error) in the specified error file in the err directory.
$SLURM_ARRAY_TASK_ID is a shell environment variable that holds the number of the individual array task running.
grep hello input/input_file_$SLURM_ARRAY_TASK_ID.txt - here we use the grep command to search for the word "hello" within an input file with the filename input_file_$SLURM_ARRAY_TASK_ID.txt in the input directory, where $SLURM_ARRAY_TASK_ID will be replaced with the array task id. For example, for the first array task, the input file will be called input_file_1.txt. We've used grep as an example command, but this technique can be applied to any program that accepts inputs in this way.
Given the jobs are trivial and finish very quickly,
we’ve added a sleep 30
command at the end so each task takes an additional 30 seconds to run,
so that you should be able to see the array job in the queue before it disappears from the list.
Before we submit this job, we need to prepare some input and output directories and some input files for it to use.
mkdir input
mkdir output
mkdir err
In the input
directory, make some text files with the filenames input_file_1.txt
, input_file_2.txt
, and input_file_3.txt
with some text that somewhere includes hello
in it, e.g.
echo "hello there my friend" > input/input_file_1.txt
echo "hello world" > input/input_file_2.txt
echo "well hello, can you hear me?" > input/input_file_3.txt
Separating Input and Output Using Directories
A common technique for structuring where input and output should be located is to have them in separate directories. This is an established practice that ensures that input and output are kept separate for a computational process, and therefore cannot easily be confused. It can be tempting to just have all the files in a single directory, but when running multiple jobs with potentially multiple inputs and outputs, things can quickly become unmanageable!
If we now submit this job with sbatch
and then use squeue -j jobID
we should see something like the following,
a single entry but with [1-3]
in the JOBID
indicating the three subtasks as part of this job:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6803105_[1-3] cosma7-pa hello_jo yourUser PD 0:00 1 (Priority)
Once complete, we'll find three separate job output log files in the output directory: array_6803105_1.out, array_6803105_2.out, and array_6803105_3.out, each corresponding to a specific array task.
For example, for array_6803105_1.out we should see something like:
Task 1
hello there my friend
Canceling Array Jobs
We are able to completely cancel an array job and all its tasks using scancel jobID, as with a normal job. With the job above, scancel 6803105 would do this, as would scancel 6803105_[1-3], where we indicate explicitly all the tasks in the range 1-3. But we're able to cancel individual tasks as well using, for example, scancel 6803105_1 for the first task.
Submit the array job again and use scancel to cancel only the second and third tasks.
Solution
scancel 6803105_2
scancel 6803105_3
Or:
scancel 6803105_[2-3]
As with cancelling a normal job, the output log files for the tasks will still be produced, containing any output up until the point the tasks are cancelled.
Interactive
We’ve seen that we can use the login nodes on an HPC resource to test our code (in a small way) before we submit it. But sometimes when developing more complex code it would be useful to have access in some way to the compute nodes themselves, particularly to explore or debug an issue. For this we can use interactive jobs.
By reserving a compute node explicitly for this purpose, an interactive job will grant us an interactive session on a compute node that meets our job requirements, although of course, as with any job, this may not be granted immediately! Then, once the interactive session is running, we are able to enter commands and have their output visible on our terminal as if we had direct access to the compute node.
To submit a request for an interactive job where we wish to reserve a single node and two cores,
we can use Slurm’s srun
command:
srun --account=yourAccount --partition=aPartition --nodes=1 --ntasks-per-node=2 --time=00:10:00 --pty /bin/bash
So as well as the account/partition and the number of nodes and cores, we are requesting 10 minutes of interactive time (after which the interactive job will exit), and that the job will run a Bash shell on the node which we’ll use to interact with the job.
You should see the following, indicating the job is waiting, and then hopefully soon afterwards, that the job has been allocated a suitable node:
srun: job 5608962 queued and waiting for resources
srun: job 5608962 has been allocated resources
[yourUsername@m7443 ~]$
At this point our interactive job is running our Bash shell remotely on compute node m7443.
We can also verify that we are on a compute node by entering hostname
,
which will return the host name of the compute node on which the job is running.
At this point, we are able to use the module
, srun
and other commands
as they might appear within our job submission scripts:
[yourUsername@m7443 ~]$ srun --ntasks=2 ./hello_world_mpi
Hence, if this MPI code were faulty, as we encounter issues we have the opportunity to diagnose them in real time, fix them, and re-run our code to test it again.
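For instance, on a system like COSMA the fix, compile, and test cycle might look something like the following within the session (substituting the module load commands appropriate to your own site, as before):
[yourUsername@m7443 ~]$ module load gnu_comp openmpi
[yourUsername@m7443 ~]$ mpicc hello_world_mpi.c -o hello_world_mpi
[yourUsername@m7443 ~]$ srun --ntasks=2 ./hello_world_mpi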
When you wish to exit your session use exit
, or Ctrl-D
.
You can check that your session is completed using squeue -j
with the job ID as normal.
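For example, using the job ID allocated above:
squeue -j 5608962
If the interactive session has ended, the job will no longer appear in the queue listing.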
Interactive Sessions: Watch your Budget!
Importantly, note that whilst the interactive session is active your allocation is consuming budget, just as with a normal job, so be careful not to leave an interactive session idle!
Combined Power (with a Note of Caution)
For many job types it may also make sense to combine them together. As we've seen, we're able to run MPI jobs on a compute node over an interactive session, and one very powerful approach to parallelism involves using both multi-threading (OpenMP) and multi-processing (MPI) at the same time (known as hybrid OpenMP/MPI). It's also very possible to configure jobs to make use of job arrays with these approaches too.
As you may imagine, using multiple approaches offers tremendous flexibility and power to vastly scale what you are able to accomplish, although it’s worth remembering that these also have the tradeoff of consuming more resources, and thus more budget, at a commensurate rate. When running applications (or developing applications to run) on HPC resources it’s therefore strongly recommended to first start with small, simple jobs until you have high confidence their behaviour is correct.
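As a rough sketch only (the hello_world_hybrid program is hypothetical and would need to be written and compiled with both MPI and OpenMP support, and the account, partition, and resource numbers are placeholders), a hybrid submission script might combine the MPI and OpenMP parameters we've already seen:
#!/usr/bin/env bash
#SBATCH --account=yourAccount
#SBATCH --partition=aPartition
#SBATCH --job-name=hello_world_hybrid
#SBATCH --time=00:05:00
#SBATCH --nodes=1
# Two MPI ranks, each allocated three CPU cores for its OpenMP threads
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
module load openmpi
# Each MPI rank spawns as many OpenMP threads as the CPUs allocated per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
mpiexec -n ${SLURM_NTASKS} ./hello_world_hybrid
Exactly how ranks and threads should be distributed and pinned to cores varies between systems, so check your site's documentation before running hybrid jobs at scale.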
Key Points
The login node is shared with all other users, so be careful not to run computationally intensive programs on it.
OpenMP allows programmers to specify sections of code to execute in parallel threads within a single node, using compiler directives.
MPI programs make use of message-passing between multiple processes to coordinate communication.
OpenMP uses a shared memory model, whilst MPI uses a distributed memory model.
An array job is a self-contained set of multiple jobs - known as tasks - managed as a single job.
Interactive jobs allow us to interact directly with a compute node, so are very useful for debugging and exploring the running of code in real-time.
Interactive jobs consume resources while an interactive session is active, so we must be careful to use them efficiently.
Run and develop applications at a small scale first, before making use of powerful scaling techniques, to avoid potentially expensive consumption of resources.
Survey