Introduction to Parallelism
|
Processes do not share memory and can reside on the same or different computers.
Threads share memory and reside in a process on the same computer.
MPI is an example of multiprocess programming whereas OpenMP is an example of multithreaded programming.
Algorithms can have both parallelisable and non-parallelisable sections.
There are two major parallelisation paradigms: data parallelism and message passing.
MPI implements the Message Passing paradigm, and OpenMP implements data parallelism.
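To illustrate the data-parallelism side of this comparison, here is a minimal OpenMP sketch (an assumption for illustration, requiring a compiler with OpenMP support, e.g. `gcc -fopenmp`); the loop iterations are divided among threads that all share the same arrays, in contrast to MPI where each process holds its own data and exchanges messages.

```c
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* Data parallelism: the iterations of this loop are shared out among
       threads, all of which operate on the same arrays in shared memory. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
    }

    return 0;
}
```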
|
Introduction to the Message Passing Interface
|
The MPI standards define the syntax and semantics of a library of routines used for message passing.
By default, the order in which operations are run between parallel MPI processes is arbitrary.
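As a minimal sketch (assuming an MPI implementation such as Open MPI or MPICH, compiled with `mpicc` and launched with `mpirun`), the classic "hello world" shows each rank identifying itself, with the output lines appearing in an arbitrary order:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); /* total number of processes */

    /* The order in which these lines appear is arbitrary: it depends on
       how the MPI runtime schedules the processes. */
    printf("Hello from rank %d of %d\n", rank, num_ranks);

    MPI_Finalize();
    return 0;
}
```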
|
Communicating Data in MPI
|
Data is sent between ranks using “messages”
Messages can either block the program or be sent/received asynchronously
You need to know the exact amount of data you are sending in each message
|
Point-to-Point Communication
|
Use MPI_Send() and MPI_Recv() to send and receive data between ranks
Using MPI_Ssend() will always block the sending rank until the message is received
Using MPI_Send() may or may not block the sending rank until the message is received: it returns as soon as the send buffer is safe to reuse, which can be before delivery if the MPI library buffers the message
Using MPI_Recv() will always block the receiving rank until the message is received
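Putting these points together, here is a minimal sketch (assuming the program is launched with at least two ranks) in which rank 0 sends a single integer to rank 1:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* Blocking send of one int from rank 0 to rank 1 (tag 0) */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the matching message from rank 0 arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```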
|
Collective Communication
|
|
Non-blocking Communication
|
Non-blocking communication often leads to performance improvements compared to blocking communication
However, it is usually more difficult to use non-blocking communication
Most blocking communication operations have a non-blocking variant
We have to wait for a non-blocking communication to finish, using MPI_Wait() (or check that it has finished with MPI_Test()), before reusing or reading the buffers involved; otherwise the behaviour is undefined
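A minimal sketch of this pattern (assuming at least two ranks; only ranks 0 and 1 exchange data), using MPI_Isend()/MPI_Irecv() followed by MPI_Wait():

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {
        int other = 1 - rank;
        int send_value = rank, recv_value = -1;
        MPI_Request requests[2];

        /* Post the send and receive without blocking; both calls return
           immediately and the communication proceeds in the background. */
        MPI_Isend(&send_value, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &requests[0]);
        MPI_Irecv(&recv_value, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &requests[1]);

        /* ...useful work could overlap with the communication here... */

        /* The buffers must not be touched until the requests have completed */
        MPI_Wait(&requests[0], MPI_STATUS_IGNORE);
        MPI_Wait(&requests[1], MPI_STATUS_IGNORE);

        printf("Rank %d received %d\n", rank, recv_value);
    }

    MPI_Finalize();
    return 0;
}
```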
|
Derived Data Types
|
Any data being transferred should be a single contiguous block of memory
By defining derived data types, we can more easily send data which is not contiguous
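For example, a column of a row-major C matrix is not contiguous in memory. The sketch below (the matrix dimensions are arbitrary, chosen for illustration) uses MPI_Type_vector() to describe that strided layout once, so a whole column can be sent in a single message:

```c
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 6

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double matrix[ROWS][COLS];

    /* A column's elements are separated by a stride of COLS doubles.
       Describe that layout with a derived type instead of sending
       element by element. */
    MPI_Datatype column_type;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    if (rank == 0) {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                matrix[i][j] = i * COLS + j;
        /* Send the third column (index 2) of the matrix to rank 1 */
        MPI_Send(&matrix[0][2], 1, column_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&matrix[0][2], 1, column_type, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 1 received matrix[1][2] = %f\n", matrix[1][2]);
    }

    MPI_Type_free(&column_type);
    MPI_Finalize();
    return 0;
}
```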
|
Porting Serial Code to MPI
|
Start from a working serial code
Write a parallel implementation for each function or parallel region
Connect the parallel regions with a minimal amount of communication
Continuously compare the developing parallel code with the working serial code
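A minimal sketch of this workflow for a toy example, summing the integers 0..N-1: each rank takes a slice of the serial loop, a single reduction combines the partial results, and the output is compared against the known serial answer. The decomposition and the choice of reduction here are illustrative, not the only way to port such a loop.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    /* Serial version: for (int i = 0; i < N; i++) total += i;
       Parallel version: each rank handles its own slice of the loop. */
    int chunk = N / num_ranks;
    int start = rank * chunk;
    int end = (rank == num_ranks - 1) ? N : start + chunk; /* last rank takes the remainder */

    double partial = 0.0;
    for (int i = start; i < end; i++) {
        partial += i;
    }

    /* One reduction is the minimal communication needed to combine the
       partial results on rank 0. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Continuously compare against the working serial result N*(N-1)/2 */
        printf("Parallel total = %.0f, serial reference = %.0f\n",
               total, (double)N * (N - 1) / 2.0);
    }

    MPI_Finalize();
    return 0;
}
```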
|
Optimising MPI Applications
|
We can use Amdahl's Law to identify the theoretical limit on the speedup that parallelisation can achieve for a fixed problem size (see the formulas after this list)
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size
We can use Gustafson's Law to calculate the relative speedup, which takes increasing problem sizes into account
Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor
Use a profiler to profile code to understand its performance issues before optimising it
Test the code after optimisation to ensure its functional behaviour is still correct
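For reference, the usual statements of the two laws (standard results, not reproduced from this lesson), where $p$ is the parallelisable fraction of the work (measured on the parallel run in Gustafson's case) and $N$ is the number of processors:

$$ S_{\mathrm{Amdahl}}(N) = \frac{1}{(1 - p) + p/N} $$

$$ S_{\mathrm{Gustafson}}(N) = (1 - p) + pN $$

Amdahl's speedup is capped at $1/(1-p)$ no matter how many processors are used, whereas Gustafson's scaled speedup keeps growing because the problem size grows with $N$.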
|
Common Communication Patterns [Optional]
|
There are many common patterns for communicating data, and choosing the right one for a problem needs careful thought
It’s better to use collective operations, rather than implementing similar behaviour yourself
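As an illustration of the second point, the sketch below broadcasts a value first by hand with point-to-point calls and then with the equivalent MPI_Bcast(); in a real code you would use only one of the two, and the collective is both less code and free to use an efficient algorithm (such as a tree) internally.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, num_ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

    int value = (rank == 0) ? 42 : 0;

    /* Hand-rolled "broadcast": rank 0 sends to every other rank in turn */
    if (rank == 0) {
        for (int dest = 1; dest < num_ranks; dest++)
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* The collective equivalent: a single call on every rank */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```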
|
Advanced Data Communication [Optional]
|
Structures can be communicated more easily by using MPI_Type_create_struct() to create a derived type describing the structure
The functions MPI_Pack() and MPI_Unpack() can be used to manually create a contiguous memory block of data, to communicate complex and/or heterogeneous data structures
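A minimal sketch of the first approach for a hypothetical `struct particle` (the struct name and fields are illustrative, not taken from the lesson; run with at least two ranks):

```c
#include <mpi.h>
#include <stddef.h>  /* offsetof */
#include <stdio.h>

struct particle {
    int id;
    double position[3];
};

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Describe the struct layout: one int followed by three doubles,
       using offsetof so any compiler padding is accounted for. */
    int block_lengths[2] = {1, 3};
    MPI_Aint displacements[2] = {offsetof(struct particle, id),
                                 offsetof(struct particle, position)};
    MPI_Datatype block_types[2] = {MPI_INT, MPI_DOUBLE};

    MPI_Datatype particle_type;
    MPI_Type_create_struct(2, block_lengths, displacements, block_types,
                           &particle_type);
    MPI_Type_commit(&particle_type);

    struct particle p;
    if (rank == 0) {
        p.id = 7;
        p.position[0] = 1.0; p.position[1] = 2.0; p.position[2] = 3.0;
        MPI_Send(&p, 1, particle_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&p, 1, particle_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received particle %d at (%f, %f, %f)\n",
               p.id, p.position[0], p.position[1], p.position[2]);
    }

    MPI_Type_free(&particle_type);
    MPI_Finalize();
    return 0;
}
```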
|
Survey
|
|