Reference

Key Points

Lesson Schedule
Why Use a Cluster?
  • High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.

  • These HPC systems can be used to do work that would be either impossible or much slower on smaller systems.

  • The standard method of interacting with such systems is via a command line interface such as Bash.

Connecting to the remote HPC system
  • To connect to a remote HPC system using SSH and a password, run ssh yourUsername@remote.computer.address.

  • To connect to a remote HPC system using SSH and an SSH key, run ssh -i ~/.ssh/key_for_remote_computer yourUsername@remote.computer.address.

  • Protect your SSH keys by managing them carefully!

  • 2-factor authentication is a way to help ensure a user’s identity by requiring two forms of identity evidence.

Moving around and looking at things
  • Your current directory is referred to as the working directory.

  • To change directories, use cd.

  • To view files, use ls.

  • You can view help for a command with man command or command --help.

  • Hit tab to autocomplete whatever you’re currently typing.
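
As a minimal sketch of the navigation commands above (the directory and file names are made up for illustration, and everything happens in a temporary directory):

```shell
# Work in a throwaway directory so nothing in your real files is touched.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/project/data"
touch "$demo_dir/project/notes.txt"

cd "$demo_dir/project"   # change the working directory
pwd                      # print the path of the working directory
ls                       # list its contents (data and notes.txt)
ls -l data               # long listing of a subdirectory (empty here)
cd ..                    # move up one level
listing="$(ls)"          # capture the listing of the parent directory
echo "$listing"
```

Running man ls afterwards shows the manual page for ls, provided manual pages are installed on the system.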

Writing and reading files
  • There are many different text editors available on DiRAC.

  • Use nano to create or edit text files from a terminal.

  • Use cat file1 [file2 ...] to print the contents of one or more files to the terminal.

  • Use mv old dir to move a file or directory old to another directory dir.

  • Use mv old new to rename a file or directory old to a new name.

  • Use cp old new to copy a file under a new name or location.

  • Use cp old dir to copy a file old into a directory dir.

  • Use rm old to delete (remove) a file.

  • File extensions are entirely arbitrary on UNIX systems.

  • Use scp to transfer files from and to a remote DiRAC resource.

  • Use tar to de-archive and archive sets of numerous and/or large files.
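
The file commands above can be tried safely along these lines. This is a sketch only: the file names are hypothetical and the work is done in a temporary directory. scp is left out because it needs a real remote system to talk to.

```shell
# Work in a throwaway directory so no real files are affected.
work="$(mktemp -d)"
cd "$work"

printf 'first line\n' > draft.txt   # create a small text file
mkdir archive

cp draft.txt backup.txt             # cp old new: copy under a new name
mv draft.txt report.txt             # mv old new: rename
mv backup.txt archive               # mv old dir: move into a directory
cp report.txt archive               # cp old dir: copy into a directory
cat report.txt archive/backup.txt   # print both files to the terminal

tar -czf archive.tar.gz archive     # pack the directory into a compressed archive
rm archive/backup.txt               # delete a file (there is no undo)
tar -tzf archive.tar.gz             # list what the archive contains
mkdir restored
tar -xzf archive.tar.gz -C restored # unpack the archive into restored/
```

Because rm is permanent, keeping an archive or copy before deleting, as done here, is a reasonable habit.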

Wildcards and pipes
  • The * wildcard is a placeholder that matches zero or more characters in a file name pattern.

  • Redirect a command’s output to a file with >.

  • Commands can be chained with |, which passes the output of one command to the next as its input.
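
A short sketch of the wildcard, redirect and pipe working together, using made-up file names in a temporary directory:

```shell
# Work in a throwaway directory with a few empty example files.
work="$(mktemp -d)"
cd "$work"
touch sample_a.dat sample_b.dat notes.txt

ls *.dat                     # the wildcard matches sample_a.dat and sample_b.dat
ls *.dat > dat_files.txt     # > sends the listing to a file instead of the screen

# | chains commands: list the files, count the lines, strip padding spaces.
dat_count="$(ls *.dat | wc -l | tr -d ' ')"
echo "$dat_count"
```

The tr step is only there because some versions of wc pad their output with spaces.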

Scripts, variables, and loops
  • A shell script is just a list of bash commands in a text file.

  • To make a shell script file executable, run chmod +x script.sh.
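
As an illustration, the sketch below writes a small script to a file, makes it executable, and runs it. The script contents and the names passed to it are hypothetical.

```shell
# Work in a throwaway directory.
work="$(mktemp -d)"
cd "$work"

# Write a small script to script.sh; the quoted 'EOF' stops the
# shell from expanding $name and $@ while the file is being written.
cat > script.sh <<'EOF'
#!/usr/bin/env bash
# Greet each name given on the command line.
for name in "$@"; do
    echo "Hello, $name"
done
EOF

chmod +x script.sh                  # make the script executable
output="$(./script.sh Ada Grace)"   # run it with two arguments
echo "$output"
```

Here "$@" stands for all the arguments given to the script, and the for loop runs once per argument.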

Using Bash Scripts in Pipes
  • You can include your own Bash scripts in pipes.

  • A common and useful pattern in Bash shell is to run a program or script that generates potentially a lot of output, then use pipes to filter out what you’re really after.
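
One way this pattern might look is sketched below, assuming a hypothetical script only_warnings.sh that reads standard input and some made-up program output:

```shell
# Work in a throwaway directory.
work="$(mktemp -d)"
cd "$work"

# A script that can sit in the middle of a pipe: it reads lines from
# standard input and keeps only those that mention WARNING.
cat > only_warnings.sh <<'EOF'
#!/usr/bin/env bash
grep WARNING
EOF
chmod +x only_warnings.sh

# Stand-in for a program that prints a lot of output (made-up lines).
generate_output() {
    printf 'step 1 ok\nstep 2 WARNING residual high\n'
    printf 'step 3 ok\nstep 4 WARNING residual high\n'
}

# Generate the output, filter it with our script, then count what is left.
warnings="$(generate_output | ./only_warnings.sh | wc -l | tr -d ' ')"
echo "$warnings"
```

Any script that reads from standard input and writes to standard output can be dropped into a pipe in the same way.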

Understanding Code Scalability
  • To make efficient use of parallel computing resources, code needs to be scalable.

  • Before using new code on DiRAC, its strong and weak scalability profiles have to be measured.

  • Strong scaling is how the solution time varies with the number of processors for a fixed problem size.

  • Weak scaling is how the solution time varies with the number of processors when the problem size per processor is fixed.

  • Strong and weak scaling measurements provide good indications for how jobs should be configured to use resources.

  • Always profile your code to determine bottlenecks before attempting any non-trivial optimisations.
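
These two measurements are commonly quantified as follows. The notation is assumed here rather than taken from the lesson: T_1 is the run time on one processor and T_N is the run time on N processors.

```latex
% Strong scaling: the total problem size is fixed while N grows.
S_{\text{strong}}(N) = \frac{T_1}{T_N},
\qquad
E_{\text{strong}}(N) = \frac{T_1}{N \, T_N}

% Weak scaling: the problem size per processor is fixed, so the
% total problem size grows with N and ideally T_N stays equal to T_1.
E_{\text{weak}}(N) = \frac{T_1}{T_N}
```

With purely illustrative numbers, a strong scaling run with T_1 = 100 s and T_8 = 20 s gives a speedup of 5 and an efficiency of 100 / (8 × 20) = 0.625. A weak scaling run with T_1 = 100 s and T_8 = 125 s gives an efficiency of 0.8.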

Scalability Profiling
  • We can use Amdahl’s Law to understand the expected speedup of a parallelised program against multiple cores.

  • It’s often difficult to estimate the proportion of serial code in our programs, but a reformulation of Amdahl’s Law can give us this based on multiple runs across different numbers of cores.

  • Run timings for serial code can vary due to a number of factors such as overall system load and accessing shared resources such as bulk storage.

  • The Message Passing Interface (MPI) standard is a common way to parallelise code and is available on many platforms and HPC systems, including DiRAC.

  • When calculating a strong scaling profile, the additional benefit of adding cores decreases as the number of cores increases.

  • The limitation of strong scaling is the fixed problem size, and we can increase the problem size with the core count to obtain a weak scaling profile.
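
For reference, Amdahl’s Law and one common rearrangement of it are written out below. The symbols are assumptions made for this sketch: s is the serial fraction of the run time, N is the number of cores, and S(N) is the speedup on N cores. The lesson itself may use a different but equivalent form.

```latex
% Amdahl's Law: speedup on N cores for serial fraction s.
S(N) = \frac{1}{s + \dfrac{1 - s}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}

% Rearranged to estimate s from a measured speedup S(N) on N > 1 cores.
s = \frac{\dfrac{1}{S(N)} - \dfrac{1}{N}}{1 - \dfrac{1}{N}}
```

As a worked example with an illustrative s = 0.1: on 8 cores S = 1 / (0.1 + 0.9/8) ≈ 4.71, and on 16 cores S = 1 / (0.1 + 0.9/16) = 6.4. Doubling the core count gives well under double the speedup, and no number of cores can push it past 1/s = 10.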

Software Development Lifecycle
  • Software engineering takes a wider view of software development beyond programming (or coding).

  • Software you produce has inherent value.

  • Always assume your code will be read and used by others (including a future version of yourself).

  • Additionally, aim to make your software reusable by others.

  • Reproducibility is a cornerstone of science, so ensure your software-generated results are reproducible.

  • Following a process makes development predictable, can save time, and helps ensure each stage of development is given sufficient consideration before proceeding to the next.

  • Ensuring requirements are sufficiently captured is critical to the success of any project.

An Introduction to Python
  • We’ll be using Python for the following parts of the material, so here’s an introduction / refresher.

Functions and Classes
  • Functions allow us to decompose a problem down into smaller tasks.

  • Classes allow us to organise data which represents a distinct concept.

Programming Paradigms
  • A paradigm describes a way of structuring reasoning about code.

  • Different programming languages are suited to different paradigms.

  • Different paradigms are suited to solving different classes of problems.

  • Pure functions are functions with deterministic behaviour and no side effects.

  • Classes allow us to organise data into distinct concepts.

Best Practices in Writing Code
  • Source code is designed for humans, not machines.

  • Source code is read much more often than it is written.

  • Always assume that someone else will read your code at a later date, including yourself.

  • Good indentation greatly enhances code readability.

  • Name things like variables, functions, and modules to indicate purpose.

  • Good comments describe the reasons behind coding approaches as well as complex behaviour.

  • Community coding conventions help you create more readable software projects that are easier to contribute to.

  • Maintainable code is easier to understand, modify, extend, and fix.

  • Assume any piece of code you write will be reused.

  • Technical debt is incurred when quick solutions are prioritised over good solutions, but is paid off in the cost of maintaining the code.

  • Change the way you write code to make maintainability a key goal.

Test Strategy, Planning, and Running Tests
  • A test plan forms the foundation of any testing.

  • We should write tests to verify that functions generate expected output given a set of specific inputs.

  • The three main types of automated tests are unit tests, functional tests and regression tests.

  • We can use a unit testing framework like pytest to structure and simplify the writing of tests.

  • Testing program behaviour against both valid and invalid inputs is important and is known as data validation.

Development Tools
  • IDEs provide tools and features to help develop increasingly complex code.

  • Debuggers allow you to set breakpoints which pause running code so its state can be inspected.

  • A call stack is the chain of function calls that led to a certain point in the code and have not yet returned.

Reviewing Code
  • Code review is where at least one other person looks at parts of a codebase in order to improve its readability, understandability, quality and maintainability.

  • The first hour of code review matters the most.

Documenting Code
  • A huge contributor to the ability to reuse any software is documentation.

  • Even a short document that covers the basics for getting the software up and running goes a long way, and can be amended and added to later.

  • Documentation helps make your code reproducible.

  • By default, software code released without a licence conveys no rights for reuse.

  • Open source licences fall into two key categories: copyleft and permissive.

What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • GitHub needs an SSH key to allow access.

Using a Repository
  • git clone creates a local copy of a repository from a URL.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log lists the commits made to the local repository.
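
The tracking commands above fit together roughly as sketched below. To keep the example runnable offline it starts from a new empty repository made with git init rather than a clone, and the file name, user name and email address are placeholders.

```shell
# Create a throwaway repository (git init is used here instead of git clone).
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity, set for
git config user.email "user@example.com"   # this repository only

echo "initial notes" > notes.txt           # a new file in the working directory
git status --short                         # shows notes.txt as untracked
git add notes.txt                          # put it in the staging area
git commit --quiet -m "Add initial notes"  # record the staged content

echo "second line" >> notes.txt            # change the file
git add notes.txt
git commit --quiet -m "Add a second line to the notes"

git log --oneline                          # list the commits, newest first
commit_count="$(git rev-list --count HEAD)"
echo "$commit_count"
```

Each commit records only what was staged with git add, which is why the add step is repeated after the second change.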

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.
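
A small sketch of comparing and recovering versions, again in a throwaway repository with placeholder names and file contents:

```shell
# Create a throwaway repository with two versions of one file.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "version one" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

echo "version two" > notes.txt
git commit --quiet -am "Revise notes"      # -a stages tracked files that changed

git diff HEAD~1 HEAD -- notes.txt          # differences between the two commits
git checkout HEAD~1 -- notes.txt           # recover the older version of the file
recovered="$(cat notes.txt)"
echo "$recovered"
```

HEAD refers to the most recent commit and HEAD~1 to the one before it. After the checkout, the working copy of notes.txt holds the older content again.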

Remote Repositories
  • Git can easily synchronise your local repository with a remote one.

  • GitHub needs an SSH key to allow access.
