Part 1: Sustainable Code Development
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What tools are needed to collaborate on code development effectively?
Objectives
Provide an overview of all the different tools that will be used in this course.
The first section of the course is dedicated to setting up your environment for collaborative software development. In order to build working (research) software efficiently and to do it in collaboration with others rather than in isolation, you will have to get comfortable with using a number of different tools interchangeably as they’ll make your life a lot easier. There are many options when it comes to deciding on which software development tools to use for your daily tasks - we will use a few of them in this course that we believe make a difference. There are sometimes multiple tools for the job - we select one to use but mention alternatives too. As you get more comfortable with different tools and their alternatives, you will select the one that is right for you based on your personal preferences or based on what your collaborators are using.
Here is an overview of the tools we will be using.
Command Line & Virtual Development Environment
We will use the command line (also known as the command line shell/prompt/console) to run our code and interact with the version control tool Git and the software sharing platform GitHub. We will also use the command line tools `venv` and `pip` to set up a virtual development environment and isolate our software project from other projects we may work on.
Integrated Development Environment (IDE)
An IDE integrates a number of tools that we need to develop a software project that goes beyond a single script - including a smart code editor, a code compiler/interpreter, a debugger, etc. It will help you write well-formatted & readable code that conforms to code style guides (such as PEP8 for Python) more efficiently by giving relevant and intelligent suggestions for code completion and refactoring. IDEs often integrate command line console and version control tools - we teach them separately in this course as this knowledge can be ported to other programming languages and command line tools you may use in the future (but is applicable to the integrated versions too).
We will use PyCharm in this course - a free, open source IDE.
Git & GitHub
Git is a free and open source distributed version control system designed to save every change made to a (software) project, allowing others to collaborate and contribute. In this course, we use Git to version control our code in conjunction with GitHub for code backup and sharing. GitHub is one of the leading integrated products and social platforms for modern software development, monitoring and management - it will help us with version control, issue management, code review, code testing/Continuous Integration, and collaborative development.
Let’s get started with setting up our software development environment!
Key Points
In order to develop (write, test, debug, backup) code efficiently, you need to use a number of different tools.
When there is a choice of tools for a task you will have to decide which tool is right for you, which may be a matter of personal preference or what the community you belong to is using.
Introduction to a Software Project
Overview
Teaching: 20 min
Exercises: 10 min
Questions
What is the design architecture of a software project?
Why is splitting code into smaller functional units (modules) good when designing software?
Objectives
Use Git to obtain a working copy of our template software project from GitHub.
Inspect the structure and architecture of our software project.
Understand Model-View-Controller (MVC) architecture in software design and its use in our project.
Our Software Project
So, you have joined a software development team that has been working on the patient inflammation project developed in Python and stored on GitHub. The software project studies inflammation in patients who have been given a new treatment for arthritis and reuses the inflammation dataset from the novice Software Carpentry Python lesson. The dataset contains information for 60 patients, who had their inflammation levels recorded for 40 days (a snapshot of data is below).
The project analyses the data to study the effect of the new arthritis treatment by checking the inflammation records across all patients but it is not finished and contains some errors. You will be working on your own and in collaboration with others to fix and build on top of the existing code during the course.
To start with the development, we have to obtain a local copy of the project on your machine and inspect it. The first step is to create a copy of the software project repository from GitHub within your own GitHub account:
- Log into your GitHub account.
- Go to the template repository URL.
- Click the `Use this template` button towards the top right of the template repository's GitHub page to create a copy of the repository under your GitHub account (you will need to be signed into GitHub to see the `Use this template` button). Note that each participant is creating their own copy to work on. Also, we are not forking the repository but creating a copy (remember - you can have only one fork but can have multiple copies of a repository in GitHub).
- Make sure to select your personal account and set the name of the project to `python-intermediate-inflammation` (you can call it anything you like, but it may be easier for future group exercises if everyone uses the same name). Also set the new repository's visibility to 'Public' - so it can be seen by others and by third-party Continuous Integration (CI) services (to be covered later on in the course).
- Click the `Create repository from template` button and wait for GitHub to import the copy of the repository under your account.
- Locate the copied repository under your own GitHub account.
Obtain the Software Project Locally
Using the command line, clone the copied repository from your GitHub account into the home directory on your computer (to be consistent with the code examples and exercises in the course). Which command(s) would you use to get a detailed list of contents of the directory you have just cloned?
Solution
- Find the URL of the software project repository to clone from your GitHub account. Make sure you do not clone the original template repository but rather your own copy, as you should be able to push commits to it later on.
- Make sure you are located in your home directory in the command line with: `cd ~`
- From your home directory, do: `git clone https://github.com/<YOUR_GITHUB_USERNAME>/python-intermediate-inflammation`. Make sure you are cloning your copy of the software project and not the template repo.
- Navigate into the cloned repository in your command line with: `cd python-intermediate-inflammation`
- List the contents of the directory: `ls -l`. Remember the `-l` flag of the `ls` command and also how to get help for commands in the command line using manual pages, e.g.: `man ls`.
Our Software Project Structure
Let’s inspect the content of the software project from the command line. From the root directory of the project, you can
use the command ls -l
to get a more detailed list of the contents. You should see something similar to the following.
$ cd ~/python-intermediate-inflammation
$ ls -l
total 24
-rw-r--r-- 1 carpentry users 1055 20 Apr 15:41 README.md
drwxr-xr-x 18 carpentry users 576 20 Apr 15:41 data
drwxr-xr-x 5 carpentry users 160 20 Apr 15:41 inflammation
-rw-r--r-- 1 carpentry users 1122 20 Apr 15:41 inflammation-analysis.py
drwxr-xr-x 4 carpentry users 128 20 Apr 15:41 tests
As can be seen from the above, our software project contains the `README` file (that typically describes the project, its usage, installation, authors and how to contribute), the Python script `inflammation-analysis.py`, and three directories - `inflammation`, `data` and `tests`.
The Python script `inflammation-analysis.py` provides the main entry point of the application, and on closer inspection we can see that the `inflammation` directory contains two more Python scripts - `views.py` and `models.py`. We will have a more detailed look into these shortly.
$ ls -l inflammation
total 24
-rw-r--r-- 1 alex staff 71 29 Jun 09:59 __init__.py
-rw-r--r-- 1 alex staff 838 29 Jun 09:59 models.py
-rw-r--r-- 1 alex staff 649 25 Jun 13:13 views.py
The `data` directory contains several files with patients' daily inflammation information.
$ ls -l data
total 264
-rw-r--r-- 1 alex staff 5365 25 Jun 13:13 inflammation-01.csv
-rw-r--r-- 1 alex staff 5314 25 Jun 13:13 inflammation-02.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-03.csv
-rw-r--r-- 1 alex staff 5367 25 Jun 13:13 inflammation-04.csv
-rw-r--r-- 1 alex staff 5345 25 Jun 13:13 inflammation-05.csv
-rw-r--r-- 1 alex staff 5330 25 Jun 13:13 inflammation-06.csv
-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-07.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-08.csv
-rw-r--r-- 1 alex staff 5327 25 Jun 13:13 inflammation-09.csv
-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-10.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-11.csv
-rw-r--r-- 1 alex staff 5340 25 Jun 13:13 inflammation-12.csv
-rw-r--r-- 1 alex staff 22554 25 Jun 13:13 python-novice-inflammation-data.zip
-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-01.csv
-rw-r--r-- 1 alex staff 15 25 Jun 13:13 small-02.csv
-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-03.csv
The data is stored in a series of comma-separated values (CSV) files, where:
- each row holds inflammation measurements for a single patient (in some arbitrary units of inflammation),
- columns represent successive days.
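To make the data layout concrete, here is a minimal sketch (not part of the project's code) of reading such a file with NumPy, using a tiny in-memory stand-in for one of the inflammation CSV files:

```python
import io

import numpy as np

# A tiny stand-in for a file such as data/inflammation-01.csv:
# each row is one patient, each column one day of readings.
sample = io.StringIO("0,0,1,3\n0,1,1,2\n0,1,1,1\n")

data = np.loadtxt(sample, delimiter=",")

print(data.shape)         # (3, 4) - 3 patients, 4 days
print(data.mean(axis=0))  # average inflammation per day across patients
```

In the real project, replacing the `StringIO` object with the path `data/inflammation-01.csv` would load the full 60-patient, 40-day dataset.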
Have a Peek at the Data
Which command(s) would you use to list the contents or the first few lines of the `data/inflammation-01.csv` file?
Solution
- To list the entire content from the project root do: `cat data/inflammation-01.csv`.
- To list the first 5 lines from the project root do: `head -n 5 data/inflammation-01.csv`.
0,0,1,3,2,3,6,4,5,7,2,4,11,11,3,8,8,16,5,13,16,5,8,8,6,9,10,10,9,3,3,5,3,5,4,5,3,3,0,1
0,1,1,2,2,5,1,7,4,2,5,5,4,6,6,4,16,11,14,16,14,14,8,17,4,14,13,7,6,3,7,7,5,6,3,4,2,2,1,1
0,1,1,1,4,1,6,4,6,3,6,5,6,4,14,13,13,9,12,19,9,10,15,10,9,10,10,7,5,6,8,6,6,4,3,5,2,1,1,1
0,0,0,1,4,5,6,3,8,7,9,10,8,6,5,12,15,5,10,5,8,13,18,17,14,9,13,4,10,11,10,8,8,6,5,5,2,0,2,0
0,0,1,0,3,2,5,4,8,2,9,3,3,10,12,9,14,11,13,8,6,18,11,9,13,11,8,5,5,2,8,5,3,5,4,1,3,1,1,0
The `tests` directory contains several tests that have been implemented already. We will be adding more tests during the course as our code grows.
An important thing to note here is that the structure of the project is not arbitrary. One of the big differences between novice and intermediate software development is planning the structure of your code. This structure includes software components and behavioural interactions between them (including how these components are laid out in a directory and file structure). A novice will often make up the structure of their code as they go along. However, for more advanced software development, we need to plan this structure - called a software architecture - beforehand.
Let’s have a more detailed look into what a software architecture is and which architecture is used by our software project before we start adding more code to it.
Software Architecture
A software architecture is the fundamental structure of a software system that is decided at the beginning of project development and cannot be changed that easily once implemented. It refers to a “bigger picture” of a software system that describes high-level components (modules) of the system and how they interact.
In software design and development, large systems or programs are often decomposed into a set of smaller
modules each with a subset of functionality. Typical examples of modules in programming are software libraries;
some software libraries, such as numpy
and matplotlib
in Python, are bigger modules that contain several
smaller sub-modules. Classes in object-oriented programming languages are another example of modules.
Programming Modules and Interfaces
Although modules are self-contained and independent elements to a large extent (they can depend on other modules), there are well-defined ways of how they interact with one another. These rules of interaction are called programming interfaces - they define how other modules (clients) can use a particular module. Typically, an interface to a module includes rules on how a module can take input from and how it gives output back to its clients. A client can be a human, in which case we also call these user interfaces. Even smaller functional units such as functions/methods have clearly defined interfaces - a function/method’s definition (also known as a signature) states what parameters it can take as input and what it returns as an output.
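To illustrate this with a hypothetical function (not one from the project), a Python signature spells out exactly such an interface - what a client must supply and what it gets back:

```python
def daily_mean(data: list[list[float]]) -> list[float]:
    """Return the mean inflammation per day across all patients.

    The signature is the interface: clients must supply a 2D list
    of numbers (rows = patients, columns = days) and will receive
    one mean value per day in return.
    """
    n_patients = len(data)
    return [sum(day) / n_patients for day in zip(*data)]

print(daily_mean([[1, 2], [3, 4]]))  # [2.0, 3.0]
```

A client does not need to know how the means are computed - only the inputs and outputs stated in the signature.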
There are various software architectures that define different ways of dividing code into smaller modules with well-defined roles, for example:
- Model–View–Controller (MVC) architecture, which we will look into in detail and use for our software project,
- Service-Oriented Architecture (SOA), which separates code into distinct services, accessible over a network by consumers (users or other services) that communicate with each other by passing data in a well-defined, shared format (protocol),
- Client-Server architecture, where clients request content or service from a server, initiating communication sessions with servers, which await incoming requests (e.g. email, network printing, the Internet),
- Multilayer architecture, in which presentation, application processing and data management functions are split into distinct layers and may even be physically separated to run on separate machines - some more detail on this later in the course.
Model-View-Controller (MVC) Architecture
MVC architecture divides the related program logic into three interconnected modules:
- Model (data)
- View (client interface), and
- Controller (processes that handle input/output and manipulate the data).
Model represents the data used by a program and also contains operations/rules for manipulating and changing the data in the model. This may be a database, a file, a single data object or a series of objects - for example a table representing patients’ data.
View is the means of displaying data to users/clients within an application (i.e. provides visualisation of the state of the model). For example, displaying a window with input fields and buttons (Graphical User Interface, GUI) or textual options within a command line (Command Line Interface, CLI) are examples of Views. They include anything that the user can see from the application. While building GUIs is not the topic of this course, we will cover building CLIs in Python in later episodes.
Controller manipulates both the Model and the View. It accepts input from the View and performs the corresponding action on the Model (changing the state of the model) and then updates the View accordingly. For example, on user request, Controller updates a picture on a user’s GitHub profile and then modifies the View by displaying the updated profile back to the user.
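The three roles can be sketched in a few lines of Python (the names here are made up for illustration and are not the project's actual classes):

```python
# Model: holds the data plus the rules for changing it.
class PatientModel:
    def __init__(self):
        self.inflammation = {}  # patient id -> list of daily readings

    def add_reading(self, patient_id, value):
        self.inflammation.setdefault(patient_id, []).append(value)


# View: only knows how to present the Model's data, not change it.
def text_view(model):
    return "\n".join(f"patient {pid}: {readings}"
                     for pid, readings in model.inflammation.items())


# Controller: accepts input, updates the Model, then refreshes the View.
def record_reading(model, patient_id, value):
    model.add_reading(patient_id, value)
    return text_view(model)


model = PatientModel()
print(record_reading(model, 1, 3))  # patient 1: [3]
```

Note how the View never touches the data directly and the Model knows nothing about presentation - the Controller mediates between the two.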
MVC Examples
MVC architecture can be applied in scientific applications in the following manner. Model comprises those parts of the application that deal with some type of scientific processing or manipulation of the data, e.g. numerical algorithm, simulation, DNA. View is a visualisation, or format, of the output, e.g. graphical plot, diagram, chart, data table, file. Controller is the part that ties the scientific processing and output parts together, mediating input and passing it to the model or view, e.g. command line options, mouse clicks, input files. For example, the diagram below depicts the use of MVC architecture for the DNA Guide Graphical User Interface application.
MVC Application Examples From your Work
Think of some other examples from your work or life where MVC architecture may be suitable or have a discussion with your fellow learners.
Solution
MVC architecture is a popular choice when designing web and mobile applications. Users interact with a web/mobile application by sending various requests to it. Forms to collect users' inputs/requests, together with the information returned and displayed to the user as a result, represent the View. Requests are processed by the Controller, which interacts with the Model to retrieve or update the underlying data. For example, a user may request to view their profile. The Controller retrieves the account information for the user from the Model and passes it to the View for rendering. The user may further interact with the application by asking it to update their personal information. The Controller verifies the correctness of the information (e.g. the password satisfies certain criteria, postal address and phone number are in the correct format, etc.) and passes it to the Model for permanent storage. The View is then updated accordingly and the user sees their updated profile details.
Note that not everything fits into the MVC architecture, but it is still good to think about how things could be split into smaller units. For a few more examples, have a look at this short article on MVC from Codecademy.
Separation of Concerns
Separation of concerns is important when designing software architectures in order to reduce the code’s complexity. Note, however, there are limits to everything - and MVC architecture is no exception. Controller often transcends into Model and View and a clear separation is sometimes difficult to maintain. For example, the Command Line Interface provides both the View (what user sees and how they interact with the command line) and the Controller (invoking of a command) aspects of a CLI application. In Web applications, Controller often manipulates the data (received from the Model) before displaying it to the user or passing it from the user to the Model.
Our Project’s MVC Architecture
Our software project uses the MVC architecture. The file `inflammation-analysis.py` is the Controller module that performs basic statistical analysis over patient data and provides the main entry point into the application. The View and Model modules are contained in the files `views.py` and `models.py`, respectively, and are conveniently named. Data underlying the Model is contained within the directory `data` - as we have seen already, it contains several files with patients' daily inflammation information.
We will revisit the software architecture and MVC topics once again in a later episode when we talk in more detail about software design. We now proceed to set up our virtual development environment and start working with the code using a more convenient graphical tool - the PyCharm IDE.
Key Points
Programming interfaces define how individual modules within a software application interact among themselves or how the application itself interacts with its users.
MVC is a software design architecture which divides the application into three interconnected modules: Model (data), View (user interface), and Controller (input/output and data manipulation).
The software project we use throughout this course is an example of an MVC application that manipulates patients’ inflammation data and performs basic statistical analysis using Python.
Virtual Environments For Software Development
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What are virtual environments in software development and why should you use them?
How can we manage Python virtual environments and external (third-party) libraries?
Objectives
Set up a Python virtual environment for our software project using `venv` and `pip`.
Run our software from the command line.
Introduction
So far we have checked out our software project from GitHub and inspected its contents and architecture a bit. We now want to run our code to see what it does - let's do that from the command line. For most of the course we will run our code and interact with Git from the command line. While we will develop and debug our code using the PyCharm IDE, and it is possible to use Git from PyCharm too, typing commands in the command line 'forces' you to familiarise yourself with it and learn it well. A bonus is that this knowledge is transferable to running code in other programming languages and is independent of any IDE you may use in the future.
If you have a little peek into our code (e.g. do `cat inflammation/views.py` from the project root), you will see the following two lines somewhere at the top.
from matplotlib import pyplot as plt
import numpy as np
This means that our code requires two external libraries (also called third-party packages or dependencies) - `numpy` and `matplotlib`.
Python applications often use external libraries that don’t come as part of the standard Python distribution. This means
that you will have to use a package manager tool to install them on your system.
Applications will also sometimes need a
specific version of an external library (e.g. because they require that a particular
bug has been fixed in a newer version of the library), or a specific version of Python interpreter.
This means that each Python application you work with may require a different setup and a set of dependencies so it
is important to be able to keep these configurations separate to avoid confusion between projects.
The solution for this problem is to create a self-contained virtual
environment per project, which contains a particular version of Python installation plus a number of
additional external libraries.
Virtual environments are not just a feature of Python - all modern programming languages use them to isolate code of a specific project and make it easier to develop, run, test and share code with others. In this episode, we learn how to set up a virtual environment to develop our code and manage our external dependencies.
Virtual Environments
So what exactly are virtual environments, and why use them?
A Python virtual environment is an isolated working copy of a specific version of the Python interpreter, together with specific versions of a number of external libraries installed into that environment. A virtual environment is simply a directory with a particular structure which includes links to the Python interpreter and its own set of installed libraries; it enables multiple side-by-side installations of different Python interpreters, or of different versions of the same external library, to coexist on your machine, with only one selected for each of your projects. This allows you to work on a particular project without worrying about affecting other projects on your machine.
As more external libraries are added to your Python project over time, you can add them to its specific virtual environment and avoid a great deal of confusion by having separate (smaller) virtual environments for each project rather than one huge global environment with potential package version clashes. Another big motivator for using virtual environments is that they make sharing your code with others much easier (as we will see shortly). Here are some typical scenarios where the usage of virtual environments is highly recommended (almost unavoidable):
- You have an older project that only works under Python 2. You do not have the time to migrate the project to Python 3 or it may not even be possible as some of the third party dependencies are not available under Python 3. You have to start another project under Python 3. The best way to do this on a single machine is to set up two separate Python virtual environments.
- One of your Python 3 projects is locked to use a particular older version of a third party dependency. You cannot use the latest version of the dependency as it breaks things in your project. In a separate branch of your project, you want to try and fix problems introduced by the new version of the dependency without affecting the working version of your project. You need to set up a separate virtual environment for your branch to ‘isolate’ your code while testing the new feature.
You do not have to worry too much about specific versions of external libraries that your project depends on most of the time. Virtual environments enable you to always use the latest available version without specifying it explicitly. They also enable you to use a specific older version of a package for your project, should you need to.
A Specific Python or Package Version is Only Ever Installed Once
Note that you will not have separate Python or package installations for each of your projects - they will only ever be installed once on your system but will be referenced from different virtual environments.
Managing Python Virtual Environments
There are several commonly used command line tools for managing Python virtual environments:
- `venv`, available by default from the standard Python distribution from Python 3.3+
- `virtualenv`, which needs to be installed separately but supports both Python 2.7+ and Python 3.3+
- `pipenv`, created to fix certain shortcomings of `virtualenv`
- `conda`, which comes together with the Anaconda Python distribution
While there are pros and cons for using each of the above, all will do the job of managing Python
virtual environments for you and it may be a matter of personal preference which one you go for.
In this course, we will use `venv` to create and manage our virtual environment (which is the preferred way for Python 3.3+). An upside is that `venv` virtual environments created from the command line are also recognised and picked up automatically by the PyCharm IDE, as we will see in the next episode.
Managing Python Packages
Part of managing your (virtual) working environment involves installing, updating and removing external packages on your system. The Python package manager tool `pip` is most commonly used for this - it interacts with, and obtains packages from, the central repository called the Python Package Index (PyPI). `pip` can now be used with all Python distributions (including Anaconda).
A Note on Anaconda and `conda`
Anaconda is an open source Python distribution commonly used for scientific programming - it conveniently installs Python and a number of commonly used scientific computing packages so you do not have to obtain them separately. `conda` (which comes with the Anaconda distribution) is a command line tool with dual functionality: (1) it is a package manager that helps you find Python packages from remote package repositories and install them on your system, and (2) it is also a virtual environment manager. So, if you are using the Anaconda Python distribution, you can use `conda` for both tasks instead of using `venv` and `pip`.
Many Tools for the Job
Installing and managing Python distributions, external libraries and virtual environments is, well,
complex. There is an abundance of tools for each task, each with its advantages and disadvantages, and there are different
ways to achieve the same effect (and even different ways to install the same tool!).
Note that each Python distribution comes with its own version of `pip` - and if you have several Python versions installed you have to be extra careful to use the correct `pip` to manage external packages for that Python version. `venv` and `pip` are considered the de facto standards for virtual environment and package management for Python 3.
However, the advantages of using Anaconda and conda
are that you get (most of the) packages needed for
scientific code development included with the distribution. If you are only collaborating with others who are also using
Anaconda, you may find that conda
satisfies all your needs. It is good, however, to be aware of all these tools,
and use them accordingly. As you become more familiar with them you will realise that equivalent tools work in a similar
way even though the command syntax may be different (and that there are equivalent tools for other programming languages
too to which your knowledge can be ported).
Python Environment Hell
From XKCD (Creative Commons Attribution-NonCommercial 2.5 License)
Let us have a look at how we can create and manage virtual environments from the command line using `venv` and manage packages using `pip`.
Creating a `venv` Environment
Creating a virtual environment with venv
is done by executing the following command:
$ python3 -m venv /path/to/new/virtual/environment
where `/path/to/new/virtual/environment` is a path to a directory where you want to place it - conventionally within your software project so they are co-located.
This will create the target directory for the virtual environment (and any parent directories that don’t exist already).
For our project, let’s create a virtual environment called venv
off the project root:
$ python3 -m venv venv
If you list the contents of the newly created venv
directory, you should see something like:
$ ls -l venv
total 8
drwxr-xr-x 12 alex staff 384 5 Oct 11:47 bin
drwxr-xr-x 2 alex staff 64 5 Oct 11:47 include
drwxr-xr-x 3 alex staff 96 5 Oct 11:47 lib
-rw-r--r-- 1 alex staff 90 5 Oct 11:47 pyvenv.cfg
So, running the `python3 -m venv venv` command created the target directory called `venv` containing:
- a `pyvenv.cfg` configuration file with a home key pointing to the Python installation from which the command was run,
- a `bin` subdirectory (`Scripts` on Windows) containing a symlink of the Python interpreter binary used to create the environment and the standard Python library,
- a `lib/pythonX.Y/site-packages` subdirectory (`Lib\site-packages` on Windows) to contain its own independent set of installed Python packages isolated from other projects,
- various other configuration and supporting files and subdirectories.
Naming Virtual Environments
What is a good name to use for a virtual environment? Using “venv” or “.venv” as the name for an environment and storing it within the project’s directory seems to be the recommended way - this way when you come across such a subdirectory within a software project, by convention you know it contains its virtual environment details. A slight downside is that all different virtual environments on your machine then use the same name and the current one is determined by the context of the path you are currently located in. A (non-conventional) alternative is to use your project name for the name of the virtual environment, with the downside that there is nothing to indicate that such a directory contains a virtual environment. In our case, we have settled to use “venv” since it’s not a hidden directory and will be displayed by the command line when listing directory contents - in the future, you will decide what naming convention works best for you. Here are some references for each of the naming conventions:
- The Hitchhiker’s Guide to Python notes that “venv” is the general convention used globally
- The Python Documentation indicates that “.venv” is common
- “venv” vs “.venv” discussion
Once you’ve created a virtual environment, you will need to activate it:
$ source venv/bin/activate
(venv) $
Activating the virtual environment will change your command line’s prompt to show what virtual environment you are currently using (indicated by its name in round brackets at the start of the prompt), and modify the environment so that running Python will get you the particular version of Python configured in your virtual environment.
You can verify you are using your virtual environment's version of Python by checking the path using `which`:
(venv) $ which python3
/home/alex/python-intermediate-inflammation/venv/bin/python3
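Another way to check - from within Python itself rather than via `which` - is the following short sketch, which relies on the fact that `venv` redirects `sys.prefix` to the environment directory while `sys.base_prefix` keeps pointing at the underlying installation:

```python
import sys

def in_virtualenv() -> bool:
    """Return True if this interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the original installation.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```

Run inside an activated environment this prints `True`; run with the system Python it prints `False`.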
When you’re done working on your project, you can exit the environment with:
(venv) $ deactivate
If you’ve just done the deactivate
, ensure you reactivate the environment ready for the next part:
source venv/bin/activate
(venv) $
Python Within A Virtual Environment
On Mac and Linux, within a virtual environment, python and pip will refer to the version of Python you created the environment with. If you create a virtual environment with python3 -m venv venv, python will refer to python3 and pip will refer to pip3.
On some Windows machines with Python 2 installed, python will refer to the copy of Python 2 installed outside of the virtual environment instead. You can always check which copy of Python you are using in your virtual environment with the command which python.
We continue using python3 and pip3 in this material to avoid confusion for those Windows users.
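Another quick check, from within Python itself: the standard library's sys module reports which interpreter is running, so you can confirm the virtual environment's copy of Python is the one in use. This is a minimal sketch; the path shown in the comment is only illustrative.

```python
# Show which Python interpreter is executing this code; inside an
# activated virtual environment the path points into the venv directory.
import sys

print(sys.executable)  # e.g. /home/alex/python-intermediate-inflammation/venv/bin/python3
print(sys.prefix)      # root directory of the currently active environment
```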
Note that, since our software project is being tracked by Git, the newly created virtual environment will show up in version control - we will see how to handle it using Git in one of the subsequent episodes.
Installing External Libraries in an Environment with pip
We noticed earlier that our code depends on two external libraries - numpy and matplotlib. In order for the code to run on your machine, you need to install these two dependencies into your virtual environment.
To install the latest version of a package with pip you use pip’s install command and specify the package’s name, e.g.:
(venv) $ pip3 install numpy
(venv) $ pip3 install matplotlib
or, to install multiple packages at once:
(venv) $ pip3 install numpy matplotlib
How About python3 -m pip install?
Why are we not using pip as an argument to the python3 command, in the same way we did with venv (i.e. python3 -m venv)? python3 -m pip install should be used according to the official Pip documentation; other official documentation still seems to have a mixture of usages. Core Python developer Brett Cannon offers a more detailed explanation of edge cases when the two options may produce different results and recommends python3 -m pip install. We kept the old-style command (pip3 install) as it seems more prevalent among developers at the moment - but it may be a convention that will soon change and certainly something you should consider.
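For completeness, the -m form can also be driven from inside Python: sys.executable always points at the interpreter currently running, so invoking it with -m pip guarantees pip operates on that same environment. The sketch below only queries pip's version rather than installing anything.

```python
# Run pip as a module of the exact interpreter executing this script.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # pip's version and the site-packages it manages
```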
If you run the pip3 install command on a package that is already installed, pip will notice this and do nothing.
To install a specific version of a Python package, give the package name followed by == and the version number, e.g. pip3 install numpy==1.21.1.
To specify a minimum version of a Python package, you can do pip3 install 'numpy>=1.20' (note the quotes, which stop the shell from interpreting > as output redirection).
To upgrade a package to the latest version, use pip3 install --upgrade numpy.
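You can also check which version of a package ended up installed from within Python itself, using the standard library's importlib.metadata module (Python 3.8+). This is a sketch; pip is used here only as an example of a distribution that is usually present in any virtual environment.

```python
# Query the installed version of a distribution without importing the package.
from importlib import metadata

try:
    version = metadata.version("pip")  # any installed distribution name works here
    print("pip", version)
except metadata.PackageNotFoundError:
    print("pip is not installed in this environment")
```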
To display information about a particular installed package do:
(venv) $ pip3 show numpy
Name: numpy
Version: 1.21.2
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: None
License: BSD
Location: /Users/alex/work/SSI/Carpentries/python-intermediate-inflammation/inflammation/lib/python3.9/site-packages
Requires:
Required-by: matplotlib
To list all packages installed with pip (in your current virtual environment):
(venv) $ pip3 list
Package Version
--------------- -------
cycler 0.11.0
fonttools 4.28.1
kiwisolver 1.3.2
matplotlib 3.5.0
numpy 1.21.4
packaging 21.2
Pillow 8.4.0
pip 21.1.3
pyparsing 2.4.7
python-dateutil 2.8.2
setuptools 57.0.0
setuptools-scm 6.3.2
six 1.16.0
tomli 1.2.2
To uninstall a package installed in the virtual environment, do pip3 uninstall package-name. You can also supply a list of packages to uninstall at the same time.
Exporting/Importing an Environment with pip
You are collaborating on a project with a team so, naturally, you will want to share your environment with your collaborators so they can easily ‘clone’ your software project with all of its dependencies, and everyone can replicate equivalent virtual environments on their machines. pip has a handy way of exporting, saving and sharing virtual environments.
To export your active environment, use the pip freeze command to produce a list of packages installed in the virtual environment. A common convention is to put this list in a requirements.txt file:
(venv) $ pip3 freeze > requirements.txt
(venv) $ cat requirements.txt
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
The first of the above commands will create a requirements.txt file in your current directory. The requirements.txt file can then be committed to a version control system (we will see how to do this using Git in one of the following episodes) and shipped as part of your software, shared with collaborators and/or users. They can then replicate your environment and install all the necessary packages from the project root as follows:
(venv) $ pip3 install -r requirements.txt
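If you want to sanity-check that what is installed matches the pins in a requirements.txt, a short script can compare them. This is a hedged sketch: the pin list is hard-coded for illustration (you would normally read it from the file), the second package name is deliberately fictitious, and the == pin format matches the examples above.

```python
# Compare '==' pins against what is actually installed in the environment.
from importlib import metadata

pins = ["pip==99.0", "definitely-not-installed==1.0"]  # illustrative; read from requirements.txt in practice

results = {}
for pin in pins:
    name, _, wanted = pin.partition("==")
    try:
        results[name] = metadata.version(name) == wanted  # True if the pin is satisfied
    except metadata.PackageNotFoundError:
        results[name] = None  # package missing entirely

print(results)
```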
As your project grows, you may need to update your environment for a variety of reasons. For example, one of your project’s dependencies has just released a new version (dependency version number update), you need an additional package for data analysis (adding a new dependency), or you have found a better package and no longer need the older one (adding a new and removing an old dependency). What you need to do in this case (apart from installing the new packages and removing the ones that are no longer needed from your virtual environment) is update the contents of the requirements.txt file accordingly by re-issuing the pip freeze command, then propagate the updated requirements.txt file to your collaborators via your code sharing platform (e.g. GitHub).
Official Documentation
For a full list of options and commands, consult the official venv documentation and the Installing Python Modules with pip guide. Also check out the guide “Installing packages using pip and virtual environments”.
Running Python Scripts From Command Line
Congratulations! Your environment is now activated and set up to run our inflammation-analysis.py script from the command line.
You should already be located in the root of the python-intermediate-inflammation directory (if not, please navigate to it from the command line now). To run the script, type the following command:
(venv) $ python3 inflammation-analysis.py
usage: inflammation-analysis.py [-h] infiles [infiles ...]
inflammation-analysis.py: error: the following arguments are required: infiles
In the above command, we tell the command line two things:
- to find a Python interpreter (in this case, the one that was configured via the virtual environment), and
- to use it to run our script inflammation-analysis.py, which resides in the current directory.
As we can see, the Python interpreter ran our script, which threw an error - inflammation-analysis.py: error: the following arguments are required: infiles. It looks like the script expects a list of input files to process, so this is the expected behaviour since we did not supply any.
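The error above is the characteristic output of Python's argparse module from the standard library. As an illustration only, here is a minimal sketch of how such a command-line interface is typically declared - the real inflammation-analysis.py may differ in its details, and the sample file name passed in is made up.

```python
# A minimal argparse interface that requires one or more input files,
# mirroring the usage message shown above.
import argparse

parser = argparse.ArgumentParser(prog="inflammation-analysis.py")
parser.add_argument("infiles", nargs="+",
                    help="input files containing inflammation data")

# Simulate a command-line call with one (hypothetical) file name;
# parsing an empty argument list would print the same error and exit.
args = parser.parse_args(["data/inflammation-01.csv"])
print(args.infiles)
```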
Key Points
Virtual environments keep Python versions and dependencies required by different projects separate.
A virtual environment is itself a directory structure.
Use venv to create and manage Python virtual environments.
Use pip to install and manage Python external (third-party) libraries.
pip allows you to declare all dependencies for a project in a separate file (by convention called requirements.txt) which can be shared with collaborators/users and used to replicate a virtual environment.
Use pip3 freeze > requirements.txt to take a snapshot of your project’s dependencies.
Use pip3 install -r requirements.txt to replicate someone else’s virtual environment on your machine from the requirements.txt file.
Integrated Software Development Environments
Overview
Teaching: 25 min
Exercises: 15 min
Questions
What are Integrated Development Environments (IDEs)?
What are the advantages of using IDEs for software development?
Objectives
Set up a (virtual) development environment in PyCharm
Use PyCharm to run a Python script
Introduction
As we have seen in the previous episode - even a simple software project is typically split into smaller functional units and modules which are kept in separate files and subdirectories. As your code starts to grow and becomes more complex, it will involve many different files and various external libraries. You will need an application to help you manage all the complexities of, and provide you with some useful (visual) facilities for, the software development process. Such clever and useful graphical software development applications are called Integrated Development Environments (IDEs).
Integrated Development Environments (IDEs)
An IDE normally consists of at least a source code editor, build automation tools and a debugger. The boundaries between modern IDEs and other aspects of the broader software development process are often blurred as nowadays IDEs also offer version control support, tools to construct graphical user interfaces (GUI) and web browser integration for web app development, source code inspection for dependencies and many other useful functionalities. The following is a list of the most commonly seen IDE features:
- syntax highlighting - to show the language constructs, keywords and the syntax errors with visually distinct colours and font effects
- code completion - to speed up programming by offering a set of possible (syntactically correct) code options
- code search - finding package, class, function and variable declarations, their usages and referencing
- version control support - to interact with source code repositories
- debugging - for setting breakpoints in the code editor, step-by-step execution of code and inspection of variables
IDEs are extremely useful and modern software development would be very hard without them. There are a number of IDEs available for Python development; a good overview is available from the Python Project Wiki. In addition to IDEs, there are also a number of code editors that have Python support. Code editors can be as simple as a text editor with syntax highlighting and code formatting capabilities (e.g. GNU EMACS, Vi/Vim, Atom). Most good code editors can also execute code and control a debugger, and some can also interact with a version control system. Compared to an IDE, a good dedicated code editor is usually smaller and quicker, but often less feature-rich. You will have to decide which one is the best for you - in this course we will learn how to use PyCharm, a free, open source Python IDE. Some popular alternatives include free and open source IDE Spyder and Microsoft’s free Visual Studio Code.
Using the PyCharm IDE
Let’s open our project in PyCharm now and familiarise ourselves with some commonly used features.
Opening a Software Project
If you don’t have PyCharm running yet, start it up now. You can skip the initial configuration steps, which just go through selecting a theme and other aspects. You should be presented with a dialog box that asks you what you want to do, e.g. Create New Project, Open, or Check out from Version Control.
Select Open and find the software project directory python-intermediate-inflammation you cloned earlier. This directory is now the current working directory for PyCharm, so when we run scripts from PyCharm, this is the directory they will run from.
PyCharm will show you a ‘Tip of the Day’ window which you can safely ignore and close for now. You may also get a warning ‘No Python interpreter configured for the project’ - we will deal with this shortly after we familiarise ourselves with the PyCharm environment. You will notice the IDE shows you a project/file navigator window on the left hand side, to traverse and select the files (and any subdirectories) within the working directory, and an editor window on the right. At the bottom, you would typically have a panel for version control, terminal (the command line within PyCharm) and a TODO list.
Select the inflammation-analysis.py file in the project navigator on the left so that its contents are displayed in the editor window. You may notice a warning about the missing Python interpreter at the top of the editor panel showing the inflammation-analysis.py file - this is one of the first things you will have to configure for your project before you can do any work. You may take the shortcut and click on one of the options offered above, but we want to take you through the whole process of setting up your environment in PyCharm as this is important conceptually.
Configuring a Virtual Environment in PyCharm
Before you can run the code from PyCharm, you need to explicitly specify the path to the Python interpreter on your system. The same goes for any dependencies your code may have - you need to tell PyCharm where to find them - much like we did from the command line in the previous episode. Luckily for us, we have already set up a virtual environment for our project from the command line and PyCharm is clever enough to understand it.
Adding a Python Interpreter
- Select either PyCharm > Preferences (Mac) or File > Settings (Linux, Windows).
- In the preferences window that appears, select Project: python-intermediate-inflammation > Python Interpreter from the left. You’ll see a number of Python packages displayed as a list, and importantly above that, the current Python interpreter that is being used. These may be blank or set to <No interpreter>, or possibly the default version of Python installed on your system, e.g. Python 2.7 /usr/bin/python2.7, which we do not want to use in this instance.
- Select the cog-like button in the top right, then Add Local... (or Add... depending on your PyCharm version). An Add Python Interpreter window will appear.
- Select Virtualenv from the list on the left and ensure that the Existing environment checkbox is selected within the popup window. In the Interpreter field, point to the Python 3 executable inside your virtual environment’s bin directory (make sure you navigate to it and select it from the file browser rather than just accept the default offered by PyCharm). Note that there is also an option to create a new virtual environment, but we are not using that option as we want to reuse the one we created from the command line in the previous episode.
- Select the Make available to all projects checkbox so we can also use this environment for other projects if we wish.
- Select OK in the Add Python Interpreter window. Back in the Preferences window, you should select “Python 3.9 (python-intermediate-inflammation)” or similar (that you’ve just added) from the Project Interpreter drop-down list.
Note that a number of external libraries have magically appeared under the “Python 3.9 (python-intermediate-inflammation)” interpreter, including numpy and matplotlib. PyCharm has recognised the virtual environment we created from the command line using venv and has added these libraries, effectively replicating our virtual environment in PyCharm (referred to as “Python 3.9 (python-intermediate-inflammation)”).
Also note that, although the names are not the same - this is one and the same virtual environment and changes done to it in PyCharm will propagate to the command line and vice versa. Let’s see this in action through the following exercise.
Compare External Libraries in the Command Line and PyCharm
Can you recall two places where information about our project’s dependencies can be found from the command line? Compare that information with the equivalent configuration in PyCharm.
Hint: We can use an argument to pip, or find the packages directly in a subdirectory of our virtual environment directory venv.
Solution
From the previous episode, you may remember that we can get the list of packages in the current virtual environment using the pip3 list command:
(venv) $ pip3 list
Package         Version
--------------- -------
cycler          0.11.0
fonttools       4.28.1
kiwisolver      1.3.2
matplotlib      3.5.0
numpy           1.21.4
packaging       21.2
Pillow          8.4.0
pip             21.1.3
pyparsing       2.4.7
python-dateutil 2.8.2
setuptools      57.0.0
setuptools-scm  6.3.2
six             1.16.0
tomli           1.2.2
However, pip3 list shows all the packages in the virtual environment - if we want to see only the list of packages that we installed, we can use the pip3 freeze command instead:
(venv) $ pip3 freeze
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
We see pip in the pip3 list output but not in pip3 freeze, as we did not install it using pip. Remember that we use pip3 freeze to update our requirements.txt file, to keep a list of the packages our virtual environment includes. Python will not do this automatically; we have to manually update the file when our requirements change using:
pip3 freeze > requirements.txt
If we want, we can also see the list of packages directly in the following subdirectory of venv:
(venv) $ ls -l venv/lib/python3.9/site-packages
total 1088
drwxr-xr-x 103 alex staff    3296 17 Nov 11:55 PIL
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 Pillow-8.4.0.dist-info
drwxr-xr-x   6 alex staff     192 17 Nov 11:55 __pycache__
drwxr-xr-x   5 alex staff     160 17 Nov 11:53 _distutils_hack
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 cycler-0.11.0.dist-info
-rw-r--r--   1 alex staff   14519 17 Nov 11:55 cycler.py
drwxr-xr-x  14 alex staff     448 17 Nov 11:55 dateutil
-rw-r--r--   1 alex staff     152 17 Nov 11:53 distutils-precedence.pth
drwxr-xr-x  31 alex staff     992 17 Nov 11:55 fontTools
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 fonttools-4.28.1.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 kiwisolver-1.3.2.dist-info
-rwxr-xr-x   1 alex staff  216968 17 Nov 11:55 kiwisolver.cpython-39-darwin.so
drwxr-xr-x  92 alex staff    2944 17 Nov 11:55 matplotlib
-rw-r--r--   1 alex staff     569 17 Nov 11:55 matplotlib-3.5.0-py3.9-nspkg.pth
drwxr-xr-x  20 alex staff     640 17 Nov 11:55 matplotlib-3.5.0.dist-info
drwxr-xr-x   7 alex staff     224 17 Nov 11:55 mpl_toolkits
drwxr-xr-x  39 alex staff    1248 17 Nov 11:55 numpy
drwxr-xr-x  11 alex staff     352 17 Nov 11:55 numpy-1.21.4.dist-info
drwxr-xr-x  15 alex staff     480 17 Nov 11:55 packaging
drwxr-xr-x  10 alex staff     320 17 Nov 11:55 packaging-21.2.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:53 pip
drwxr-xr-x  10 alex staff     320 17 Nov 11:53 pip-21.1.3.dist-info
drwxr-xr-x   7 alex staff     224 17 Nov 11:53 pkg_resources
-rw-r--r--   1 alex staff      90 17 Nov 11:55 pylab.py
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 pyparsing-2.4.7.dist-info
-rw-r--r--   1 alex staff  273365 17 Nov 11:55 pyparsing.py
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 python_dateutil-2.8.2.dist-info
drwxr-xr-x  41 alex staff    1312 17 Nov 11:53 setuptools
drwxr-xr-x  11 alex staff     352 17 Nov 11:53 setuptools-57.0.0.dist-info
drwxr-xr-x  19 alex staff     608 17 Nov 11:55 setuptools_scm
drwxr-xr-x  10 alex staff     320 17 Nov 11:55 setuptools_scm-6.3.2.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 six-1.16.0.dist-info
-rw-r--r--   1 alex staff   34549 17 Nov 11:55 six.py
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 tomli
drwxr-xr-x   7 alex staff     224 17 Nov 11:55 tomli-1.2.2.dist-info
Finally, if you look at both the contents of venv/lib/python3.9/site-packages and requirements.txt and compare that with the packages shown in PyCharm’s Python Interpreter Configuration, you will see that they all contain equivalent information.
Adding an External Library
We have already added the packages numpy and matplotlib to our virtual environment from the command line in the previous episode, so we are up-to-date with all the external libraries we require at the moment. However, we will soon need the library pytest to implement tests for our code, so we will use this opportunity to install it from PyCharm, in order to see an alternative way of doing this and how it propagates to the command line.
- Select either PyCharm > Preferences (Mac) or File > Settings (Linux, Windows).
- In the preferences window that appears, select Project: python-intermediate-inflammation > Project Interpreter from the left.
- Select the + icon at the top of the window. In the window that appears, search for the name of the library (pytest), select it from the list, then select Install Package.
- Select OK in the Preferences window.
It may take a few minutes for PyCharm to install it. After it is done, the pytest library is added to our virtual environment. You can also verify this from the command line by listing the venv/lib/python3.9/site-packages subdirectory. Note, however, that requirements.txt is not updated - as we mentioned earlier, this is something you have to do manually. Let’s do this as an exercise.
Update requirements.txt After Adding a New Dependency
Export the newly updated virtual environment into the requirements.txt file.
Solution
Let’s verify first that the newly installed library pytest is appearing in our virtual environment but not in requirements.txt. First, let’s check the list of installed packages:
(venv) $ pip3 list
Package         Version
--------------- -------
attrs           21.4.0
cycler          0.11.0
fonttools       4.28.5
iniconfig       1.1.1
kiwisolver      1.3.2
matplotlib      3.5.1
numpy           1.22.0
packaging       21.3
Pillow          9.0.0
pip             20.0.2
pluggy          1.0.0
py              1.11.0
pyparsing       3.0.7
pytest          6.2.5
python-dateutil 2.8.2
setuptools      44.0.0
six             1.16.0
toml            0.10.2
tomli           2.0.0
We can see the pytest library appearing in the listing above. However, if we do:
(venv) $ cat requirements.txt
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
pytest is missing from requirements.txt. To add it, we need to update the file by repeating the command:
(venv) $ pip3 freeze > requirements.txt
pytest is now present in requirements.txt:
attrs==21.2.0
cycler==0.11.0
fonttools==4.28.1
iniconfig==1.1.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pluggy==1.0.0
py==1.11.0
pyparsing==2.4.7
pytest==6.2.5
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
toml==0.10.2
tomli==1.2.2
Adding a Run Configuration for Our Project
Having configured a virtual environment, we now need to tell PyCharm to use it for our project. This is done by adding a Run Configuration to a project:
- To add a new configuration for a project, select Run > Edit Configurations... from the top menu.
- Select Add new run configuration..., then Python.
- In the new popup window, in the Script path field, select the folder button and find and select inflammation-analysis.py. This tells PyCharm which script to run (i.e. what the main entry point to our application is).
- In the same window, select “Python 3.9 (python-intermediate-inflammation)” in the Python interpreter field.
- You can give this run configuration a name at the top of the window if you like - e.g. let’s name it inflammation.
- You can optionally configure run parameters and environment variables in the same window - we do not need this at the moment.
- Select Apply to confirm these settings.
Virtual Environments & Run Configurations in PyCharm
We configured the Python interpreter to use for our project by pointing PyCharm to the virtual environment we created from the command line (which also includes external libraries our code needs to run). Recall that you can create several virtual environments based on the same Python interpreter but with different external libraries - this is helpful when you need to develop different types of applications. For example, you can create one virtual environment based on Python 3.9 to develop Django Web applications and another virtual environment based on the same Python 3.9 to work with scientific libraries.
Run Configurations in PyCharm are named sets of startup properties that define what to execute and what parameters (i.e. what additional configuration options) to use on top of virtual environments. You can vary these configurations each time your code is executed, which is particularly useful for running, debugging and testing your code.
Now you know how to configure and manipulate your environment in both tools (command line and PyCharm), which is a useful parallel to be aware of. Let’s have a look at some other features afforded to us by PyCharm.
Syntax Highlighting
The first thing you may notice is that code is displayed using different colours. Syntax highlighting is a feature that displays source code terms in different colours and fonts according to the syntax category the highlighted term belongs to. It also makes syntax errors visually distinct. Highlighting does not affect the meaning of the code itself - it’s intended only for humans to make reading code and finding errors easier.
Code Completion
As you start typing code, PyCharm will offer to complete some of the code for you in the form of an auto completion popup. This is a context-aware code completion feature that speeds up the process of coding (e.g. reducing typos and other common mistakes) by offering available variable names, functions from available packages, parameters of functions, hints related to syntax errors, etc.
Code Definition & Documentation References
You will often need code reference information to help you code. PyCharm shows this useful information, such as definitions of symbols (e.g. functions, parameters, classes, fields, and methods) and documentation references by means of quick popups and inline tooltips.
For a selected piece of code, you can access various code reference information from the View menu (or via various keyboard shortcuts), including:
- Quick Definition - where and how symbols (functions, parameters, classes, fields, and methods) are defined
- Quick Type Definition - type definition of variables, fields or any other symbols
- Quick Documentation - inline documentation (docstrings) for any symbol created in accordance with PEP-257
- Parameter Info - the names of parameters in method and function calls
- Type Info - type of an expression
Code Search
You can search for a text string within a project, use different scopes to narrow your search process, use regular expressions for complex searches, include/exclude certain files from your search, find usages and occurrences. To find a search string in the whole project:
- From the main menu, select Edit | Find | Find in Path... (or Edit | Find | Find in Files... depending on your version of PyCharm).
- Type your search string in the search field of the popup. Alternatively, in the editor, highlight the string you want to find and press Command-Shift-F (on Mac) or Control-Shift-F (on Windows). PyCharm places the highlighted string into the search field of the popup. If you need to, specify the additional options in the popup. PyCharm will list the search strings and all the files that contain them.
- Check the results in the preview area of the dialog, where you can replace the search string or select another string, or press Command-Shift-F (on Mac) or Control-Shift-F (on Windows) again to start a new search.
- To see the list of occurrences in a separate panel, click the Open in Find Window button in the bottom right corner. The find panel will appear at the bottom of the main window; use this panel and its options to group the results, preview them, and work with them further.
Version Control
PyCharm supports a directory-based versioning model, which means that each project directory can be associated with a different version control system. Our project was already under Git version control and PyCharm recognised it. It is also possible to add an unversioned project directory to version control directly from PyCharm.
During this course, we will do all our version control commands from the command line but it is worth noting that PyCharm supports a comprehensive subset of Git commands (i.e. it is possible to perform a set of common Git commands from PyCharm but not all). A very useful version control feature in PyCharm is graphically comparing changes you made locally to a file with the version of the file in a repository, a different commit version or a version in a different branch - this is something that cannot be done equally well from the text-based command line.
You can get a full documentation on PyCharm’s built-in version control support online.
Running Scripts in PyCharm
We have configured our environment and explored some of the most commonly used PyCharm features and are now ready to run our script from PyCharm! To do so, right-click the inflammation-analysis.py file in the PyCharm project/file navigator on the left, and select Run 'inflammation'.
The script will run in a terminal window at the bottom of the IDE window and display something like:
/Users/alex/work/python-intermediate-inflammation/venv/bin/python /Users/alex/work/python-intermediate-inflammation/inflammation-analysis.py
usage: inflammation-analysis.py [-h] infiles [infiles ...]
inflammation-analysis.py: error: the following arguments are required: infiles
Process finished with exit code 2
This is the same error we got when running the script from the command line. We will get back to this error shortly - for now, the good thing is that we managed to set up our project for development both from the command line and PyCharm and are getting the same outputs. Before we move on to fixing errors and writing more code, let’s have a look at the last set of tools for collaborative code development which we will be using in this course - Git and GitHub.
Key Points
An IDE is an application that provides a comprehensive set of facilities for software development, including syntax highlighting, code search and completion, version control, testing and debugging.
PyCharm recognises virtual environments configured from the command line using venv and pip.
Collaborative Software Development Using Git and GitHub
Overview
Teaching: 45 min
Exercises: 0 min
Questions
What are Git branches and why are they useful for code development?
What are some best practices when developing software collaboratively using Git?
Objectives
Commit changes in a software project to a local repository and publish them in a remote repository on GitHub
Create different branches for code development
Learn to use feature branch workflow to effectively collaborate with a team on a software project
Introduction
So far we have checked out our software project from GitHub and used command line tools to configure a virtual environment for our project and run our code. We have also familiarised ourselves with PyCharm - a graphical tool we will use for code development, testing and debugging. We are now going to start using another set of tools from the collaborative code development toolbox - namely, the version control system Git and code sharing platform GitHub. These two will enable us to track changes to our code and share it with others.
You may recall that we have already made some changes to our project locally - we created a virtual environment in the venv directory and exported it to the requirements.txt file.
We should now decide which of those changes we want to check in and share with others in our team. This is a typical
software development workflow - you work locally on code, test it to make sure
it works correctly and as expected, then record your changes using version control and share your work with others
via a shared and centrally backed-up repository.
Firstly, let’s remind ourselves how to work with Git from the command line.
Git Refresher
Git is a version control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source code management in software development but it can be used to track changes in files in general - it is particularly effective for tracking text-based files (e.g. source code files, CSV, Markdown, HTML, CSS, Tex, etc. files).
Git has several important characteristics:
- support for non-linear development allowing you and your colleagues to work on different parts of a project concurrently,
- support for distributed development allowing for multiple people to be working on the same project (even the same file) at the same time,
- every change recorded by Git remains part of the project history and can be retrieved at a later date, so even if you make a mistake you can revert to a point before it.
The diagram below shows a typical software development lifecycle with Git and the commonly used commands to interact with different parts of Git infrastructure, such as:
- working directory - a directory (including any subdirectories) where your project files live and where you are currently working. It is also known as the “untracked” area of Git. Any changes to files will be marked by Git in the working directory. If you make changes to the working directory and do not explicitly tell Git to save them - you will likely lose those changes. Using the git add filename command, you tell Git to start tracking changes to the file filename within your working directory.
- staging area (index) - once you tell Git to start tracking changes to files (with the git add filename command), Git saves those changes in the staging area. Each subsequent change to the same file needs to be followed by another git add filename command to tell Git to update it in the staging area. To see what is in your working directory and staging area at any moment (i.e. what changes Git is tracking), run the command git status.
- local repository - stored within the .git directory of your project, this is where Git wraps together all your changes from the staging area and puts them using the git commit command. Each commit is a new, permanent snapshot (checkpoint, record) of your project in time, which you can share or revert back to.
- remote repository - this is a version of your project that is hosted somewhere on the Internet (e.g. on GitHub, GitLab or somewhere else). While your project is nicely version-controlled in your local repository, and you have snapshots of its versions from the past, if your machine crashes - you still may lose all your work. Working with a remote repository involves pushing your changes and pulling other people’s changes to keep your local repository in sync, in order to collaborate with others and to back up your work on a different machine.
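The four areas above map onto a small set of commands. Below is a minimal sketch you can run in a throwaway repository - the file name, commit message and identity are invented purely for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com"    # demo identity only
git config user.name "Demo User"

echo "numpy" > requirements.txt            # a change in the working directory (untracked)
git status --short                         # shows: ?? requirements.txt
git add requirements.txt                   # move the change to the staging area
git status --short                         # shows: A  requirements.txt
git commit -q -m "Add requirements file"   # record a snapshot in the local repository
git log --oneline                          # the commit is now part of the project history
# git push origin main                     # would send it to a remote repository (none configured here)
```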
Software development lifecycle with Git
From PNGWing (licenced for non-commercial reuse)
Checking-in Changes to Our Project
Let’s check-in the changes we have done to our project so far. The first thing to do upon navigating into our software project’s directory root is to check the current status of our local working directory and repository.
$ git status
On branch main
Your branch is up to date with 'origin/main'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
requirements.txt
venv/
nothing added to commit but untracked files present (use "git add" to track)
As expected, Git is telling us that we have some untracked files - requirements.txt and the directory venv - present in our working directory which we have not staged nor committed to our local repository yet. You do not want to commit the newly created venv directory and share it with others because this directory is specific to your machine and setup only (i.e. it contains local paths to libraries on your system that most likely would not work on any other machine). You do, however, want to share requirements.txt with your team as this file can be used to replicate the virtual environment on your collaborators’ systems.
To tell Git to intentionally ignore and not track certain files and directories, you need to specify them in the .gitignore text file in the project root. Our project already has a .gitignore, but in cases where you do not have one - you can simply create it yourself. In our case, we want to tell Git to ignore the venv directory (and .venv as another naming convention for virtual environments) and stop notifying us about it. Edit your .gitignore file in PyCharm and add a line containing “venv/” and another one containing “.venv/”. It does not matter much in this case where within the file you add these lines, so let’s do it at the end. Your .gitignore should look something like this:
# IDEs
.vscode/
.idea/
# Intermediate Coverage file
.coverage
# Output files
*.png
# Python runtime
*.pyc
*.egg-info
.pytest_cache
# Virtual environments
venv/
.venv/
You may notice that we are already not tracking certain files and directories, with useful comments about what exactly we are ignoring. You may also notice that each line in .gitignore is actually a pattern, so you can ignore multiple files that match it (e.g. “*.png” will ignore all PNG files in the current directory).
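If you are ever unsure why Git is (or is not) ignoring a path, git check-ignore will tell you which rule matched. A small sketch in a throwaway repository, with a .gitignore invented for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
printf '# Virtual environments\nvenv/\n.venv/\n*.png\n' > .gitignore

# -v (verbose) prints the file, line number and pattern that caused the match
git check-ignore -v venv/ plot.png
# e.g. .gitignore:2:venv/    venv/
#      .gitignore:4:*.png    plot.png
```

Note that check-ignore works on path names, so the paths do not even need to exist yet.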
If you run the git status command now, you will notice that Git has cleverly understood that you want to ignore changes to the venv folder, so it is not warning us about it any more. However, it has now detected a change to the .gitignore file that needs to be committed.
$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: .gitignore
Untracked files:
(use "git add <file>..." to include in what will be committed)
requirements.txt
no changes added to commit (use "git add" and/or "git commit -a")
To commit the changes to .gitignore and requirements.txt to the local repository, we first have to add these files to the staging area to prepare them for committing. We can do both at the same time with:
$ git add .gitignore requirements.txt
Now we can commit them to the local repository with:
$ git commit -m "Initial commit of requirements.txt. Ignoring virtual env. folder."
Remember to use meaningful messages for your commits.
So far we have been working in isolation - all the changes we have made are still only stored locally on our individual machines. In order to share our work with others - we should push our changes to the remote repository on GitHub. GitHub has recently strengthened authentication requirements for Git operations accessing GitHub from the command line over HTTPS. This means you cannot use passwords for authentication any more - you need to set up and use a personal access token for additional security before you can push your local changes to the remote repository. So, when you run the command below:
$ git push origin main
Git will prompt you to authenticate - enter your GitHub username and the previously generated access token as the password (you can also enable caching of the credentials so your machine remembers the access token). In the above command, origin is an alias for the remote repository you used when cloning the project locally (it is called that by convention and set up automatically by Git when you run the git clone remote_url command to replicate a remote repository locally); main is the name of our main (and currently only) development branch.
Account Security
When using git config --global credential.helper cache, any password or personal access token you enter will be cached for a period of time - 15 minutes by default. Re-entering a password every 15 minutes can be OK, but for a personal access token it can be inconvenient, and lead to you writing the token down elsewhere. To permanently store passwords or tokens, use store instead of cache.
Storing an access token always carries a security risk. One compromise between short cache timescales and permanent stores is to set a time-out on your personal access token when you create it, reducing the risk of it being stolen after you stop working on the project you issued it for.
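Another compromise is to lengthen the cache timeout itself. The sketch below writes to a scratch config file rather than your real --global config, so it is safe to experiment with:

```shell
set -e
cfg=$(mktemp)
# Equivalent to: git config --global credential.helper 'cache --timeout=3600'
# i.e. cache entered credentials for one hour instead of the default 15 minutes.
git config --file "$cfg" credential.helper 'cache --timeout=3600'
git config --file "$cfg" credential.helper
```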
Git Remotes
Note that systems like Git allow us to synchronise work between any two or more copies of the same repository - the ones that are not located on your machine are “Git remotes” for you. In practice, though, it is easiest to agree with your collaborators to use one copy as a central hub (such as GitHub or GitLab), where everyone pushes their changes to. This also avoids risks associated with keeping the “central copy” on someone’s laptop. You can have more than one remote configured for your local repository, each of which generally is either read-only or read/write for you. Collaborating with others involves managing these remote repositories and pushing and pulling information to and from them when you need to share work.
Git - distributed version control system
From W3Docs (freely available)
Git Branches
When we do git status, Git also tells us that we are currently on the main branch of the project. A branch is one version of your project (the files in your repository) that can contain its own set of commits. We can create a new branch, make changes to the code which we then commit to the branch, and, once we are happy with those changes, merge them back to the main branch. To see what other branches are available, do:
$ git branch
* main
At the moment, there’s only one branch (main) and hence only one version of the code available. When you create a Git repository for the first time, by default you only get one version (i.e. branch) - main. Let’s have a look at why having different branches might be useful.
Feature Branch Software Development Workflow
While it is technically OK to commit your changes directly to the main branch, and you may often find yourself doing so for some minor changes, the best practice is to use a new branch for each separate and self-contained unit/piece of work you want to add to the project. This unit of work is also often called a feature and the branch where you develop it is called a feature branch. Each feature branch should have its own meaningful name indicating its purpose (e.g. “issue23-fix”). If we keep making changes and pushing them directly to the main branch on GitHub, then anyone who downloads our software from there will get all of our work in progress - whether or not it’s ready to use! So, working on a separate branch for each feature you are adding is good for several reasons:
- it enables the main branch to remain stable while you and the team explore and test the new code on a feature branch,
- it enables you to keep the untested and not-yet-functional feature branch code under version control and backed up,
- you and other team members may work on several features at the same time independently from one another,
- if you decide that the feature is not working or is no longer needed - you can easily and safely discard that branch without affecting the rest of the code.
Branches are commonly used as part of a feature-branch workflow, shown in diagram below.
Git feature branches
From Git Tutorial by sillevl (Creative Commons Attribution 4.0 International License)
In the software development workflow, we typically have a main branch which is the version of the code that is tested, stable and reliable. Then, we normally have a development branch (called develop or dev by convention) that we use for work-in-progress code. As we work on adding new features to the code, we create new feature branches that first get merged into develop after a thorough testing process. After even more testing - the develop branch will get merged into main.
The points when feature branches are merged into develop, and develop into main, depend entirely on the practice/strategy established in the team. For example, for smaller projects (e.g. if you are working alone on a project or in a very small team), feature branches sometimes get merged directly into main upon testing, skipping the develop branch step. In other projects, the merge into main happens only at the point of making a new software release. Whichever is the case for you, a good rule of thumb is - nothing that is broken should be in main.
Creating Branches
Let’s create a develop branch to work on:
$ git branch develop
This command does not give any output, but if we run git branch again, without giving it a new branch name, we can see the list of branches we have - including the new one we have just made.
$ git branch
develop
* main
The * indicates the currently active branch. So how do we switch to our new branch? We use the git checkout command with the name of the branch:
$ git checkout develop
Switched to branch 'develop'
Create and Switch to Branch Shortcut
A shortcut to create a new branch and immediately switch to it:
$ git checkout -b develop
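On Git 2.23 and later, git switch offers a clearer spelling of the same shortcut. A sketch in a throwaway repository (the empty initial commit is just to give the repository some history for the demo):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com" && git config user.name "Demo User"
git commit -q --allow-empty -m "Initial commit"

git switch -c develop        # same effect as: git checkout -b develop
git branch --show-current    # prints: develop
```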
Updating Branches
If we start updating files now, the modifications will happen on the develop branch and will not affect the version of the code in main. We add and commit things to the develop branch in the same way as we do to main.
Let’s make a small modification to inflammation/models.py in PyCharm, and, say, change the spelling of “2d” to “2D” in the docstrings for the functions daily_mean(), daily_max() and daily_min().
If we do:
$ git status
On branch develop
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: inflammation/models.py
no changes added to commit (use "git add" and/or "git commit -a")
Git is telling us that we are on branch develop and which tracked files have been modified in our working directory. We can now add and commit the changes in the usual way.
$ git add inflammation/models.py
$ git commit -m "Spelling fix"
Currently Active Branch
Remember, the add and commit commands always act on the currently active branch. You have to be careful and aware of which branch you are working with at any given moment. git status can help with that, and you will find yourself invoking it very often.
Pushing New Branch Remotely
We push the contents of the develop branch to GitHub in the same way as we pushed the main branch. However, as we have just created this branch locally, it still does not exist in our remote repository. You can check that in GitHub by listing all branches.
To push a new local branch remotely for the first time, you could use the -u switch and the name of the branch you are creating and pushing to:
$ git push -u origin develop
Git Push With -u Switch
Using the -u switch with the git push command is a handy shortcut for: (1) creating the new remote branch and (2) setting your local branch to automatically track the remote one at the same time. You need to use the -u switch only once to set up that association between your branch and the remote one explicitly. After that you could simply use git push without specifying the remote repository, if you wished. We still prefer to explicitly state this information in commands.
Let’s confirm that the new branch develop now exists remotely on GitHub too. From the < > Code tab in your repository in GitHub, click the branch dropdown menu (currently showing the default branch main). You should see your develop branch in the list too.
Now the others can check out the develop branch too and continue to develop code on it. After the initial push of the new branch, each subsequent push is done in the usual manner (i.e. without the -u switch):
$ git push origin develop
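You can rehearse this whole push workflow offline by using a local bare repository as a stand-in for GitHub. All names below are invented for the demo:

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"          # stand-in for the remote on GitHub
git clone -q "$work/origin.git" "$work/project"
cd "$work/project"
git config user.email "you@example.com" && git config user.name "Demo User"

git checkout -q -b main
git commit -q --allow-empty -m "Initial commit"
git push -q -u origin main

git checkout -q -b develop
git commit -q --allow-empty -m "Start develop branch"
git push -q -u origin develop     # first push: create the remote branch and track it
git ls-remote --heads origin      # both main and develop now exist on the "remote"
git push -q origin develop        # subsequent pushes need no -u
```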
Merging Into Main Branch
Once you have tested your changes on the develop branch, you will want to merge them onto the main branch. To do so, make sure you have all your changes committed and switch to main:
$ git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
To merge the develop branch on top of main, do:
$ git merge develop
Updating 05e1ffb..be60389
Fast-forward
inflammation/models.py | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
If there are no conflicts, Git will merge the branches without complaining and replay all commits from develop on top of the last commit from main. If there are merge conflicts (e.g. a team collaborator modified the same portion of the same file you are working on and checked in their changes before you), the particular files with conflicts will be marked and you will need to resolve those conflicts and commit the changes before attempting to merge again.
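A conflict is easy to provoke (and resolve) in a throwaway repository; everything below - file, branches, wording - is invented purely for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com" && git config user.name "Demo User"
git checkout -q -b main

echo "a 2d array" > notes.txt
git add notes.txt && git commit -q -m "Initial notes"

git checkout -q -b develop
echo "a 2D array" > notes.txt
git commit -q -am "Capitalise 2D"

git checkout -q main
echo "a 2-d array" > notes.txt
git commit -q -am "Hyphenate 2-d"

git merge develop || true      # both branches changed the same line: CONFLICT
cat notes.txt                  # shows <<<<<<< HEAD / ======= / >>>>>>> conflict markers

echo "a 2D array" > notes.txt  # resolve by keeping the develop wording
git add notes.txt
git commit -q -m "Merge branch 'develop', resolving conflict in notes.txt"
```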
Since we have no conflicts, we can now push the main branch to the remote repository:
git push origin main
All Branches Are Equal
In Git, all branches are equal - there is nothing special about the main branch. It is called that by convention and is created by default, but it can also be called something else. A good example is the gh-pages branch, which is the main branch for website projects hosted on GitHub (rather than main, which can be safely deleted for such projects).
Keeping Main Branch Stable
Good software development practice is to keep the main branch stable while you and the team develop and test new functionalities on feature branches (which can be done in parallel and independently by different team members). The next step is to merge feature branches onto the develop branch, where more testing can occur to verify that the new features work well with the rest of the code (and not just in isolation). We talk more about different types of code testing in one of the following episodes.
Key Points
A branch is one version of your project that can contain its own set of commits.
Feature branches enable us to develop / explore / test new code features without affecting the stable main code.
Part 2: Improving and Managing Software Over its Lifetime
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What should we do to enable software reuse, encourage external feedback, and act on it?
Objectives
Apply theoretical and practical skills learnt so far within a team environment.
Prepare and release software for reuse and manage and act on feedback to improve it.
So far in this course we’ve focused on learning technical practices, tools, and infrastructure that help the development of software in a team environment, but in an individual setting. In this section of the course we look at how to improve the reusability of our software for others as well as ourselves, the importance of critical reflection, and what we need to take into account when sharing our code with others, in the context of working as a team. We’ll also be making use of skills learnt previously in the course.
The focus in this section will also move beyond software development to management: management of how the outside world interacts with and makes use of our software, how others can interact with ourselves to report issues, and the ways we can successfully manage software improvement in response to feedback.
In this section we will:
- Look at how to prepare our software for release, looking at what we actually mean by software reusability, the importance of good documentation, as well as what to consider when choosing an open source licence.
- Explore ways for us to track issues with our software registered by ourselves and external users, and how we should employ a critical mindset when reviewing software for reuse.
- Examine how we can manage the improvement of our software through feedback using agile management techniques. We’ll employ effort estimation of development tasks as a foundational tool for prioritising future team work, and use the MoSCoW approach and software development sprints to manage improvement. As we will see, it is very difficult to prioritise work effectively without knowing both its relative importance to others as well as the effort required to deliver those work items.
Key Points
For software to succeed it needs to be managed as well as developed.
Estimating the effort to deliver work items is a foundational tool for prioritising that work.
Preparing Software for Reuse
Overview
Teaching: 35 min
Exercises: 20 minQuestions
What can we do to make our programs reusable by others?
How should we document and license our code?
Objectives
Describe the different levels of software reusability
Use code linting tools to verify a program’s adherence to a Python coding style
Explain why documentation is important
Describe the minimum components of software documentation to aid reuse
Create a repository README file to guide others to successfully reuse a program
Understand other documentation components and where they are useful
Describe the basic types of open source software licence
Explain the importance of conforming to data policy and regulation
Prioritise and work on improvements for release as a team
Introduction
In previous episodes we’ve looked at skills, practices, and tools to help us design and develop software in a collaborative environment. In this lesson we’ll be looking at a critical piece of the development puzzle that builds on what we’ve learnt so far - sharing our software with others.
The Levels of Software Reusability - Good Practice Revisited
Let’s begin by taking a closer look at software reusability and what we want from it.
Firstly, whilst we want to ensure our software is reusable by others, as well as ourselves, we should be clear what we mean by ‘reusable’. There are a number of definitions out there, but a helpful one written by Benureau and Rougier in 2017 offers the following levels by which software can be characterised:
- Re-runnable: the code is simply executable and can be run again (but there are no guarantees beyond that)
- Repeatable: the software will produce the same result more than once
- Reproducible: published research results generated from the same version of the software can be generated again from the same input data
- Reusable: easy to use, understand, and modify
- Replicable: the software can act as an available reference for any ambiguity in the algorithmic descriptions made in the published article. That is, a new implementation can be created from the descriptions in the article that provide the same results as the original implementation, and that the original - or reference - implementation, can be used to clarify any ambiguity in those descriptions for the purposes of reimplementation
Later levels imply the earlier ones. So what should we aim for? As researchers who develop software - or developers who write research software - we should be aiming for at least the fourth one: reusability. Reproducibility is required if we are to successfully claim that what we are doing when we write software fits within acceptable scientific practice, but it is also crucial that we write software that can be understood and, ideally, modified by others, so they can verify that it follows the published algorithms. Where ‘others’, of course, can include a future version of ourselves.
Documenting Code to Improve Reusability
Reproducibility is a cornerstone of science, and scientists who work in many disciplines are expected to document the processes by which they’ve conducted their research so it can be reproduced by others. In medicinal, pharmacological, and similar research fields for example, researchers use logbooks which are then used to write up protocols and methods for publication.
Many things we’ve covered so far contribute directly to making our software reproducible - and indeed reusable - by others. A key part of this we’ll cover now is software documentation, which is ironically very often given short shrift in academia. This is often the case even in fields where the documentation and publication of research method is otherwise taken very seriously.
A few reasons for this are that writing documentation is often considered:
- A low priority compared to actual research (if it’s even considered at all)
- Expensive in terms of effort, with little reward
- Boring to write!
Code commenting is a very useful form of documentation for understanding our code, and is most effective when used to explain complex interfaces or behaviour, or the reasoning behind why something is coded a certain way. But code comments only go so far.
Whilst it’s certainly arguable that writing documentation isn’t as exciting as writing code, it doesn’t have to be expensive and brings many benefits. In addition to enabling general reproducibility by others, documentation…
- Helps bring new staff researchers and developers up to speed quickly with using the software
- Functions as a great aid to research collaborations involving software, where those from other teams need to use it
- When well written, can act as a basis for detailing algorithms and other mechanisms in research papers, such that the software’s functionality can be replicated and re-implemented elsewhere
- Provides a descriptive link back to the science that underlies it. As a reference, it makes it far easier to know how to update the software as the scientific theory changes (and potentially vice versa)
- Importantly, it can enable others to understand the software sufficiently to modify and reuse it to do different things
In the next section we’ll see that writing a sensible minimum set of documentation in a single document doesn’t have to be expensive, and can greatly aid reproducibility.
Writing a README
A README file is the first piece of documentation (perhaps other than publications that refer to it) that people should read to acquaint themselves with the software. It concisely explains what the software is about and what it’s for, and covers the steps necessary to obtain and install the software and use it to accomplish basic tasks. Think of it not as a comprehensive reference of all functionality, but more a short tutorial with links to further information - hence it should contain brief explanations and be focused on instructional steps.
Our repository already has a README that describes the purpose of the repository for this workshop, but let’s replace it with a new one that describes the software itself. First let’s delete the old one:
$ rm README.md
In the root of your repository create a replacement README.md file. The .md extension indicates this is a Markdown file - a lightweight markup language that is basically plain text with some extra syntax for formatting. A big advantage of Markdown files is that they can be read as plain text or rendered with formatting, and they are very quick to write. GitHub provides a very useful [guide to writing markdown][github-markdown] for its repositories.
Let’s start writing it.
# Inflam
So here, we’re giving our software a name. Ideally something unique, short, snappy, and perhaps to some degree an indicator of what it does. We would ideally rename the repository to reflect the new name, but let’s leave that for now. In Markdown, # designates a heading, ## a subheading, and so on. The Software Sustainability Institute [guide on naming projects][ssi-choosing-name] and products provides some helpful pointers.
We should also add a short description.
Inflam is a data management system written in Python that manages trial data used in clinical inflammation studies.
To give readers an idea of the software’s capabilities, let’s add some key features next:
## Main features
Here are some key features of Inflam:
- Provide basic statistical analyses over clinical trial data
- Ability to work on trial data in Comma-Separated Value (CSV) format
- Generate plots of trial data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture
As well as knowing what the software aims to do and its key features, it’s very important to specify what other software and related dependencies are needed to use the software (typically called dependencies or prerequisites):
## Prerequisites
Inflam requires the following Python packages:
- [NumPy](https://www.numpy.org/) - makes use of NumPy's statistical functions
- [Matplotlib](https://matplotlib.org/stable/index.html) - uses Matplotlib to generate statistical plots
The following optional packages are required to run Inflam's unit tests:
- [pytest](https://docs.pytest.org/en/stable/) - Inflam's unit tests are written using pytest
- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
Here we’re making use of markdown links, with some text describing the link within [] followed by the link itself within ().
One really neat feature - and a common practice - when using many CI infrastructures is that we can include the status of recent test runs within our README file. Just below the # Inflam title in our README.md file, add the following (replacing <your_github_username> with your own):
# Inflam

This will embed a badge at the top of our page that reflects the most recent GitHub Actions build status of our repository, essentially showing whether the tests that were run when the last change was made to the main branch succeeded or failed.
That’s got us started, but there are other aspects we should also cover:
- Installation/deployment: step-by-step instructions for setting up the software so it can be used
- Basic usage: step-by-step instructions that cover using the software to accomplish basic tasks
- Contributing: for those wishing to contribute to the software’s development, this is an opportunity to detail what kinds of contribution are sought and how to get involved
- Contact information/getting help: which may include things like key author email addresses, and links to mailing lists and other resources
- Credits/Acknowledgements: where appropriate, be sure to credit those who have helped in the software’s development or inspired it
- Citation: particularly for academic software, it’s a very good idea to specify a reference to an appropriate academic publication so other academics can cite use of the software in their own publications and media. You can do this within a separate CITATION text file within the repository’s root directory and link to it from the markdown
- Licence: a short description of and link to the software’s licence
For more verbose sections, there are usually just highlights in the README with links to further information, which may be held within other markdown files within the repository or elsewhere.
We’ll finish these off later. See Matias Singer’s curated list of awesome READMEs for inspiration.
Other Documentation
There are many different types of other documentation you should also consider writing and making available that’s beyond the scope of this course. The key is to consider which audiences you need to write for, e.g. end users, developers, maintainers, etc., and what they need from the documentation. There’s a Software Sustainability Institute blog post on best practices for research software documentation that helpfully covers the kinds of documentation to consider and other effective ways to convey the same information.
One that you should always consider is technical documentation. This typically aims to help other developers understand your code sufficiently well to make their own changes to it, which could include other members in your team (and as we said before, also a future version of yourself). This may include documentation that covers the software’s architecture, including the different components and how they fit together, API (Application Programmer Interface) documentation that describes the interface points designed into your software for other developers to use, e.g. for a software library, or technical tutorials/’how tos’ to accomplish developer-oriented tasks.
Choosing an Open Source Licence
Software licensing can be a whole topic in itself, so we’ll just summarise here. Your institution’s Intellectual Property (IP) team will be able to offer specific guidance that fits the way your institution thinks about software.
In IP law, software is considered a creative work of literature, so any code you write automatically has copyright protection applied. This copyright will usually belong to the institution that employs you, but this may be different for PhD students. If you need to check, look at your employment / studentship contract or talk to your university’s IP team.
Since software is automatically under copyright, without a licence no one may:
- Copy it
- Distribute it
- Modify it
- Extend it
- Use it (actually unclear at present - this has not been properly tested in court yet)
Fundamentally there are two kinds of licence, Open Source licences and Proprietary licences, which serve slightly different purposes:
- Proprietary licences are designed to pass on limited rights to end users, and are most suitable if you want to commercialise your software. They tend to be customised to suit the requirements of the software and the institution to which it belongs - again, your institution’s IP team will be able to help here.
- Open Source licences are designed more to protect the rights of end users - they specifically grant permission to make modifications and redistribute the software to others. The website Choose A License provides recommendations and a simple summary of some of the most common open source licences.
Within the open source licences, there are two categories, copyleft and permissive:
- The permissive licences such as MIT and the multiple variants of the BSD licence are designed to give maximum freedom to the end users of software. These licences allow the end user to do almost anything with the source code.
- The copyleft licences, such as the GPL, still give a lot of freedom to the end users, but any code that they write based on GPL-licensed code must also be released under the same licence. This gives the developer assurance that anyone building on their code is also contributing back to the community. It's actually a little more complicated than this, and the variants all have slightly different conditions and applicability, but this is the core of the licence.
Which of these types of licence you prefer is up to you and those you develop code with. If you want more information, or help choosing a licence, the Choose An Open-Source Licence or tl;dr Legal sites can help.
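Once you have chosen a licence, it helps to record it in your project's metadata as well as in a `LICENSE` file at the repository root, so packaging tools and users can find it. Below is a sketch of what this might look like in a Python project's `pyproject.toml`; the project name and choice of MIT are purely illustrative:

```toml
# Illustrative fragment of a pyproject.toml recording a chosen licence.
# The full licence text should also live in a LICENSE file at the
# repository root.
[project]
name = "inflammation-analysis"
license = {text = "MIT"}
```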
Preparing for Release
In a (hopefully) highly unlikely and thoroughly unrecommended scenario, your project leader has informed you of the need to release your software within the next half hour, so it can be assessed for use by another team. You’ll need to consider finishing the README, choosing a licence, and fixing any remaining problems you are aware of in your codebase. Ensure you prioritise and work on the most pressing issues first!
Time: 20 mins
Merging into main
Once you've done these updates, commit your changes, and if you're doing this work on a feature branch also ensure you merge it into `develop`, e.g.:
$ git checkout develop
$ git merge my-feature-branch
Finally, once we've fully tested our software and are confident it works as expected on `develop`, we can merge our `develop` branch into `main`:
$ git checkout main
$ git merge develop
$ git push
Tagging a Release in GitHub
There are many ways in which Git and GitHub can help us make a software release from our code. One of these is via tagging, where we attach a human-readable label to a specific commit. Let’s see what tags we currently have in our repository:
$ git tag
Since we haven’t tagged any commits yet, there’s unsurprisingly no output. We can create a new tag on the last commit we did by doing:
$ git tag -a v1.0.0 -m "Version 1.0.0"
So we can now do:
$ git tag
v1.0.0
And also, for more information:
$ git show v1.0.0
You should see something like this:
tag v1.0.0
Tagger: <Name> <email>
Date: Fri Dec 10 10:22:36 2021 +0000
Version 1.0.0
commit 2df4bfcbfc1429c12f92cecba751fb2d7c1a4e28 (HEAD -> main, tag: v1.0.0, origin/main, origin/develop, origin/HEAD, develop)
Author: <Name> <email>
Date: Fri Dec 10 10:21:24 2021 +0000
Finalising README.
diff --git a/README.md b/README.md
index 4818abb..5b8e7fd 100644
--- a/README.md
+++ b/README.md
@@ -22,4 +22,33 @@ Flimflam requires the following Python packages:
The following optional packages are required to run Flimflam's unit tests:
- [pytest](https://docs.pytest.org/en/stable/) - Flimflam's unit tests are written using pytest
-- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
\ No newline at end of file
+- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
+
+## Installation
+- Clone the repo ``git clone repo``
+- Install via ``pip install -e .``
+- Check everything runs by running ``pytest`` in the root directory
+- Hurray 😊
+
+## Contributing
+- Create an issue [here](https://github.com/Onoddil/python-intermediate-inflammation/issues)
+ - What works, what doesn't? You tell me
+- Randomly edit some code and see if it improves things, then submit a [pull request](https://github.com/Onoddil/python-intermediate-inflammation/pulls)
+- Just yell at me while I edit the code, pair programmer style!
+
+## Getting Help
+- Nice try
+
+## Credits
+- Directed by Michael Bay
+
+## Citation
+Please cite [J. F. W. Herschel, 1829, MmRAS, 3, 177](https://ui.adsabs.harvard.edu/abs/1829MmRAS...3..177H/abstract) if you used this work in your day-to-day life.
+Please cite [C. Herschel, 1787, RSPT, 77, 1](https://ui.adsabs.harvard.edu/abs/1787RSPT...77....1H/abstract) if you actually use this for scientific work.
+
+## License
+This source code is protected under international copyright law. All rights
+reserved and protected by the copyright holders.
+This file is confidential and only available to authorized individuals with the
+permission of the copyright holders. If you encounter this file and do not have
+permission, please contact the copyright holders and delete this file.
\ No newline at end of file
Now that we've added a tag, we need this reflected in our GitHub repository. You can push this tag to your remote by doing:
$ git push origin v1.0.0
What is a Version Number Anyway?
Software version numbers are everywhere, and there are many different schemes for assigning them. A popular one to consider is Semantic Versioning, where a given version number uses the format MAJOR.MINOR.PATCH. You increment the:
- MAJOR version when you make incompatible API changes
- MINOR version when you add functionality in a backwards compatible manner
- PATCH version when you make backwards compatible bug fixes
You can also add a hyphen followed by characters to denote a pre-release version, e.g. 1.0.0-alpha1 (the first alpha release of version 1.0.0) or 1.2.3-beta4 (the fourth beta release of version 1.2.3).
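The ordering rules above can be sketched in a few lines of Python. This is a simplified illustration only - for real projects you would use a dedicated library (such as `packaging`) rather than hand-rolling version comparison - but it captures the key rule that a pre-release sorts *before* its corresponding final release:

```python
# Simplified sketch of comparing Semantic Versioning strings.
# Not a full implementation of the SemVer spec - for illustration only.

def parse_semver(version):
    """Split a 'MAJOR.MINOR.PATCH[-prerelease]' string into a comparable tuple."""
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    # A pre-release sorts *before* the corresponding final release,
    # so insert a flag: 0 for pre-releases, 1 for final releases.
    return (major, minor, patch, 0 if prerelease else 1, prerelease)

print(parse_semver("1.0.0-alpha1") < parse_semver("1.0.0"))  # True
print(parse_semver("1.0.0") < parse_semver("1.1.0"))         # True
```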
We can now use the more memorable tag to refer to this specific commit. Plus, once we've pushed this back up to GitHub, it appears as a specific release within our code repository which can be downloaded in compressed `.zip` or `.tar.gz` formats. Note that these downloads just contain the state of the repository at that commit, and not its entire history.
Using features like tagging allows us to highlight commits that are particularly important, which is very useful for reproducibility purposes. We can (and should) refer to specific commits in academic papers that make use of results from the software, but tagging with a specific version number makes that just a little bit easier for humans.
Conforming to Data Policy and Regulation
We may also wish to make data available to either be used with the software or as generated results. This may be via GitHub or some other means. An important aspect to remember with sharing data on such systems is that they may reside in other countries, and we must be careful depending on the nature of the data.
We need to ensure that we are still conforming to the relevant policies and guidelines regarding how we manage research data, which may include funding council, institutional, national, and even international policies and laws. Within Europe, for example, there's the need to conform to regulations such as GDPR. It's a very good idea to make yourself aware of these aspects.
Key Points
The reuse battle is won before it is fought. Select and use good practices consistently throughout development and not just at the end.
Assessing Software for Suitability and Improvement
Overview
Teaching: 15 min
Exercises: 30 min
Questions
What makes good code actually good?
What should we look for when selecting software to reuse?
Objectives
Explain why a critical mindset is important when selecting software
Register a new issue with our code on our repository
Describe some different types of issues we can have with software
Conduct an assessment of software against suitability criteria
Describe what should be included in software issue reports and register them
Introduction
What we’ve been looking at so far enables us to adopt a more proactive and diligent attitude when developing our own software. But we should also adopt this attitude when selecting and making use of third-party software we wish to use. With pressing deadlines it’s very easy to reach for a piece of software that appears to do what you want without considering properly whether it’s a good fit for your project first. A chain is only as strong as its weakest link, and our software may inherit weaknesses in any dependent software or create other problems.
Overall, when adopting software to use it’s important to consider not only whether it has the functionality you want, but a broader range of qualities that are important for your project.
Using Issues to Record Problems With Software
As a piece of software is used, bugs and other issues will inevitably come to light - nothing is perfect! If you work on your code with collaborators, or have non-developer users, it can be helpful to have a single shared record of all the problems people have found with the code, not only to keep track of them for you to work on later, but to avoid the annoyance of people emailing you to report a bug that you already know about!
GitHub provides a framework (as does GitLab!) for managing bug reports, feature requests, and lists of future work - Issues.
Go back to the home page for your `python-intermediate-inflammation` repository, and click on the Issues tab.
You should see a page listing the open issues on your repository, currently none.
Let's go through the process of creating a new issue. Start by clicking the `New issue` button.
When you create an issue, you can add a range of details to it. Issues can be assigned to a specific developer, for example - this can be a helpful way to know who, if anyone, is currently working to fix an issue (or a way to assign responsibility to someone to deal with it!).
They can also be assigned a label. The labels available for issues can be customised, and given a colour, allowing you to see at a glance from the Issues page the state of your code. The default labels include:
- Bug
- Documentation
- Enhancement
- Help Wanted
- Question
The Enhancement label can be used to create issues that request new features, or if they are created by a developer, indicate planned new features. As well as highlighting problems, the Bug label can make code much more usable by allowing users to find out if anyone has had the same problem before, and also how to fix (or work around) it on their end. Enabling users to solve their own problems can save you a lot of time and stress!
In general, a good bug report should contain only one bug, specific details of the environment in which the issue appeared (operating system or browser, version of the software and its dependencies), and sufficiently clear and concise steps that allow a developer to reproduce the bug themselves. They should also be clear on what the bug reporter considers factual (“I did this and this happened”) and speculation (“I think it was caused by this”). If an error report was generated from the software itself, it’s a very good idea to include that in the bug report.
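On GitHub, you can nudge bug reporters towards this structure by adding an issue template to your repository (GitHub reads templates from the `.github/ISSUE_TEMPLATE` directory). The sketch below shows one possible layout; the exact fields are illustrative and should be adapted to your project:

```markdown
## Describe the bug
A clear and concise description of what the bug is.

## Environment
- OS: [e.g. Ubuntu 20.04]
- Python version: [e.g. 3.9.7]
- Software version: [e.g. v1.0.0]

## To reproduce
Steps to reproduce the behaviour:
1. ...

## Expected behaviour
What you expected to happen (facts first; put any speculation
about the cause in a separate section).

## Error output
Paste any error message or traceback produced by the software.
```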
The Enhancement label is a great way to communicate your future priorities to your collaborators, and also your future self - it's far too easy to leave a software project for a few months to write a paper, then come back and find you've forgotten the improvements you were going to make. If you have other users for your code, they can use the label to request new features, or changes to the way the code operates. It's generally worth paying attention to these suggestions, especially if you spend more time developing than running the code. It can be very easy to end up with quirky behaviour because of off-the-cuff choices made during development. Extra pairs of eyes can point out ways the code can be made more accessible, and the easier a code is to use, the more widely it will be adopted and the greater its impact will be.
Wontfix
One interesting label is Wontfix, which indicates that an issue simply won’t be worked on for whatever reason. Maybe the bug it reports is outside of the use case of the software, or the feature it requests simply isn’t a priority.
The Lock issue and Pin issue buttons allow you to block future comments on an issue, and pin it to the top of the issues page. This can make it clear you’ve thought about an issue and dismissed it!
Having open, publicly-visible lists of the limitations and problems with your code is incredibly helpful. Even if some issues end up languishing unfixed for years, letting users know about them can save them a huge amount of work attempting to fix what turns out to be an unfixable problem on their end. It can also help you see at a glance what state your code is in, making it easier to prioritise future work!
Our First Issue!
Thinking back to the previous exercise on what makes good code, think with a critical eye of an aspect of the code you have developed so far that needs improvement. It could be a bug, for example, or a documentation issue with your README, or an enhancement. Enter the details of the issue with a suitable label and select `Submit new issue`.
Time: 5 mins
Mentions
As lots of bugs will have similar roots, GitHub lets you reference one issue from another. Whilst writing the description of an issue, or commenting on one, if you type # you should see a list of the issues and pull requests on the repository. They are coloured green if they’re open, or white if they’re closed. Continue typing the issue number, and the list will narrow, then you can hit Return to select the entry and link the two. You can also navigate the list with the ↑ and ↓ arrow keys.
If you realise that several of your bugs have common roots, or that one Enhancement can’t be implemented before you’ve finished another, you can use the mention system to indicate which. This is a simple way to add much more information to your issues.
You can also use the mention system to link GitHub accounts. Instead of #, typing @ will bring up a list of accounts linked to the repository. Users will receive notifications when somebody else references them which you can use to notify people when you want to check a detail with them, or let them know something has been fixed (much easier than writing out all the same information again in an email!).
You Are A User
This section focuses a lot on how issues can help communicate the current state of the code to others. As a sole developer, and possibly also the only user of the code too, you might be tempted to not bother with recording issues and features as you don’t need to communicate the information to anyone else.
Unfortunately, human memory isn’t infallible! After spending six months writing your thesis, or a year working on a different sub-topic, it’s inevitable you’ll forget some of the plans you had and problems you faced. Not documenting these things can lead to you having to re-learn things you already put the effort into discovering before.
Assessing Software for Suitability
Decide on your Group’s Repository!
You all have your code repositories you have been working on throughout the course so far. For the upcoming exercise, groups will exchange repositories and review the code of the repository they inherit, and provide feedback.
Time: 5 mins
- Decide as a team on one of your repositories that will represent your group. You can do this any way you wish.
- Add the URL of the repository to the section in the Google Doc labelled ‘Decide on your Group’s Repository’ for this day, next to your team name in the empty table cell
Conduct Assessment on Third-Party Software
The scenario: It is envisaged that a piece of software developed by another team will be adopted and used for the long term in a number of future projects. You have been tasked with conducting an assessment of this software to identify any issues that need resolving prior to working with it, and will provide feedback to the developing team to fix these issues.
Time: 20 mins
- As a team, briefly decide who will assess which aspect of the repository, e.g. its docs, tests, codebase, etc.
- Obtain the URL for the repository you will assess from the Google Doc, in the section labelled ‘Decide on your Group’s Repository’ - see the last column which indicates from which team you should get their repository URL
- Conduct the assessment and register any issues you find on the other team’s software repository
- Be meticulous in your assessment and register as many issues as you can!
Supporting Your Software - How and How Much?
Within your collaborations and projects, what should you do to support other users? Here are some key aspects to consider:
- Provide contact information: so users know what to do and how to get in contact if they run into problems
- Manage your support: an issue tracker - like the one in GitHub - is essential to track and manage issues
- Manage expectations: let users know the level of support you offer, in terms of when they can expect responses to queries, the scope of support (e.g. which platforms, types of releases, etc.), the types of support (e.g. bug resolution, helping develop tailored solutions), and expectations for support in the future (e.g. when project funding runs out)
All of this requires effort, and you can’t do everything. It’s therefore important to agree and be clear on how the software will be supported from the outset, whether it’s within the context of a single laboratory, project, or other collaboration, or across an entire community.
Key Points
It's as important to have a critical attitude when adopting software as it is when developing it.
We should use issues to keep track of software problems and other requests for change - even if we are the only developer and user.
As a team, agree on who will support the software you make available to others, and to what extent.
Software Improvement Through Feedback
Overview
Teaching: 5 min
Exercises: 45 min
Questions
How should we handle feedback on our software?
How, and to what extent, should we provide support to our users?
Objectives
Prioritise and work on externally registered issues
Respond to submitted issue reports and provide feedback
Explain the importance of software support and choosing a suitable level of support
Introduction
When a software project has been around for even just a short amount of time, you’ll likely discover many aspects that can be improved. These can come from issues that have been registered via collaborators or users, but also those you’re aware of internally, which should also be registered as issues. When starting a new software project, you’ll also have to determine how you’ll handle all the requirements. But which ones should you work on first, which are the most important and why, and how should you organise all this work?
Software has a fundamental role to play in doing science, but unfortunately software development is often given short shrift in academia when it comes to prioritising effort. There are also many other draws on our time in addition to the research, development, and writing of publications that we do, which makes it all the more important to prioritise our time for development effectively.
In this lesson we’ll be looking at prioritising work we need to do and what we can use from the agile perspective of project management to help us do this in our software projects.
Estimation as a Foundation for Prioritisation
For simplicity, we’ll refer to our issues as requirements, since that’s essentially what they are - new requirements for our software to fulfil.
But before we can prioritise our requirements, there are some things we need to find out.
Firstly, we need to know:
- The period of time we have to resolve these requirements - e.g. before the next software release, pivotal demonstration, or other deadlines requiring their completion. This is known as a timebox. This might be a week or two, but for agile, this should not be longer than a month. Longer deadlines with more complex requirements may be split into a number of timeboxes.
- How much overall effort we have available - i.e. who will be involved and how much of their time we will have during this period
We also need estimates for how long each requirement will take to resolve, since we cannot meaningfully prioritise requirements without knowing what the effort tradeoffs will be. Even if we know how important each requirement is, how would we even know if completing the project is possible? Or if we don’t know how long it will take to deliver those requirements we deem to be critical to the success of a project, how can we know if we can include other less important ones?
Ideally, estimation should be done by the people likely to do the actual work (i.e. the Research Software Engineers, researchers, or developers), although this is often not the reality. It shouldn't be done by project managers or PIs, who are not best placed to estimate; moreover, those doing the work are the ones effectively committing to these figures.
Why is it so Difficult to Estimate?
Estimation is a very valuable skill to learn, and one that is often difficult. Lack of experience in estimation can play a part, but a number of psychological causes can also contribute. One of these is the Dunning-Kruger effect, a type of cognitive bias in which people tend to overestimate their abilities; in opposition to this is imposter syndrome, where due to a lack of confidence people underestimate their abilities. The key message here is to be honest about what you can do, and to find out as much information as is reasonably appropriate before arriving at an estimate.
More experience in estimation will also help to reduce these effects. So keep estimating!
An effective way of helping to make your estimates more accurate is to do it as a team. Other members can ask prudent questions that may not have been considered, and bring in other sanity checks and their own development experience. Just talking things through can help uncover other complexities and pitfalls, and raise crucial questions to clarify ambiguities.
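When collecting anonymous estimates from a team (as in the exercise below), it can help to look at both a central figure and the spread before discussing. The numbers in this sketch are made up for illustration:

```python
# Simple sketch of combining anonymous team estimates (in minutes).
# A wide spread between the lowest and highest estimate is a cue to
# discuss assumptions before settling on a final figure.
from statistics import median

estimates = [15, 20, 20, 60]  # one member sees a complication the others missed

print(f"median: {median(estimates)} min")                # a starting point for discussion
print(f"spread: {max(estimates) - min(estimates)} min")  # large -> talk it through
```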
Estimate!
As a team go through the issues that your partner team has registered with your software repository, and quickly estimate how long each issue will take to resolve in minutes. Do this by blind consensus first, each anonymously submitting an estimate, and then briefly discuss your rationale and decide on a final estimate. Make sure these are honest estimates, and you are able to complete them in the allotted time!
Time: 15 mins
Using MoSCoW to Prioritise Work
Now we have our estimates, we can decide how important each requirement is to the success of the project. This should be decided by the project stakeholders: those (or their representatives) who have a stake in the success of the project and either affect or are affected by it, e.g. Principal Investigators, researchers, Research Software Engineers, collaborators, etc.
To prioritise these requirements we can use a method called MoSCoW, a way to reach a common understanding with stakeholders on the importance of successfully delivering each requirement for a timebox. MoSCoW is an acronym that stands for Must have, Should have, Could have, and Won’t have. Each requirement is discussed by the stakeholder group and falls into one of these categories:
- Must Have (MH) - these requirements are critical to the current timebox for it to succeed. Even the inability to deliver just one of these would cause the project to be considered a failure.
- Should Have (SH) - these are important requirements but not necessary for delivery in the timebox. They may be as important as Must Haves, but there may be other ways to achieve them or perhaps they can be held back for a future development timebox.
- Could Have (CH) - these are desirable but not necessary, and each of these will be included in this timebox if it can be achieved.
- Won’t Have (WH) - these are agreed to be out of scope for this timebox, perhaps because they are the least important or not critical for this phase of development.
In typical use, the ratio of requirements to aim for across the MH/SH/CH categories is 60%/20%/20%. Importantly, the division is by the requirement estimates, not by the number of requirements, so 60% means 60% of the overall estimated effort goes to Must Haves.
Why is this important? Because it gives you a unique degree of control over your project. It awards you 40% flexibility in allocating your effort depending on what's critical and how things progress. This effectively forces a tradeoff between the effort available and critical objectives, maintaining a significant safety margin. The idea is that as a project progresses, even if it becomes clear that you are only able to deliver the Must Haves, you have still delivered a successful project.
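Because the 60/20/20 split is measured by estimated effort rather than by issue count, it's worth actually adding up the estimates per category. The short sketch below illustrates that calculation; the issue titles and estimates are made up for illustration:

```python
# Rough sketch of checking a MoSCoW split by *estimated effort*,
# not by number of requirements. Issues below are illustrative.

issues = [
    # (title, category, estimated effort in minutes)
    ("Fix off-by-one in patient data", "MH", 30),
    ("Add missing docstrings",         "MH", 60),
    ("Improve README install steps",   "SH", 20),
    ("Add CSV export",                 "CH", 30),
]

def effort_shares(issues):
    """Return each category's share of the total estimated effort."""
    total = sum(effort for _, _, effort in issues)
    shares = {}
    for _, category, effort in issues:
        shares[category] = shares.get(category, 0) + effort
    return {category: effort / total for category, effort in shares.items()}

shares = effort_shares(issues)
for category in ("MH", "SH", "CH"):
    print(f"{category}: {shares.get(category, 0):.0%}")
```

Here the Must Haves come out at 64% of the estimated effort, slightly over the 60% target, so you might discuss demoting one of them to a Should Have.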
Once we've decided on those we'll work on (i.e. not Won't Haves), we can optionally assign them to a GitHub issue milestone to organise them. A milestone is a collection of issues to be worked on in a given period (or timebox). We can create a new one by selecting `Issues` on our repository, then `Milestones` to display any existing milestones, then `New milestone`. We add in a title, a completion date (i.e. the end of this timebox), and any description for the milestone. Once created, we can view our issues and assign them to our milestone from the `Issues` page.
Prioritise!
Put your stakeholder hats on, and as a team apply MoSCoW to the repository issues to determine how you will prioritise effort to resolve them in the allotted time. Try to stick to the 60/20/20 rule, and assign all issues you’ll be working on (i.e. not Won’t Haves) to a new milestone, e.g. version 1.1
Time: 10 mins
Using Sprints to Organise and Work on Requirements
A sprint is an activity applied to a timebox, where development is undertaken on the agreed prioritised work for the period. In a typical sprint, there are daily meetings called scrum meetings which check on how work is progressing and serve to highlight any blockers and challenges to meeting the sprint goal.
Conduct a Mini-Mini-Sprint
For the remaining time in this lesson, assign repository issues to team members and work on resolving them as per your MoSCoW breakdown. Once an issue has been resolved, notable progress made, or an impasse has been reached, provide concise feedback on the repository issue. Be sure to add the other team members to the chosen repository so they have access to it. You can grant `Write` access to others on a GitHub repository via the `Settings` tab for a repository, then selecting `Manage access`, where you can invite other GitHub users to your repository with specific permissions.
Time: however long is left
Depending on how many issues were registered on your repository, it's likely you won't have resolved all of them in this first milestone. Of course, in reality, a sprint would run over a much longer period of time. In any event, as development progresses into future sprints, any unresolved issues can be reconsidered and prioritised for another milestone, and so on. This process of receiving new requirements, prioritising them, and working on them is naturally continuous. The benefit is that at key stages you repeatedly re-evaluate what is important and needs to be worked on, which helps to ensure real, concrete progress against project goals and requirements - which, particularly in academia, may change over time.
Key Points
Prioritisation is a key tool in academia where research goals can change and software development is often given short shrift.
In order to prioritise things to do we must first estimate the effort required to do them.
For accurate effort estimation, it should be done by the people who will actually do the work.
Aim to reduce cognitive biases in effort estimation by being honest about your abilities.
Ask other team members - or do estimation as a team - to help make accurate estimates.
MoSCoW is a useful tool for prioritising work to help ensure projects deliver successfully.
Aim for a 60%/20%/20% ratio of Must Haves/Should Haves/Could Haves for project requirements.
Survey