Part 1: Sustainable Code Development
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What tools are needed to collaborate on code development effectively?
Objectives
Provide an overview of all the different tools that will be used in this course.
The first section of the course is dedicated to setting up your environment for collaborative software development. In order to build working (research) software efficiently and to do it in collaboration with others rather than in isolation, you will have to get comfortable with using a number of different tools interchangeably as they’ll make your life a lot easier. There are many options when it comes to deciding on which software development tools to use for your daily tasks - we will use a few of them in this course that we believe make a difference. There are sometimes multiple tools for the job - we select one to use but mention alternatives too. As you get more comfortable with different tools and their alternatives, you will select the one that is right for you based on your personal preferences or based on what your collaborators are using.
Here is an overview of the tools we will be using.
Command Line & Virtual Development Environment
We will use the command line (also known as the command line shell/prompt/console) to run our code and interact with the version control tool Git and the software sharing platform GitHub. We will also use the command line tools `venv` and `pip` to set up a virtual development environment and isolate our software project from other projects we may work on.
Integrated Development Environment (IDE)
An IDE integrates a number of tools that we need to develop a software project that goes beyond a single script - including a smart code editor, a code compiler/interpreter, a debugger, etc. It will help you write well-formatted & readable code that conforms to code style guides (such as PEP8 for Python) more efficiently by giving relevant and intelligent suggestions for code completion and refactoring. IDEs often integrate command line console and version control tools - we teach them separately in this course as this knowledge can be ported to other programming languages and command line tools you may use in the future (but is applicable to the integrated versions too).
We will use PyCharm in this course - a free, open source IDE.
Git & GitHub
Git is a free and open source distributed version control system designed to save every change made to a (software) project, allowing others to collaborate and contribute. In this course, we use Git to version control our code in conjunction with GitHub for code backup and sharing. GitHub is one of the leading integrated products and social platforms for modern software development, monitoring and management - it will help us with version control, issue management, code review, code testing/Continuous Integration, and collaborative development.
Let’s get started with setting up our software development environment!
Key Points
In order to develop (write, test, debug, backup) code efficiently, you need to use a number of different tools.
When there is a choice of tools for a task you will have to decide which tool is right for you, which may be a matter of personal preference or what the community you belong to is using.
Introduction to a Software Project
Overview
Teaching: 20 min
Exercises: 10 min
Questions
What is the design architecture of a software project?
Why is splitting code into smaller functional units (modules) good when designing software?
Objectives
Use Git to obtain a working copy of our template software project from GitHub.
Inspect the structure and architecture of our software project.
Understand Model-View-Controller (MVC) architecture in software design and its use in our project.
Our Software Project
So, you have joined a software development team that has been working on the patient inflammation project developed in Python and stored on GitHub. The software project studies inflammation in patients who have been given a new treatment for arthritis and reuses the inflammation dataset from the novice Software Carpentry Python lesson. The dataset contains information for 60 patients, who had their inflammation levels recorded for 40 days (a snapshot of data is below).
The project analyses the data to study the effect of the new arthritis treatment by checking the inflammation records across all patients but it is not finished and contains some errors. You will be working on your own and in collaboration with others to fix and build on top of the existing code during the course.
To start with the development, we have to obtain a local copy of the project on your machine and inspect it. The first step is to create a copy of the software project repository from GitHub within your own GitHub account:
- Log into your GitHub account.
- Go to the template repository URL.
- Click the `Use this template` button towards the top right of the template repository's GitHub page to create a copy of the repository under your GitHub account (you will need to be signed into GitHub to see the `Use this template` button). Note that each participant is creating their own copy to work on. Also, we are not forking the repository but creating a copy (remember - you can have only one fork but can have multiple copies of a repository in GitHub).
- Make sure to select your personal account and set the name of the project to `python-intermediate-inflammation` (you can call it anything you like, but it may be easier for future group exercises if everyone uses the same name). Also set the new repository's visibility to 'Public' - so it can be seen by others and by third-party Continuous Integration (CI) services (to be covered later on in the course).
- Click the `Create repository from template` button and wait for GitHub to import the copy of the repository under your account.
- Locate the copied repository under your own GitHub account.
Obtain the Software Project Locally
Using the command line, clone the copied repository from your GitHub account into the home directory on your computer (to be consistent with the code examples and exercises in the course). Which command(s) would you use to get a detailed list of contents of the directory you have just cloned?
Solution
- Find the URL of the software project repository to clone from your GitHub account. Make sure you do not clone the original template repository but rather your own copy, as you should be able to push commits to it later on.
- Make sure you are located in your home directory in the command line with: `cd ~`
- From your home directory, do: `git clone https://github.com/<YOUR_GITHUB_USERNAME>/python-intermediate-inflammation`. Make sure you are cloning your copy of the software project and not the template repo.
- Navigate into the cloned repository in your command line with: `cd python-intermediate-inflammation`
- List the contents of the directory: `ls -l`. Remember the `-l` flag of the `ls` command and also how to get help for commands in the command line using manual pages, e.g.: `man ls`.
Our Software Project Structure
Let’s inspect the content of the software project from the command line. From the root directory of the project, you can
use the command ls -l
to get a more detailed list of the contents. You should see something similar to the following.
$ cd ~/python-intermediate-inflammation
$ ls -l
total 24
-rw-r--r-- 1 carpentry users 1055 20 Apr 15:41 README.md
drwxr-xr-x 18 carpentry users 576 20 Apr 15:41 data
drwxr-xr-x 5 carpentry users 160 20 Apr 15:41 inflammation
-rw-r--r-- 1 carpentry users 1122 20 Apr 15:41 inflammation-analysis.py
drwxr-xr-x 4 carpentry users 128 20 Apr 15:41 tests
As can be seen from the above, our software project contains the `README` file (that typically describes the project, its usage, installation, authors and how to contribute), the Python script `inflammation-analysis.py`, and three directories - `inflammation`, `data` and `tests`.
The Python script `inflammation-analysis.py` provides the main entry point of the application, and on closer inspection we can see that the `inflammation` directory contains two more Python scripts - `views.py` and `models.py`. We will have a more detailed look into these shortly.
$ ls -l inflammation
total 24
-rw-r--r-- 1 alex staff 71 29 Jun 09:59 __init__.py
-rw-r--r-- 1 alex staff 838 29 Jun 09:59 models.py
-rw-r--r-- 1 alex staff 649 25 Jun 13:13 views.py
The `data` directory contains several files with patients' daily inflammation information.
$ ls -l data
total 264
-rw-r--r-- 1 alex staff 5365 25 Jun 13:13 inflammation-01.csv
-rw-r--r-- 1 alex staff 5314 25 Jun 13:13 inflammation-02.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-03.csv
-rw-r--r-- 1 alex staff 5367 25 Jun 13:13 inflammation-04.csv
-rw-r--r-- 1 alex staff 5345 25 Jun 13:13 inflammation-05.csv
-rw-r--r-- 1 alex staff 5330 25 Jun 13:13 inflammation-06.csv
-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-07.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-08.csv
-rw-r--r-- 1 alex staff 5327 25 Jun 13:13 inflammation-09.csv
-rw-r--r-- 1 alex staff 5342 25 Jun 13:13 inflammation-10.csv
-rw-r--r-- 1 alex staff 5127 25 Jun 13:13 inflammation-11.csv
-rw-r--r-- 1 alex staff 5340 25 Jun 13:13 inflammation-12.csv
-rw-r--r-- 1 alex staff 22554 25 Jun 13:13 python-novice-inflammation-data.zip
-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-01.csv
-rw-r--r-- 1 alex staff 15 25 Jun 13:13 small-02.csv
-rw-r--r-- 1 alex staff 12 25 Jun 13:13 small-03.csv
The data is stored in a series of comma-separated values (CSV) files, where:
- each row holds inflammation measurements for a single patient (in some arbitrary units of inflammation),
- columns represent successive days.
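To make the data layout concrete, here is a minimal sketch (not part of the project's code) of reading such a file with NumPy, using a tiny in-memory stand-in for one of the inflammation CSV files:

```python
import io

import numpy as np

# A tiny stand-in for a file such as data/inflammation-01.csv:
# each row is one patient, each column one day of readings.
sample = io.StringIO("0,0,1,3\n0,1,1,2\n0,1,1,1\n")

data = np.loadtxt(sample, delimiter=",")

print(data.shape)         # (3, 4) - 3 patients, 4 days
print(data.mean(axis=0))  # average inflammation per day across patients
```

In the real project, replacing the `StringIO` object with the path `data/inflammation-01.csv` would load the full 60-patient, 40-day dataset.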
Have a Peek at the Data
Which command(s) would you use to list the contents or the first few lines of the `data/inflammation-01.csv` file?
Solution
- To list the entire content from the project root do: `cat data/inflammation-01.csv`.
- To list the first 5 lines from the project root do: `head -n 5 data/inflammation-01.csv`.
0,0,1,3,2,3,6,4,5,7,2,4,11,11,3,8,8,16,5,13,16,5,8,8,6,9,10,10,9,3,3,5,3,5,4,5,3,3,0,1
0,1,1,2,2,5,1,7,4,2,5,5,4,6,6,4,16,11,14,16,14,14,8,17,4,14,13,7,6,3,7,7,5,6,3,4,2,2,1,1
0,1,1,1,4,1,6,4,6,3,6,5,6,4,14,13,13,9,12,19,9,10,15,10,9,10,10,7,5,6,8,6,6,4,3,5,2,1,1,1
0,0,0,1,4,5,6,3,8,7,9,10,8,6,5,12,15,5,10,5,8,13,18,17,14,9,13,4,10,11,10,8,8,6,5,5,2,0,2,0
0,0,1,0,3,2,5,4,8,2,9,3,3,10,12,9,14,11,13,8,6,18,11,9,13,11,8,5,5,2,8,5,3,5,4,1,3,1,1,0
The `tests` directory contains several tests that have been implemented already. We will be adding more tests during the course as our code grows.
An important thing to note here is that the structure of the project is not arbitrary. One of the big differences between novice and intermediate software development is planning the structure of your code. This structure includes software components and behavioural interactions between them (including how these components are laid out in a directory and file structure). A novice will often make up the structure of their code as they go along. However, for more advanced software development, we need to plan this structure - called a software architecture - beforehand.
Let’s have a more detailed look into what a software architecture is and which architecture is used by our software project before we start adding more code to it.
Software Architecture
A software architecture is the fundamental structure of a software system that is decided at the beginning of project development and cannot be changed that easily once implemented. It refers to a “bigger picture” of a software system that describes high-level components (modules) of the system and how they interact.
In software design and development, large systems or programs are often decomposed into a set of smaller
modules each with a subset of functionality. Typical examples of modules in programming are software libraries;
some software libraries, such as numpy
and matplotlib
in Python, are bigger modules that contain several
smaller sub-modules. Classes in object-oriented programming languages are another example of modules.
Programming Modules and Interfaces
Although modules are self-contained and independent elements to a large extent (they can depend on other modules), there are well-defined ways of how they interact with one another. These rules of interaction are called programming interfaces - they define how other modules (clients) can use a particular module. Typically, an interface to a module includes rules on how a module can take input from and how it gives output back to its clients. A client can be a human, in which case we also call these user interfaces. Even smaller functional units such as functions/methods have clearly defined interfaces - a function/method’s definition (also known as a signature) states what parameters it can take as input and what it returns as an output.
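To illustrate this with a hypothetical function (not one from the project), a Python signature spells out exactly such an interface - what a client must supply and what it gets back:

```python
def daily_mean(data: list[list[float]]) -> list[float]:
    """Return the mean inflammation per day across all patients.

    The signature is the interface: clients must supply a 2D list
    of numbers (rows = patients, columns = days) and will receive
    one mean value per day in return.
    """
    n_patients = len(data)
    return [sum(day) / n_patients for day in zip(*data)]

print(daily_mean([[1, 2], [3, 4]]))  # [2.0, 3.0]
```

A client does not need to know how the means are computed - only the inputs and outputs stated in the signature.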
There are various software architectures that define different ways of dividing code into smaller modules with well-defined roles, for example:
- Model–View–Controller (MVC) architecture, which we will look into in detail and use for our software project,
- Service-Oriented Architecture (SOA), which separates code into distinct services, accessible over a network by consumers (users or other services) that communicate with each other by passing data in a well-defined, shared format (protocol),
- Client-Server architecture, where clients request content or service from a server, initiating communication sessions with servers, which await incoming requests (e.g. email, network printing, the Internet),
- Multilayer architecture, in which presentation, application processing and data management functions are split into distinct layers and may even be physically separated to run on separate machines - some more detail on this later in the course.
Model-View-Controller (MVC) Architecture
MVC architecture divides the related program logic into three interconnected modules:
- Model (data)
- View (client interface), and
- Controller (processes that handle input/output and manipulate the data).
Model represents the data used by a program and also contains operations/rules for manipulating and changing the data in the model. This may be a database, a file, a single data object or a series of objects - for example a table representing patients’ data.
View is the means of displaying data to users/clients within an application (i.e. provides visualisation of the state of the model). For example, displaying a window with input fields and buttons (Graphical User Interface, GUI) or textual options within a command line (Command Line Interface, CLI) are examples of Views. They include anything that the user can see from the application. While building GUIs is not the topic of this course, we will cover building CLIs in Python in later episodes.
Controller manipulates both the Model and the View. It accepts input from the View and performs the corresponding action on the Model (changing the state of the model) and then updates the View accordingly. For example, on user request, Controller updates a picture on a user’s GitHub profile and then modifies the View by displaying the updated profile back to the user.
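The three roles can be sketched in a few lines of Python (the names here are made up for illustration and are not the project's actual classes):

```python
# Model: holds the data plus the rules for changing it.
class PatientModel:
    def __init__(self):
        self.inflammation = {}  # patient id -> list of daily readings

    def add_reading(self, patient_id, value):
        self.inflammation.setdefault(patient_id, []).append(value)


# View: only knows how to present the Model's data, not change it.
def text_view(model):
    return "\n".join(f"patient {pid}: {readings}"
                     for pid, readings in model.inflammation.items())


# Controller: accepts input, updates the Model, then refreshes the View.
def record_reading(model, patient_id, value):
    model.add_reading(patient_id, value)
    return text_view(model)


model = PatientModel()
print(record_reading(model, 1, 3))  # patient 1: [3]
```

Note how the View never touches the data directly and the Model knows nothing about presentation - the Controller mediates between the two.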
MVC Examples
MVC architecture can be applied in scientific applications in the following manner. Model comprises those parts of the application that deal with some type of scientific processing or manipulation of the data, e.g. numerical algorithm, simulation, DNA. View is a visualisation, or format, of the output, e.g. graphical plot, diagram, chart, data table, file. Controller is the part that ties the scientific processing and output parts together, mediating input and passing it to the model or view, e.g. command line options, mouse clicks, input files. For example, the diagram below depicts the use of MVC architecture for the DNA Guide Graphical User Interface application.
MVC Application Examples From your Work
Think of some other examples from your work or life where MVC architecture may be suitable or have a discussion with your fellow learners.
Solution
MVC architecture is a popular choice when designing web and mobile applications. Users interact with a web/mobile application by sending various requests to it. Forms to collect users' inputs/requests, together with the information returned and displayed to the user as a result, represent the View. Requests are processed by the Controller, which interacts with the Model to retrieve or update the underlying data. For example, a user may request to view their profile. The Controller retrieves the account information for the user from the Model and passes it to the View for rendering. The user may further interact with the application by asking it to update their personal information. The Controller verifies the correctness of the information (e.g. the password satisfies certain criteria, postal address and phone number are in the correct format, etc.) and passes it to the Model for permanent storage. The View is then updated accordingly and the user sees their updated profile details.
Note that not everything fits into the MVC architecture, but it is still good to think about how things could be split into smaller units. For a few more examples, have a look at this short article on MVC from Codecademy.
Separation of Concerns
Separation of concerns is important when designing software architectures in order to reduce the code’s complexity. Note, however, there are limits to everything - and MVC architecture is no exception. Controller often transcends into Model and View and a clear separation is sometimes difficult to maintain. For example, the Command Line Interface provides both the View (what user sees and how they interact with the command line) and the Controller (invoking of a command) aspects of a CLI application. In Web applications, Controller often manipulates the data (received from the Model) before displaying it to the user or passing it from the user to the Model.
Our Project’s MVC Architecture
Our software project uses the MVC architecture. The file `inflammation-analysis.py` is the Controller module that performs basic statistical analysis over patient data and provides the main entry point into the application. The View and Model modules are contained in the files `views.py` and `models.py`, respectively, and are conveniently named. Data underlying the Model is contained within the directory `data` - as we have seen already, it contains several files with patients' daily inflammation information.
We will revisit the software architecture and MVC topics once again in a later episode when we talk in more detail about software design. We now proceed to set up our virtual development environment and start working with the code using a more convenient graphical tool - the PyCharm IDE.
Key Points
Programming interfaces define how individual modules within a software application interact among themselves or how the application itself interacts with its users.
MVC is a software design architecture which divides the application into three interconnected modules: Model (data), View (user interface), and Controller (input/output and data manipulation).
The software project we use throughout this course is an example of an MVC application that manipulates patients’ inflammation data and performs basic statistical analysis using Python.
Virtual Environments For Software Development
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What are virtual environments in software development and why should you use them?
How can we manage Python virtual environments and external (third-party) libraries?
Objectives
Set up a Python virtual environment for our software project using `venv` and `pip`.
Run our software from the command line.
Introduction
So far we have checked out our software project from GitHub and inspected its contents and architecture a bit. We now want to run our code to see what it does - let's do that from the command line. For most of the course we will run our code and interact with Git from the command line. While we will develop and debug our code using the PyCharm IDE, and it is possible to use Git from PyCharm too, typing commands in the command line 'forces' you to familiarise yourself with it and learn it well. A bonus is that this knowledge is transferable to running code in other programming languages and is independent of any IDE you may use in the future.
If you have a little peek into our code (e.g. do `cat inflammation/views.py` from the project root), you will see the following two lines somewhere at the top.
from matplotlib import pyplot as plt
import numpy as np
This means that our code requires two external libraries (also called third-party packages or dependencies) - `numpy` and `matplotlib`.
Python applications often use external libraries that don’t come as part of the standard Python distribution. This means
that you will have to use a package manager tool to install them on your system.
Applications will also sometimes need a
specific version of an external library (e.g. because they require that a particular
bug has been fixed in a newer version of the library), or a specific version of Python interpreter.
This means that each Python application you work with may require a different setup and a set of dependencies so it
is important to be able to keep these configurations separate to avoid confusion between projects.
The solution for this problem is to create a self-contained virtual
environment per project, which contains a particular version of Python installation plus a number of
additional external libraries.
Virtual environments are not just a feature of Python - all modern programming languages use them to isolate code of a specific project and make it easier to develop, run, test and share code with others. In this episode, we learn how to set up a virtual environment to develop our code and manage our external dependencies.
Virtual Environments
So what exactly are virtual environments, and why use them?
A Python virtual environment is an isolated working copy of a specific version of the Python interpreter, together with specific versions of a number of external libraries installed into that environment. A virtual environment is simply a directory with a particular structure which includes links to the Python interpreter and its own set of installed libraries; it enables multiple side-by-side installations of different Python interpreters, or of different versions of the same external library, to coexist on your machine, with only one selected for each of your projects. This allows you to work on a particular project without worrying about affecting other projects on your machine.
As more external libraries are added to your Python project over time, you can add them to its specific virtual environment and avoid a great deal of confusion by having separate (smaller) virtual environments for each project rather than one huge global environment with potential package version clashes. Another big motivator for using virtual environments is that they make sharing your code with others much easier (as we will see shortly). Here are some typical scenarios where the usage of virtual environments is highly recommended (almost unavoidable):
- You have an older project that only works under Python 2. You do not have the time to migrate the project to Python 3 or it may not even be possible as some of the third party dependencies are not available under Python 3. You have to start another project under Python 3. The best way to do this on a single machine is to set up two separate Python virtual environments.
- One of your Python 3 projects is locked to use a particular older version of a third party dependency. You cannot use the latest version of the dependency as it breaks things in your project. In a separate branch of your project, you want to try and fix problems introduced by the new version of the dependency without affecting the working version of your project. You need to set up a separate virtual environment for your branch to ‘isolate’ your code while testing the new feature.
You do not have to worry too much about specific versions of external libraries that your project depends on most of the time. Virtual environments enable you to always use the latest available version without specifying it explicitly. They also enable you to use a specific older version of a package for your project, should you need to.
A Specific Python or Package Version is Only Ever Installed Once
Note that you will not have separate Python or package installations for each of your projects - they will only ever be installed once on your system but will be referenced from different virtual environments.
Managing Python Virtual Environments
There are several commonly used command line tools for managing Python virtual environments:
- `venv`, available by default from the standard Python distribution from Python 3.3+
- `virtualenv`, which needs to be installed separately but supports both Python 2.7+ and Python 3.3+
- `pipenv`, created to fix certain shortcomings of `virtualenv`
- `conda`, which comes together with the Anaconda Python distribution
While there are pros and cons for using each of the above, all will do the job of managing Python
virtual environments for you and it may be a matter of personal preference which one you go for.
In this course, we will use `venv` to create and manage our virtual environment (which is the preferred way for Python 3.3+). An upside is that `venv` virtual environments created from the command line are also recognised and picked up automatically by the PyCharm IDE, as we will see in the next episode.
Managing Python Packages
Part of managing your (virtual) working environment involves installing, updating and removing external packages on your system. The Python package manager tool `pip` is most commonly used for this - it interacts with, and obtains packages from, the central repository called the Python Package Index (PyPI). `pip` can now be used with all Python distributions (including Anaconda).
A Note on Anaconda and `conda`
Anaconda is an open source Python distribution commonly used for scientific programming - it conveniently installs Python and a number of commonly used scientific computing packages so you do not have to obtain them separately. `conda` (which comes with the Anaconda distribution) is a command line tool with dual functionality: (1) it is a package manager that helps you find Python packages from remote package repositories and install them on your system, and (2) it is also a virtual environment manager. So, if you are using the Anaconda Python distribution, you can use `conda` for both tasks instead of using `venv` and `pip`.
Many Tools for the Job
Installing and managing Python distributions, external libraries and virtual environments is, well,
complex. There is an abundance of tools for each task, each with its advantages and disadvantages, and there are different
ways to achieve the same effect (and even different ways to install the same tool!).
Note that each Python distribution comes with its own version of `pip` - and if you have several Python versions installed you have to be extra careful to use the correct `pip` to manage external packages for that Python version. `venv` and `pip` are considered the de facto standards for virtual environment and package management for Python 3.
However, the advantages of using Anaconda and conda
are that you get (most of the) packages needed for
scientific code development included with the distribution. If you are only collaborating with others who are also using
Anaconda, you may find that conda
satisfies all your needs. It is good, however, to be aware of all these tools,
and use them accordingly. As you become more familiar with them you will realise that equivalent tools work in a similar
way even though the command syntax may be different (and that there are equivalent tools for other programming languages
too to which your knowledge can be ported).
Python Environment Hell
From XKCD (Creative Commons Attribution-NonCommercial 2.5 License)
Let us have a look at how we can create and manage virtual environments from the command line using `venv` and manage packages using `pip`.
Creating a `venv` Environment
Creating a virtual environment with venv
is done by executing the following command:
$ python3 -m venv /path/to/new/virtual/environment
where `/path/to/new/virtual/environment` is a path to a directory where you want to place it - conventionally within your software project so they are co-located.
This will create the target directory for the virtual environment (and any parent directories that don’t exist already).
For our project, let’s create a virtual environment called venv
off the project root:
$ python3 -m venv venv
If you list the contents of the newly created venv
directory, you should see something like:
$ ls -l venv
total 8
drwxr-xr-x 12 alex staff 384 5 Oct 11:47 bin
drwxr-xr-x 2 alex staff 64 5 Oct 11:47 include
drwxr-xr-x 3 alex staff 96 5 Oct 11:47 lib
-rw-r--r-- 1 alex staff 90 5 Oct 11:47 pyvenv.cfg
So, running the `python3 -m venv venv` command created the target directory called `venv` containing:
- a `pyvenv.cfg` configuration file with a home key pointing to the Python installation from which the command was run,
- a `bin` subdirectory (`Scripts` on Windows) containing a symlink of the Python interpreter binary used to create the environment and the standard Python library,
- a `lib/pythonX.Y/site-packages` subdirectory (`Lib\site-packages` on Windows) to contain its own independent set of installed Python packages isolated from other projects,
- various other configuration and supporting files and subdirectories.
Naming Virtual Environments
What is a good name to use for a virtual environment? Using “venv” or “.venv” as the name for an environment and storing it within the project’s directory seems to be the recommended way - this way when you come across such a subdirectory within a software project, by convention you know it contains its virtual environment details. A slight downside is that all different virtual environments on your machine then use the same name and the current one is determined by the context of the path you are currently located in. A (non-conventional) alternative is to use your project name for the name of the virtual environment, with the downside that there is nothing to indicate that such a directory contains a virtual environment. In our case, we have settled to use “venv” since it’s not a hidden directory and will be displayed by the command line when listing directory contents - in the future, you will decide what naming convention works best for you. Here are some references for each of the naming conventions:
- The Hitchhiker’s Guide to Python notes that “venv” is the general convention used globally
- The Python Documentation indicates that “.venv” is common
- “venv” vs “.venv” discussion
Once you’ve created a virtual environment, you will need to activate it:
$ source venv/bin/activate
(venv) $
Activating the virtual environment will change your command line’s prompt to show what virtual environment you are currently using (indicated by its name in round brackets at the start of the prompt), and modify the environment so that running Python will get you the particular version of Python configured in your virtual environment.
You can verify you are using your virtual environment's version of Python by checking the path using `which`:
(venv) $ which python3
/home/alex/python-intermediate-inflammation/venv/bin/python3
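Another way to check - from within Python itself rather than via `which` - is the following short sketch, which relies on the fact that `venv` redirects `sys.prefix` to the environment directory while `sys.base_prefix` keeps pointing at the underlying installation:

```python
import sys

def in_virtualenv() -> bool:
    """Return True if this interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the original installation.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```

Run inside an activated environment this prints `True`; run with the system Python it prints `False`.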
When you’re done working on your project, you can exit the environment with:
(venv) $ deactivate
If you’ve just done the deactivate
, ensure you reactivate the environment ready for the next part:
source venv/bin/activate
(venv) $
Python Within A Virtual Environment
On Mac and Linux, within a virtual environment, python and pip will refer to the version of Python you created the environment with. If you create a virtual environment with python3 -m venv venv, python will refer to python3 and pip will refer to pip3.
On some Windows machines with Python 2 installed, python will refer to the copy of Python 2 installed outside of the virtual environment instead. You can always check which copy of Python you are using in your virtual environment with the command which python.
We continue using python3 and pip3 in this material to avoid confusion for those Windows users.
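Another quick check, from within Python itself: the standard library's sys module reports which interpreter is running, so you can confirm the virtual environment's copy of Python is the one in use. This is a minimal sketch; the path shown in the comment is only illustrative.

```python
# Show which Python interpreter is executing this code; inside an
# activated virtual environment the path points into the venv directory.
import sys

print(sys.executable)  # e.g. /home/alex/python-intermediate-inflammation/venv/bin/python3
print(sys.prefix)      # root directory of the currently active environment
```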
Note that, since our software project is being tracked by Git, the newly created virtual environment will show up in version control - we will see how to handle it using Git in one of the subsequent episodes.
Installing External Libraries in an Environment with pip
We noticed earlier that our code depends on two external libraries - numpy and matplotlib. In order for the code to run on your machine, you need to install these two dependencies into your virtual environment.
To install the latest version of a package with pip you use pip’s install command and specify the package’s name, e.g.:
(venv) $ pip3 install numpy
(venv) $ pip3 install matplotlib
or, to install multiple packages at once:
(venv) $ pip3 install numpy matplotlib
How About python3 -m pip install?
Why are we not using pip as an argument to the python3 command, in the same way we did with venv (i.e. python3 -m venv)? python3 -m pip install should be used according to the official Pip documentation; other official documentation still seems to have a mixture of usages. Core Python developer Brett Cannon offers a more detailed explanation of edge cases when the two options may produce different results and recommends python3 -m pip install. We kept the old-style command (pip3 install) as it seems more prevalent among developers at the moment - but it may be a convention that will soon change and certainly something you should consider.
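For completeness, the -m form can also be driven from inside Python: sys.executable always points at the interpreter currently running, so invoking it with -m pip guarantees pip operates on that same environment. The sketch below only queries pip's version rather than installing anything.

```python
# Run pip as a module of the exact interpreter executing this script.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # pip's version and the site-packages it manages
```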
If you run the pip3 install command on a package that is already installed, pip will notice this and do nothing.
To install a specific version of a Python package, give the package name followed by == and the version number, e.g. pip3 install numpy==1.21.1.
To specify a minimum version of a Python package, you can do pip3 install 'numpy>=1.20' (note the quotes, which stop the shell from interpreting > as output redirection).
To upgrade a package to the latest version, use pip3 install --upgrade numpy.
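You can also check which version of a package ended up installed from within Python itself, using the standard library's importlib.metadata module (Python 3.8+). This is a sketch; pip is used here only as an example of a distribution that is usually present in any virtual environment.

```python
# Query the installed version of a distribution without importing the package.
from importlib import metadata

try:
    version = metadata.version("pip")  # any installed distribution name works here
    print("pip", version)
except metadata.PackageNotFoundError:
    print("pip is not installed in this environment")
```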
To display information about a particular installed package do:
(venv) $ pip3 show numpy
Name: numpy
Version: 1.21.2
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: None
License: BSD
Location: /Users/alex/work/SSI/Carpentries/python-intermediate-inflammation/inflammation/lib/python3.9/site-packages
Requires:
Required-by: matplotlib
To list all packages installed with pip (in your current virtual environment):
(venv) $ pip3 list
Package Version
--------------- -------
cycler 0.11.0
fonttools 4.28.1
kiwisolver 1.3.2
matplotlib 3.5.0
numpy 1.21.4
packaging 21.2
Pillow 8.4.0
pip 21.1.3
pyparsing 2.4.7
python-dateutil 2.8.2
setuptools 57.0.0
setuptools-scm 6.3.2
six 1.16.0
tomli 1.2.2
To uninstall a package installed in the virtual environment, do pip3 uninstall package-name. You can also supply a list of packages to uninstall at the same time.
Exporting/Importing an Environment with pip
You are collaborating on a project with a team so, naturally, you will want to share your environment with your collaborators so they can easily ‘clone’ your software project with all of its dependencies, and everyone can replicate equivalent virtual environments on their machines. pip has a handy way of exporting, saving and sharing virtual environments.
To export your active environment, use the pip freeze command to produce a list of packages installed in the virtual environment. A common convention is to put this list in a requirements.txt file:
(venv) $ pip3 freeze > requirements.txt
(venv) $ cat requirements.txt
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
The first of the above commands will create a requirements.txt file in your current directory. The requirements.txt file can then be committed to a version control system (we will see how to do this using Git in one of the following episodes) and shipped as part of your software, shared with collaborators and/or users. They can then replicate your environment and install all the necessary packages from the project root as follows:
(venv) $ pip3 install -r requirements.txt
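If you want to sanity-check that what is installed matches the pins in a requirements.txt, a short script can compare them. This is a hedged sketch: the pin list is hard-coded for illustration (you would normally read it from the file), the second package name is deliberately fictitious, and the == pin format matches the examples above.

```python
# Compare '==' pins against what is actually installed in the environment.
from importlib import metadata

pins = ["pip==99.0", "definitely-not-installed==1.0"]  # illustrative; read from requirements.txt in practice

results = {}
for pin in pins:
    name, _, wanted = pin.partition("==")
    try:
        results[name] = metadata.version(name) == wanted  # True if the pin is satisfied
    except metadata.PackageNotFoundError:
        results[name] = None  # package missing entirely

print(results)
```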
As your project grows, you may need to update your environment for a variety of reasons. For example, one of your project’s dependencies has just released a new version (dependency version number update), you need an additional package for data analysis (adding a new dependency), or you have found a better package and no longer need the older one (adding a new and removing an old dependency). What you need to do in this case (apart from installing the new packages and removing the ones that are no longer needed from your virtual environment) is update the contents of the requirements.txt file accordingly by re-issuing the pip freeze command, then propagate the updated requirements.txt file to your collaborators via your code sharing platform (e.g. GitHub).
Official Documentation
For a full list of options and commands, consult the official venv documentation and the Installing Python Modules with pip guide. Also check out the guide “Installing packages using pip and virtual environments”.
Running Python Scripts From Command Line
Congratulations! Your environment is now activated and set up to run our inflammation-analysis.py script from the command line.
You should already be located in the root of the python-intermediate-inflammation directory (if not, please navigate to it from the command line now). To run the script, type the following command:
(venv) $ python3 inflammation-analysis.py
usage: inflammation-analysis.py [-h] infiles [infiles ...]
inflammation-analysis.py: error: the following arguments are required: infiles
In the above command, we tell the command line two things:
- to find a Python interpreter (in this case, the one that was configured via the virtual environment), and
- to use it to run our script inflammation-analysis.py, which resides in the current directory.
As we can see, the Python interpreter ran our script, which threw an error - inflammation-analysis.py: error: the following arguments are required: infiles. It looks like the script expects a list of input files to process, so this is the expected behaviour since we did not supply any.
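The error above is the characteristic output of Python's argparse module from the standard library. As an illustration only, here is a minimal sketch of how such a command-line interface is typically declared - the real inflammation-analysis.py may differ in its details, and the sample file name passed in is made up.

```python
# A minimal argparse interface that requires one or more input files,
# mirroring the usage message shown above.
import argparse

parser = argparse.ArgumentParser(prog="inflammation-analysis.py")
parser.add_argument("infiles", nargs="+",
                    help="input files containing inflammation data")

# Simulate a command-line call with one (hypothetical) file name;
# parsing an empty argument list would print the same error and exit.
args = parser.parse_args(["data/inflammation-01.csv"])
print(args.infiles)
```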
Key Points
Virtual environments keep Python versions and dependencies required by different projects separate.
A virtual environment is itself a directory structure.
Use venv to create and manage Python virtual environments.
Use pip to install and manage Python external (third-party) libraries.
pip allows you to declare all dependencies for a project in a separate file (by convention called requirements.txt) which can be shared with collaborators/users and used to replicate a virtual environment.
Use pip3 freeze > requirements.txt to take a snapshot of your project’s dependencies.
Use pip3 install -r requirements.txt to replicate someone else’s virtual environment on your machine from the requirements.txt file.
Integrated Software Development Environments
Overview
Teaching: 25 min
Exercises: 15 min
Questions
What are Integrated Development Environments (IDEs)?
What are the advantages of using IDEs for software development?
Objectives
Set up a (virtual) development environment in PyCharm
Use PyCharm to run a Python script
Introduction
As we have seen in the previous episode - even a simple software project is typically split into smaller functional units and modules which are kept in separate files and subdirectories. As your code starts to grow and becomes more complex, it will involve many different files and various external libraries. You will need an application to help you manage all the complexities of, and provide you with some useful (visual) facilities for, the software development process. Such clever and useful graphical software development applications are called Integrated Development Environments (IDEs).
Integrated Development Environments (IDEs)
An IDE normally consists of at least a source code editor, build automation tools and a debugger. The boundaries between modern IDEs and other aspects of the broader software development process are often blurred as nowadays IDEs also offer version control support, tools to construct graphical user interfaces (GUI) and web browser integration for web app development, source code inspection for dependencies and many other useful functionalities. The following is a list of the most commonly seen IDE features:
- syntax highlighting - to show the language constructs, keywords and the syntax errors with visually distinct colours and font effects
- code completion - to speed up programming by offering a set of possible (syntactically correct) code options
- code search - finding package, class, function and variable declarations, their usages and referencing
- version control support - to interact with source code repositories
- debugging - for setting breakpoints in the code editor, step-by-step execution of code and inspection of variables
IDEs are extremely useful and modern software development would be very hard without them. There are a number of IDEs available for Python development; a good overview is available from the Python Project Wiki. In addition to IDEs, there are also a number of code editors that have Python support. Code editors can be as simple as a text editor with syntax highlighting and code formatting capabilities (e.g. GNU EMACS, Vi/Vim, Atom). Most good code editors can also execute code and control a debugger, and some can also interact with a version control system. Compared to an IDE, a good dedicated code editor is usually smaller and quicker, but often less feature-rich. You will have to decide which one is the best for you - in this course we will learn how to use PyCharm, a free, open source Python IDE. Some popular alternatives include free and open source IDE Spyder and Microsoft’s free Visual Studio Code.
Using the PyCharm IDE
Let’s open our project in PyCharm now and familiarise ourselves with some commonly used features.
Opening a Software Project
If you don’t have PyCharm running yet, start it up now. You can skip the initial configuration steps, which just go through selecting a theme and other aspects. You should be presented with a dialog box that asks you what you want to do, e.g. Create New Project, Open, or Check out from Version Control.
Select Open and find the software project directory python-intermediate-inflammation you cloned earlier. This directory is now the current working directory for PyCharm, so when we run scripts from PyCharm, this is the directory they will run from.
PyCharm will show you a ‘Tip of the Day’ window which you can safely ignore and close for now. You may also get a warning ‘No Python interpreter configured for the project’ - we will deal with this shortly after we familiarise ourselves with the PyCharm environment. You will notice the IDE shows you a project/file navigator window on the left hand side, to traverse and select the files (and any subdirectories) within the working directory, and an editor window on the right. At the bottom, you would typically have a panel for version control, terminal (the command line within PyCharm) and a TODO list.
Select the inflammation-analysis.py file in the project navigator on the left so that its contents are displayed in the editor window. You may notice a warning about the missing Python interpreter at the top of the editor panel showing the inflammation-analysis.py file - this is one of the first things you will have to configure for your project before you can do any work. You may take the shortcut and click on one of the options offered above, but we want to take you through the whole process of setting up your environment in PyCharm as this is important conceptually.
Configuring a Virtual Environment in PyCharm
Before you can run the code from PyCharm, you need to explicitly specify the path to the Python interpreter on your system. The same goes for any dependencies your code may have - you need to tell PyCharm where to find them - much like we did from the command line in the previous episode. Luckily for us, we have already set up a virtual environment for our project from the command line and PyCharm is clever enough to understand it.
Adding a Python Interpreter
- Select either PyCharm > Preferences (Mac) or File > Settings (Linux, Windows).
- In the preferences window that appears, select Project: python-intermediate-inflammation > Python Interpreter from the left. You’ll see a number of Python packages displayed as a list, and importantly above that, the current Python interpreter that is being used. These may be blank or set to <No interpreter>, or possibly the default version of Python installed on your system, e.g. Python 2.7 /usr/bin/python2.7, which we do not want to use in this instance.
- Select the cog-like button in the top right, then Add Local... (or Add... depending on your PyCharm version). An Add Python Interpreter window will appear.
- Select Virtualenv from the list on the left and ensure that the Existing environment checkbox is selected within the popup window. In the Interpreter field, point to the Python 3 executable inside your virtual environment’s bin directory (make sure you navigate to it and select it from the file browser rather than just accept the default offered by PyCharm). Note that there is also an option to create a new virtual environment, but we are not using that option as we want to reuse the one we created from the command line in the previous episode.
- Select the Make available to all projects checkbox so we can also use this environment for other projects if we wish.
- Select OK in the Add Python Interpreter window. Back in the Preferences window, you should select “Python 3.9 (python-intermediate-inflammation)” or similar (that you’ve just added) from the Project Interpreter drop-down list.
Note that a number of external libraries have magically appeared under the “Python 3.9 (python-intermediate-inflammation)” interpreter, including numpy and matplotlib. PyCharm has recognised the virtual environment we created from the command line using venv and has added these libraries, effectively replicating our virtual environment in PyCharm (referred to as “Python 3.9 (python-intermediate-inflammation)”).
Also note that, although the names are not the same - this is one and the same virtual environment and changes done to it in PyCharm will propagate to the command line and vice versa. Let’s see this in action through the following exercise.
Compare External Libraries in the Command Line and PyCharm
Can you recall two places where information about our project’s dependencies can be found from the command line? Compare that information with the equivalent configuration in PyCharm.
Hint: We can use an argument to pip, or find the packages directly in a subdirectory of our virtual environment directory venv.
Solution
From the previous episode, you may remember that we can get the list of packages in the current virtual environment using the pip3 list command:
(venv) $ pip3 list
Package         Version
--------------- -------
cycler          0.11.0
fonttools       4.28.1
kiwisolver      1.3.2
matplotlib      3.5.0
numpy           1.21.4
packaging       21.2
Pillow          8.4.0
pip             21.1.3
pyparsing       2.4.7
python-dateutil 2.8.2
setuptools      57.0.0
setuptools-scm  6.3.2
six             1.16.0
tomli           1.2.2
However, pip3 list shows all the packages in the virtual environment - if we want to see only the list of packages that we installed, we can use the pip3 freeze command instead:
(venv) $ pip3 freeze
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
We see pip in the pip3 list output but not in pip3 freeze, as we did not install it using pip. Remember that we use pip3 freeze to update our requirements.txt file, to keep a list of the packages our virtual environment includes. Python will not do this automatically; we have to manually update the file when our requirements change using:
pip3 freeze > requirements.txt
If we want, we can also see the list of packages directly in the following subdirectory of venv:
(venv) $ ls -l venv/lib/python3.9/site-packages
total 1088
drwxr-xr-x 103 alex staff    3296 17 Nov 11:55 PIL
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 Pillow-8.4.0.dist-info
drwxr-xr-x   6 alex staff     192 17 Nov 11:55 __pycache__
drwxr-xr-x   5 alex staff     160 17 Nov 11:53 _distutils_hack
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 cycler-0.11.0.dist-info
-rw-r--r--   1 alex staff   14519 17 Nov 11:55 cycler.py
drwxr-xr-x  14 alex staff     448 17 Nov 11:55 dateutil
-rw-r--r--   1 alex staff     152 17 Nov 11:53 distutils-precedence.pth
drwxr-xr-x  31 alex staff     992 17 Nov 11:55 fontTools
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 fonttools-4.28.1.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 kiwisolver-1.3.2.dist-info
-rwxr-xr-x   1 alex staff  216968 17 Nov 11:55 kiwisolver.cpython-39-darwin.so
drwxr-xr-x  92 alex staff    2944 17 Nov 11:55 matplotlib
-rw-r--r--   1 alex staff     569 17 Nov 11:55 matplotlib-3.5.0-py3.9-nspkg.pth
drwxr-xr-x  20 alex staff     640 17 Nov 11:55 matplotlib-3.5.0.dist-info
drwxr-xr-x   7 alex staff     224 17 Nov 11:55 mpl_toolkits
drwxr-xr-x  39 alex staff    1248 17 Nov 11:55 numpy
drwxr-xr-x  11 alex staff     352 17 Nov 11:55 numpy-1.21.4.dist-info
drwxr-xr-x  15 alex staff     480 17 Nov 11:55 packaging
drwxr-xr-x  10 alex staff     320 17 Nov 11:55 packaging-21.2.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:53 pip
drwxr-xr-x  10 alex staff     320 17 Nov 11:53 pip-21.1.3.dist-info
drwxr-xr-x   7 alex staff     224 17 Nov 11:53 pkg_resources
-rw-r--r--   1 alex staff      90 17 Nov 11:55 pylab.py
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 pyparsing-2.4.7.dist-info
-rw-r--r--   1 alex staff  273365 17 Nov 11:55 pyparsing.py
drwxr-xr-x   9 alex staff     288 17 Nov 11:55 python_dateutil-2.8.2.dist-info
drwxr-xr-x  41 alex staff    1312 17 Nov 11:53 setuptools
drwxr-xr-x  11 alex staff     352 17 Nov 11:53 setuptools-57.0.0.dist-info
drwxr-xr-x  19 alex staff     608 17 Nov 11:55 setuptools_scm
drwxr-xr-x  10 alex staff     320 17 Nov 11:55 setuptools_scm-6.3.2.dist-info
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 six-1.16.0.dist-info
-rw-r--r--   1 alex staff   34549 17 Nov 11:55 six.py
drwxr-xr-x   8 alex staff     256 17 Nov 11:55 tomli
drwxr-xr-x   7 alex staff     224 17 Nov 11:55 tomli-1.2.2.dist-info
Finally, if you look at both the contents of venv/lib/python3.9/site-packages and requirements.txt and compare that with the packages shown in PyCharm’s Python Interpreter Configuration, you will see that they all contain equivalent information.
Adding an External Library
We have already added the packages numpy and matplotlib to our virtual environment from the command line in the previous episode, so we are up-to-date with all the external libraries we require at the moment. However, we will soon need the library pytest to implement tests for our code, so we will use this opportunity to install it from PyCharm, in order to see an alternative way of doing this and how it propagates to the command line.
- Select either PyCharm > Preferences (Mac) or File > Settings (Linux, Windows).
- In the preferences window that appears, select Project: python-intermediate-inflammation > Project Interpreter from the left.
- Select the + icon at the top of the window. In the window that appears, search for the name of the library (pytest), select it from the list, then select Install Package.
- Select OK in the Preferences window.
It may take a few minutes for PyCharm to install it. After it is done, the pytest library is added to our virtual environment. You can also verify this from the command line by listing the venv/lib/python3.9/site-packages subdirectory. Note, however, that requirements.txt is not updated - as we mentioned earlier, this is something you have to do manually. Let’s do this as an exercise.
Update requirements.txt After Adding a New Dependency
Export the newly updated virtual environment into the requirements.txt file.
Solution
Let’s verify first that the newly installed library pytest is appearing in our virtual environment but not in requirements.txt. First, let’s check the list of installed packages:
(venv) $ pip3 list
Package         Version
--------------- -------
attrs           21.4.0
cycler          0.11.0
fonttools       4.28.5
iniconfig       1.1.1
kiwisolver      1.3.2
matplotlib      3.5.1
numpy           1.22.0
packaging       21.3
Pillow          9.0.0
pip             20.0.2
pluggy          1.0.0
py              1.11.0
pyparsing       3.0.7
pytest          6.2.5
python-dateutil 2.8.2
setuptools      44.0.0
six             1.16.0
toml            0.10.2
tomli           2.0.0
We can see the pytest library appearing in the listing above. However, if we do:
(venv) $ cat requirements.txt
cycler==0.11.0
fonttools==4.28.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
tomli==1.2.2
pytest is missing from requirements.txt. To add it, we need to update the file by repeating the command:
(venv) $ pip3 freeze > requirements.txt
pytest is now present in requirements.txt:
attrs==21.2.0
cycler==0.11.0
fonttools==4.28.1
iniconfig==1.1.1
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
packaging==21.2
Pillow==8.4.0
pluggy==1.0.0
py==1.11.0
pyparsing==2.4.7
pytest==6.2.5
python-dateutil==2.8.2
setuptools-scm==6.3.2
six==1.16.0
toml==0.10.2
tomli==1.2.2
Adding a Run Configuration for Our Project
Having configured a virtual environment, we now need to tell PyCharm to use it for our project. This is done by adding a Run Configuration to a project:
- To add a new configuration for a project, select Run > Edit Configurations... from the top menu.
- Select Add new run configuration..., then Python.
- In the new popup window, in the Script path field, select the folder button and find and select inflammation-analysis.py. This tells PyCharm which script to run (i.e. what the main entry point to our application is).
- In the same window, select “Python 3.9 (python-intermediate-inflammation)” in the Python interpreter field.
- You can give this run configuration a name at the top of the window if you like - e.g. let’s name it inflammation.
- You can optionally configure run parameters and environment variables in the same window - we do not need this at the moment.
- Select Apply to confirm these settings.
Virtual Environments & Run Configurations in PyCharm
We configured the Python interpreter to use for our project by pointing PyCharm to the virtual environment we created from the command line (which also includes external libraries our code needs to run). Recall that you can create several virtual environments based on the same Python interpreter but with different external libraries - this is helpful when you need to develop different types of applications. For example, you can create one virtual environment based on Python 3.9 to develop Django Web applications and another virtual environment based on the same Python 3.9 to work with scientific libraries.
Run Configurations in PyCharm are named sets of startup properties that define what to execute and what parameters (i.e. what additional configuration options) to use on top of virtual environments. You can vary these configurations each time your code is executed, which is particularly useful for running, debugging and testing your code.
Now you know how to configure and manipulate your environment in both tools (command line and PyCharm), which is a useful parallel to be aware of. Let’s have a look at some other features afforded to us by PyCharm.
Syntax Highlighting
The first thing you may notice is that code is displayed using different colours. Syntax highlighting is a feature that displays source code terms in different colours and fonts according to the syntax category the highlighted term belongs to. It also makes syntax errors visually distinct. Highlighting does not affect the meaning of the code itself - it’s intended only for humans to make reading code and finding errors easier.
Code Completion
As you start typing code, PyCharm will offer to complete some of the code for you in the form of an auto completion popup. This is a context-aware code completion feature that speeds up the process of coding (e.g. reducing typos and other common mistakes) by offering available variable names, functions from available packages, parameters of functions, hints related to syntax errors, etc.
Code Definition & Documentation References
You will often need code reference information to help you code. PyCharm shows this useful information, such as definitions of symbols (e.g. functions, parameters, classes, fields, and methods) and documentation references by means of quick popups and inline tooltips.
For a selected piece of code, you can access various code reference information from the View menu (or via various keyboard shortcuts), including:
- Quick Definition - where and how symbols (functions, parameters, classes, fields, and methods) are defined
- Quick Type Definition - type definition of variables, fields or any other symbols
- Quick Documentation - inline documentation (docstrings) for any symbol created in accordance with PEP-257
- Parameter Info - the names of parameters in method and function calls
- Type Info - type of an expression
Code Search
You can search for a text string within a project, use different scopes to narrow your search process, use regular expressions for complex searches, include/exclude certain files from your search, find usages and occurrences. To find a search string in the whole project:
- From the main menu, select Edit | Find | Find in Path... (or Edit | Find | Find in Files... depending on your version of PyCharm).
- Type your search string in the search field of the popup. Alternatively, in the editor, highlight the string you want to find and press Command-Shift-F (on Mac) or Control-Shift-F (on Windows). PyCharm places the highlighted string into the search field of the popup. If you need to, specify the additional options in the popup. PyCharm will list the search strings and all the files that contain them.
- Check the results in the preview area of the dialog, where you can replace the search string or select another string, or press Command-Shift-F (on Mac) or Control-Shift-F (on Windows) again to start a new search.
- To see the list of occurrences in a separate panel, click the Open in Find Window button in the bottom right corner. The find panel will appear at the bottom of the main window; use this panel and its options to group the results, preview them, and work with them further.
Version Control
PyCharm supports a directory-based versioning model, which means that each project directory can be associated with a different version control system. Our project was already under Git version control and PyCharm recognised it. It is also possible to add an unversioned project directory to version control directly from PyCharm.
During this course, we will do all our version control commands from the command line but it is worth noting that PyCharm supports a comprehensive subset of Git commands (i.e. it is possible to perform a set of common Git commands from PyCharm but not all). A very useful version control feature in PyCharm is graphically comparing changes you made locally to a file with the version of the file in a repository, a different commit version or a version in a different branch - this is something that cannot be done equally well from the text-based command line.
You can get a full documentation on PyCharm’s built-in version control support online.
Running Scripts in PyCharm
We have configured our environment and explored some of the most commonly used PyCharm features and are now ready to run our script from PyCharm! To do so, right-click the inflammation-analysis.py file in the PyCharm project/file navigator on the left, and select Run 'inflammation'.
The script will run in a terminal window at the bottom of the IDE window and display something like:
/Users/alex/work/python-intermediate-inflammation/venv/bin/python /Users/alex/work/python-intermediate-inflammation/inflammation-analysis.py
usage: inflammation-analysis.py [-h] infiles [infiles ...]
inflammation-analysis.py: error: the following arguments are required: infiles
Process finished with exit code 2
This is the same error we got when running the script from the command line. We will get back to this error shortly - for now, the good thing is that we managed to set up our project for development both from the command line and PyCharm and are getting the same outputs. Before we move on to fixing errors and writing more code, let’s have a look at the last set of tools for collaborative code development which we will be using in this course - Git and GitHub.
Key Points
An IDE is an application that provides a comprehensive set of facilities for software development, including syntax highlighting, code search and completion, version control, testing and debugging.
PyCharm recognises virtual environments configured from the command line using venv and pip.
Collaborative Software Development Using Git and GitHub
Overview
Teaching: 45 min
Exercises: 0 min
Questions
What are Git branches and why are they useful for code development?
What are some best practices when developing software collaboratively using Git?
Objectives
Commit changes in a software project to a local repository and publish them in a remote repository on GitHub
Create different branches for code development
Learn to use feature branch workflow to effectively collaborate with a team on a software project
Introduction
So far we have checked out our software project from GitHub and used command line tools to configure a virtual environment for our project and run our code. We have also familiarised ourselves with PyCharm - a graphical tool we will use for code development, testing and debugging. We are now going to start using another set of tools from the collaborative code development toolbox - namely, the version control system Git and code sharing platform GitHub. These two will enable us to track changes to our code and share it with others.
You may recall that we have already made some changes to our project locally - we created a virtual environment in the venv directory and exported it to the requirements.txt file.
We should now decide which of those changes we want to check in and share with others in our team. This is a typical
software development workflow - you work locally on code, test it to make sure
it works correctly and as expected, then record your changes using version control and share your work with others
via a shared and centrally backed-up repository.
Firstly, let’s remind ourselves how to work with Git from the command line.
Git Refresher
Git is a version control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source code management in software development but it can be used to track changes in files in general - it is particularly effective for tracking text-based files (e.g. source code files, CSV, Markdown, HTML, CSS, Tex, etc. files).
Git has several important characteristics:
- support for non-linear development allowing you and your colleagues to work on different parts of a project concurrently,
- support for distributed development allowing for multiple people to be working on the same project (even the same file) at the same time,
- every change recorded by Git remains part of the project history and can be retrieved at a later date, so even if you make a mistake you can revert to a point before it.
The diagram below shows a typical software development lifecycle with Git and the commonly used commands to interact with different parts of Git infrastructure, such as:
- working directory - a directory (including any subdirectories) where your project files live and where you are currently working. It is also known as the “untracked” area of Git. Any changes to files will be marked by Git in the working directory. If you make changes to the working directory and do not explicitly tell Git to save them - you will likely lose those changes. Using the git add filename command, you tell Git to start tracking changes to the file filename within your working directory.
- staging area (index) - once you tell Git to start tracking changes to files (with the git add filename command), Git saves those changes in the staging area. Each subsequent change to the same file needs to be followed by another git add filename command to tell Git to update it in the staging area. To see what is in your working directory and staging area at any moment (i.e. what changes Git is tracking), run the command git status.
- local repository - stored within the .git directory of your project, this is where Git wraps together all your changes from the staging area and puts them using the git commit command. Each commit is a new, permanent snapshot (checkpoint, record) of your project in time, which you can share or revert back to.
- remote repository - this is a version of your project that is hosted somewhere on the Internet (e.g. on GitHub, GitLab or somewhere else). While your project is nicely version-controlled in your local repository, and you have snapshots of its versions from the past, if your machine crashes - you still may lose all your work. Working with a remote repository involves pushing your changes and pulling other people’s changes to keep your local repository in sync, in order to collaborate with others and to back up your work on a different machine.
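The four areas above map onto a small set of commands. Below is a minimal sketch you can run in a throwaway repository - the file name, commit message and identity are invented purely for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com"    # demo identity only
git config user.name "Demo User"

echo "numpy" > requirements.txt            # a change in the working directory (untracked)
git status --short                         # shows: ?? requirements.txt
git add requirements.txt                   # move the change to the staging area
git status --short                         # shows: A  requirements.txt
git commit -q -m "Add requirements file"   # record a snapshot in the local repository
git log --oneline                          # the commit is now part of the project history
# git push origin main                     # would send it to a remote repository (none configured here)
```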
Software development lifecycle with Git
From PNGWing (licenced for non-commercial reuse)
Checking-in Changes to Our Project
Let’s check-in the changes we have done to our project so far. The first thing to do upon navigating into our software project’s directory root is to check the current status of our local working directory and repository.
$ git status
On branch main
Your branch is up to date with 'origin/main'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
requirements.txt
venv/
nothing added to commit but untracked files present (use "git add" to track)
As expected, Git is telling us that we have some untracked files - requirements.txt and the directory venv - present in our working directory which we have not staged nor committed to our local repository yet. You do not want to commit the newly created venv directory and share it with others because this directory is specific to your machine and setup only (i.e. it contains local paths to libraries on your system that most likely would not work on any other machine). You do, however, want to share requirements.txt with your team as this file can be used to replicate the virtual environment on your collaborators’ systems.
To tell Git to intentionally ignore and not track certain files and directories, you need to specify them in the .gitignore text file in the project root. Our project already has a .gitignore, but in cases where you do not have one - you can simply create it yourself. In our case, we want to tell Git to ignore the venv directory (and .venv as another naming convention for virtual environments) and stop notifying us about it. Edit your .gitignore file in PyCharm and add a line containing “venv/” and another one containing “.venv/”. It does not matter much in this case where within the file you add these lines, so let’s do it at the end. Your .gitignore should look something like this:
# IDEs
.vscode/
.idea/
# Intermediate Coverage file
.coverage
# Output files
*.png
# Python runtime
*.pyc
*.egg-info
.pytest_cache
# Virtual environments
venv/
.venv/
You may notice that we are already not tracking certain files and directories, with useful comments about what exactly we are ignoring. You may also notice that each line in .gitignore is actually a pattern, so you can ignore multiple files that match it (e.g. “*.png” will ignore all PNG files in the current directory).
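If you are ever unsure why Git is (or is not) ignoring a path, git check-ignore will tell you which rule matched. A small sketch in a throwaway repository, with a .gitignore invented for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
printf '# Virtual environments\nvenv/\n.venv/\n*.png\n' > .gitignore

# -v (verbose) prints the file, line number and pattern that caused the match
git check-ignore -v venv/ plot.png
# e.g. .gitignore:2:venv/    venv/
#      .gitignore:4:*.png    plot.png
```

Note that check-ignore works on path names, so the paths do not even need to exist yet.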
If you run the git status command now, you will notice that Git has cleverly understood that you want to ignore changes to the venv folder, so it is not warning us about it any more. However, it has now detected a change to the .gitignore file that needs to be committed.
$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: .gitignore
Untracked files:
(use "git add <file>..." to include in what will be committed)
requirements.txt
no changes added to commit (use "git add" and/or "git commit -a")
To commit the changes to .gitignore and requirements.txt to the local repository, we first have to add these files to the staging area to prepare them for committing. We can do both at the same time with:
$ git add .gitignore requirements.txt
Now we can commit them to the local repository with:
$ git commit -m "Initial commit of requirements.txt. Ignoring virtual env. folder."
Remember to use meaningful messages for your commits.
So far we have been working in isolation - all the changes we have made are still only stored locally on our individual machines. In order to share our work with others - we should push our changes to the remote repository on GitHub. GitHub has recently strengthened authentication requirements for Git operations accessing GitHub from the command line over HTTPS. This means you cannot use passwords for authentication any more - you need to set up and use a personal access token for additional security before you can push your local changes to the remote repository. So, when you run the command below:
$ git push origin main
Git will prompt you to authenticate - enter your GitHub username and the previously generated access token as the password (you can also enable caching of the credentials so your machine remembers the access token). In the above command, origin is an alias for the remote repository you used when cloning the project locally (it is called that by convention and set up automatically by Git when you run the git clone remote_url command to replicate a remote repository locally); main is the name of our main (and currently only) development branch.
Account Security
When using git config --global credential.helper cache, any password or personal access token you enter will be cached for a period of time - 15 minutes by default. Re-entering a password every 15 minutes can be OK, but for a personal access token it can be inconvenient, and lead to you writing the token down elsewhere. To permanently store passwords or tokens, use store instead of cache.
Storing an access token always carries a security risk. One compromise between short cache timescales and permanent stores is to set a time-out on your personal access token when you create it, reducing the risk of it being stolen after you stop working on the project you issued it for.
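Another compromise is to lengthen the cache timeout itself. The sketch below writes to a scratch config file rather than your real --global config, so it is safe to experiment with:

```shell
set -e
cfg=$(mktemp)
# Equivalent to: git config --global credential.helper 'cache --timeout=3600'
# i.e. cache entered credentials for one hour instead of the default 15 minutes.
git config --file "$cfg" credential.helper 'cache --timeout=3600'
git config --file "$cfg" credential.helper
```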
Git Remotes
Note that systems like Git allow us to synchronise work between any two or more copies of the same repository - the ones that are not located on your machine are “Git remotes” for you. In practice, though, it is easiest to agree with your collaborators to use one copy as a central hub (such as GitHub or GitLab), where everyone pushes their changes to. This also avoids risks associated with keeping the “central copy” on someone’s laptop. You can have more than one remote configured for your local repository, each of which generally is either read-only or read/write for you. Collaborating with others involves managing these remote repositories and pushing and pulling information to and from them when you need to share work.
Git - distributed version control system
From W3Docs (freely available)
Git Branches
When we do git status, Git also tells us that we are currently on the main branch of the project. A branch is one version of your project (the files in your repository) that can contain its own set of commits. We can create a new branch, make changes to the code which we then commit to the branch, and, once we are happy with those changes, merge them back to the main branch. To see what other branches are available, do:
$ git branch
* main
At the moment, there’s only one branch (main) and hence only one version of the code available. When you create a Git repository for the first time, by default you only get one version (i.e. branch) - main. Let’s have a look at why having different branches might be useful.
Feature Branch Software Development Workflow
While it is technically OK to commit your changes directly to the main branch, and you may often find yourself doing so for some minor changes, the best practice is to use a new branch for each separate and self-contained unit/piece of work you want to add to the project. This unit of work is also often called a feature and the branch where you develop it is called a feature branch. Each feature branch should have its own meaningful name indicating its purpose (e.g. “issue23-fix”). If we keep making changes and pushing them directly to the main branch on GitHub, then anyone who downloads our software from there will get all of our work in progress - whether or not it’s ready to use! So, working on a separate branch for each feature you are adding is good for several reasons:
- it enables the main branch to remain stable while you and the team explore and test the new code on a feature branch,
- it enables you to keep the untested and not-yet-functional feature branch code under version control and backed up,
- you and other team members may work on several features at the same time independently from one another,
- if you decide that the feature is not working or is no longer needed - you can easily and safely discard that branch without affecting the rest of the code.
Branches are commonly used as part of a feature-branch workflow, shown in diagram below.
Git feature branches
From Git Tutorial by sillevl (Creative Commons Attribution 4.0 International License)
In the software development workflow, we typically have a main branch which is the version of the code that is tested, stable and reliable. Then, we normally have a development branch (called develop or dev by convention) that we use for work-in-progress code. As we work on adding new features to the code, we create new feature branches that first get merged into develop after a thorough testing process. After even more testing - the develop branch will get merged into main.
The points when feature branches are merged into develop, and develop into main, depend entirely on the practice/strategy established in the team. For example, for smaller projects (e.g. if you are working alone on a project or in a very small team), feature branches sometimes get merged directly into main upon testing, skipping the develop branch step. In other projects, the merge into main happens only at the point of making a new software release. Whichever is the case for you, a good rule of thumb is - nothing that is broken should be in main.
Creating Branches
Let’s create a develop branch to work on:
$ git branch develop
This command does not give any output, but if we run git branch again, without giving it a new branch name, we can see the list of branches we have - including the new one we have just made.
$ git branch
develop
* main
The * indicates the currently active branch. So how do we switch to our new branch? We use the git checkout command with the name of the branch:
$ git checkout develop
Switched to branch 'develop'
Create and Switch to Branch Shortcut
A shortcut to create a new branch and immediately switch to it:
$ git checkout -b develop
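On Git 2.23 and later, git switch offers a clearer spelling of the same shortcut. A sketch in a throwaway repository (the empty initial commit is just to give the repository some history for the demo):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com" && git config user.name "Demo User"
git commit -q --allow-empty -m "Initial commit"

git switch -c develop        # same effect as: git checkout -b develop
git branch --show-current    # prints: develop
```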
Updating Branches
If we start updating files now, the modifications will happen on the develop branch and will not affect the version of the code in main. We add and commit things to the develop branch in the same way as we do to main.
Let’s make a small modification to inflammation/models.py in PyCharm, and, say, change the spelling of “2d” to “2D” in the docstrings for the functions daily_mean(), daily_max() and daily_min().
If we do:
$ git status
On branch develop
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: inflammation/models.py
no changes added to commit (use "git add" and/or "git commit -a")
Git is telling us that we are on branch develop and which tracked files have been modified in our working directory. We can now add and commit the changes in the usual way.
$ git add inflammation/models.py
$ git commit -m "Spelling fix"
Currently Active Branch
Remember, the add and commit commands always act on the currently active branch. You have to be careful and aware of which branch you are working with at any given moment. git status can help with that, and you will find yourself invoking it very often.
Pushing New Branch Remotely
We push the contents of the develop branch to GitHub in the same way as we pushed the main branch. However, as we have just created this branch locally, it still does not exist in our remote repository. You can check that in GitHub by listing all branches.
To push a new local branch remotely for the first time, you could use the -u switch and the name of the branch you are creating and pushing to:
$ git push -u origin develop
Git Push With -u Switch
Using the -u switch with the git push command is a handy shortcut for: (1) creating the new remote branch and (2) setting your local branch to automatically track the remote one at the same time. You need to use the -u switch only once to set up that association between your branch and the remote one explicitly. After that you could simply use git push without specifying the remote repository, if you wished. We still prefer to explicitly state this information in commands.
Let’s confirm that the new branch develop now exists remotely on GitHub too. From the < > Code tab in your repository in GitHub, click the branch dropdown menu (currently showing the default branch main). You should see your develop branch in the list too.
Now the others can check out the develop branch too and continue to develop code on it. After the initial push of the new branch, each subsequent push is done in the usual manner (i.e. without the -u switch):
$ git push origin develop
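You can rehearse this whole push workflow offline by using a local bare repository as a stand-in for GitHub. All names below are invented for the demo:

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"          # stand-in for the remote on GitHub
git clone -q "$work/origin.git" "$work/project"
cd "$work/project"
git config user.email "you@example.com" && git config user.name "Demo User"

git checkout -q -b main
git commit -q --allow-empty -m "Initial commit"
git push -q -u origin main

git checkout -q -b develop
git commit -q --allow-empty -m "Start develop branch"
git push -q -u origin develop     # first push: create the remote branch and track it
git ls-remote --heads origin      # both main and develop now exist on the "remote"
git push -q origin develop        # subsequent pushes need no -u
```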
Merging Into Main Branch
Once you have tested your changes on the develop branch, you will want to merge them onto the main branch. To do so, make sure you have all your changes committed and switch to main:
$ git checkout main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
To merge the develop branch on top of main, do:
$ git merge develop
Updating 05e1ffb..be60389
Fast-forward
inflammation/models.py | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
If there are no conflicts, Git will merge the branches without complaining and replay all commits from develop on top of the last commit from main. If there are merge conflicts (e.g. a team collaborator modified the same portion of the same file you are working on and checked in their changes before you), the particular files with conflicts will be marked and you will need to resolve those conflicts and commit the changes before attempting to merge again.
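A conflict is easy to provoke (and resolve) in a throwaway repository; everything below - file, branches, wording - is invented purely for the demo:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "you@example.com" && git config user.name "Demo User"
git checkout -q -b main

echo "a 2d array" > notes.txt
git add notes.txt && git commit -q -m "Initial notes"

git checkout -q -b develop
echo "a 2D array" > notes.txt
git commit -q -am "Capitalise 2D"

git checkout -q main
echo "a 2-d array" > notes.txt
git commit -q -am "Hyphenate 2-d"

git merge develop || true      # both branches changed the same line: CONFLICT
cat notes.txt                  # shows <<<<<<< HEAD / ======= / >>>>>>> conflict markers

echo "a 2D array" > notes.txt  # resolve by keeping the develop wording
git add notes.txt
git commit -q -m "Merge branch 'develop', resolving conflict in notes.txt"
```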
Since we have no conflicts, we can now push the main branch to the remote repository:
git push origin main
All Branches Are Equal
In Git, all branches are equal - there is nothing special about the main branch. It is called that by convention and is created by default, but it can also be called something else. A good example is the gh-pages branch, which is the main branch for website projects hosted on GitHub (rather than main, which can be safely deleted for such projects).
Keeping Main Branch Stable
Good software development practice is to keep the main branch stable while you and the team develop and test new functionalities on feature branches (which can be done in parallel and independently by different team members). The next step is to merge feature branches onto the develop branch, where more testing can occur to verify that the new features work well with the rest of the code (and not just in isolation). We talk more about different types of code testing in one of the following episodes.
Key Points
A branch is one version of your project that can contain its own set of commits.
Feature branches enable us to develop / explore / test new code features without affecting the stable main code.
Part 2: Improving and Managing Software Over its Lifetime
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What should we do to enable software reuse, encourage external feedback, and act on it?
Objectives
Apply theoretical and practical skills learnt so far within a team environment.
Prepare and release software for reuse and manage and act on feedback to improve it.
So far in this course we’ve focused on learning technical practices, tools, and infrastructure that help the development of software in a team environment, but in an individual setting. In this section of the course we look at how to improve the reusability of our software for others as well as ourselves, the importance of critical reflection, and what we need to take into account when sharing our code with others, in the context of working as a team. We’ll also be making use of skills learnt previously in the course.
The focus in this section will also move beyond software development to management: management of how the outside world interacts with and makes use of our software, how others can interact with ourselves to report issues, and the ways we can successfully manage software improvement in response to feedback.
In this section we will:
- Look at how to prepare our software for release, looking at what we actually mean by software reusability, the importance of good documentation, as well as what to consider when choosing an open source licence.
- Explore ways for us to track issues with our software registered by ourselves and external users, and how we should employ a critical mindset when reviewing software for reuse.
- Examine how we can manage the improvement of our software through feedback using agile management techniques. We’ll employ effort estimation of development tasks as a foundational tool for prioritising future team work, and use the MoSCoW approach and software development sprints to manage improvement. As we will see, it is very difficult to prioritise work effectively without knowing both its relative importance to others as well as the effort required to deliver those work items.
Key Points
For software to succeed it needs to be managed as well as developed.
Estimating the effort to deliver work items is a foundational tool for prioritising that work.
Preparing Software for Reuse
Overview
Teaching: 35 min
Exercises: 20 minQuestions
What can we do to make our programs reusable by others?
How should we document and license our code?
Objectives
Describe the different levels of software reusability
Use code linting tools to verify a program’s adherence to a Python coding style
Explain why documentation is important
Describe the minimum components of software documentation to aid reuse
Create a repository README file to guide others to successfully reuse a program
Understand other documentation components and where they are useful
Describe the basic types of open source software licence
Explain the importance of conforming to data policy and regulation
Prioritise and work on improvements for release as a team
Introduction
In previous episodes we’ve looked at skills, practices, and tools to help us design and develop software in a collaborative environment. In this lesson we’ll be looking at a critical piece of the development puzzle that builds on what we’ve learnt so far - sharing our software with others.
The Levels of Software Reusability - Good Practice Revisited
Let’s begin by taking a closer look at software reusability and what we want from it.
Firstly, whilst we want to ensure our software is reusable by others, as well as ourselves, we should be clear what we mean by ‘reusable’. There are a number of definitions out there, but a helpful one written by Benureau and Rougier in 2017 offers the following levels by which software can be characterised:
- Re-runnable: the code is simply executable and can be run again (but there are no guarantees beyond that)
- Repeatable: the software will produce the same result more than once
- Reproducible: published research results generated from the same version of the software can be generated again from the same input data
- Reusable: easy to use, understand, and modify
- Replicable: the software can act as an available reference for any ambiguity in the algorithmic descriptions made in the published article. That is, a new implementation can be created from the descriptions in the article that provide the same results as the original implementation, and that the original - or reference - implementation, can be used to clarify any ambiguity in those descriptions for the purposes of reimplementation
Later levels imply the earlier ones. So what should we aim for? As researchers who develop software - or developers who write research software - we should be aiming for at least the fourth one: reusability. Reproducibility is required if we are to successfully claim that what we are doing when we write software fits within acceptable scientific practice, but it is also crucial that we write software that can be understood and, ideally, modified by others, so they can verify that it follows the published algorithms. Where ‘others’, of course, can include a future version of ourselves.
Documenting Code to Improve Reusability
Reproducibility is a cornerstone of science, and scientists who work in many disciplines are expected to document the processes by which they’ve conducted their research so it can be reproduced by others. In medicinal, pharmacological, and similar research fields for example, researchers use logbooks which are then used to write up protocols and methods for publication.
Many things we’ve covered so far contribute directly to making our software reproducible - and indeed reusable - by others. A key part of this we’ll cover now is software documentation, which is ironically very often given short shrift in academia. This is often the case even in fields where the documentation and publication of research method is otherwise taken very seriously.
A few reasons for this are that writing documentation is often considered:
- A low priority compared to actual research (if it’s even considered at all)
- Expensive in terms of effort, with little reward
- Boring to write!
Code commenting is a very useful form of documentation for understanding our code, and is most effective when used to explain complex interfaces or behaviour, or the reasoning behind why something is coded a certain way. But code comments only go so far.
Whilst it’s certainly arguable that writing documentation isn’t as exciting as writing code, it doesn’t have to be expensive and brings many benefits. In addition to enabling general reproducibility by others, documentation…
- Helps bring new staff researchers and developers up to speed quickly with using the software
- Functions as a great aid to research collaborations involving software, where those from other teams need to use it
- When well written, can act as a basis for detailing algorithms and other mechanisms in research papers, such that the software’s functionality can be replicated and re-implemented elsewhere
- Provides a descriptive link back to the science that underlies it. As a reference, it makes it far easier to know how to update the software as the scientific theory changes (and potentially vice versa)
- Importantly, it can enable others to understand the software sufficiently to modify and reuse it to do different things
In the next section we’ll see that writing a sensible minimum set of documentation in a single document doesn’t have to be expensive, and can greatly aid reproducibility.
Writing a README
A README file is the first piece of documentation (perhaps other than publications that refer to it) that people should read to acquaint themselves with the software. It concisely explains what the software is about and what it’s for, and covers the steps necessary to obtain and install the software and use it to accomplish basic tasks. Think of it not as a comprehensive reference of all functionality, but more a short tutorial with links to further information - hence it should contain brief explanations and be focused on instructional steps.
Our repository already has a README that describes the purpose of the repository for this workshop, but let’s replace it with a new one that describes the software itself. First let’s delete the old one:
$ rm README.md
In the root of your repository create a replacement README.md file. The .md extension indicates this is a Markdown file - a lightweight markup language that is basically plain text with some extra syntax for formatting. A big advantage of Markdown files is that they can be read as plain text or rendered with formatting, and they are very quick to write. GitHub provides a very useful [guide to writing markdown][github-markdown] for its repositories.
Let’s start writing it.
# Inflam
So here, we’re giving our software a name. Ideally something unique, short, snappy, and perhaps to some degree an indicator of what it does. We would ideally rename the repository to reflect the new name, but let’s leave that for now. In Markdown, # designates a heading, ## a subheading, and so on. The Software Sustainability Institute [guide on naming projects][ssi-choosing-name] and products provides some helpful pointers.
We should also add a short description.
Inflam is a data management system written in Python that manages trial data used in clinical inflammation studies.
To give readers an idea of the software’s capabilities, let’s add some key features next:
## Main features
Here are some key features of Inflam:
- Provide basic statistical analyses over clinical trial data
- Ability to work on trial data in Comma-Separated Value (CSV) format
- Generate plots of trial data
- Analytical functions and views can be easily extended based on its Model-View-Controller architecture
As well as knowing what the software aims to do and its key features, it’s very important to specify what other software and related dependencies are needed to use the software (typically called dependencies or prerequisites):
## Prerequisites
Inflam requires the following Python packages:
- [NumPy](https://www.numpy.org/) - makes use of NumPy's statistical functions
- [Matplotlib](https://matplotlib.org/stable/index.html) - uses Matplotlib to generate statistical plots
The following optional packages are required to run Inflam's unit tests:
- [pytest](https://docs.pytest.org/en/stable/) - Inflam's unit tests are written using pytest
- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
Here we’re making use of markdown links, with some text describing the link within [] followed by the link itself within ().
One really neat feature - and a common practice - when using many CI infrastructures is that we can include the status of recent test runs within our README file. Just below the # Inflam title in our README.md file, add the following (replacing <your_github_username> with your own):
# Inflam

This will embed a badge at the top of our page that reflects the most recent GitHub Actions build status of our repository, essentially showing whether the tests that were run when the last change was made to the main branch succeeded or failed.
That’s got us started, but there are other aspects we should also cover:
- Installation/deployment: step-by-step instructions for setting up the software so it can be used
- Basic usage: step-by-step instructions that cover using the software to accomplish basic tasks
- Contributing: for those wishing to contribute to the software’s development, this is an opportunity to detail what kinds of contribution are sought and how to get involved
- Contact information/getting help: which may include things like key author email addresses, and links to mailing lists and other resources
- Credits/Acknowledgements: where appropriate, be sure to credit those who have helped in the software’s development or inspired it
- Citation: particularly for academic software, it’s a very good idea to specify a reference to an appropriate academic publication so other academics can cite use of the software in their own publications and media. You can do this within a separate CITATION text file within the repository’s root directory and link to it from the markdown
- Licence: a short description of and link to the software’s licence
For more verbose sections, there are usually just highlights in the README with links to further information, which may be held within other markdown files within the repository or elsewhere.
We’ll finish these off later. See Matias Singer’s curated list of awesome READMEs for inspiration.
Other Documentation
There are many different types of other documentation you should also consider writing and making available that’s beyond the scope of this course. The key is to consider which audiences you need to write for, e.g. end users, developers, maintainers, etc., and what they need from the documentation. There’s a Software Sustainability Institute blog post on best practices for research software documentation that helpfully covers the kinds of documentation to consider and other effective ways to convey the same information.
One that you should always consider is technical documentation. This typically aims to help other developers understand your code sufficiently well to make their own changes to it, which could include other members in your team (and as we said before, also a future version of yourself). This may include documentation that covers the software’s architecture, including the different components and how they fit together, API (Application Programmer Interface) documentation that describes the interface points designed into your software for other developers to use, e.g. for a software library, or technical tutorials/’how tos’ to accomplish developer-oriented tasks.
Choosing an Open Source Licence
Software licensing can be a whole topic in itself, so we’ll just summarise here. Your institution’s Intellectual Property (IP) team will be able to offer specific guidance that fits the way your institution thinks about software.
In IP law, software is considered a creative work of literature, so any code you write automatically has copyright protection applied. This copyright will usually belong to the institution that employs you, but this may be different for PhD students. If you need to check, look at your employment / studentship contract or talk to your university’s IP team.
Since software is automatically under copyright, without a licence no one may:
- Copy it
- Distribute it
- Modify it
- Extend it
- Use it (actually unclear at present - this has not been properly tested in court yet)
Fundamentally there are two kinds of licence, Open Source licences and Proprietary licences, which serve slightly different purposes:
- Proprietary licences are designed to pass on limited rights to end users, and are most suitable if you want to commercialise your software. They tend to be customised to suit the requirements of the software and the institution to which it belongs - again, your institution’s IP team will be able to help here.
- Open Source licences are designed more to protect the rights of end users - they specifically grant permission to make modifications and redistribute the software to others. The website Choose A License provides recommendations and a simple summary of some of the most common open source licences.
Within the open source licences, there are two categories, copyleft and permissive:
- The permissive licences such as MIT and the multiple variants of the BSD licence are designed to give maximum freedom to the end users of software. These licences allow the end user to do almost anything with the source code.
- The copyleft licences, such as the GPL, still give a lot of freedom to the end users, but any code that they write based on GPL-licensed code must also be released under the same licence. This gives the developer assurance that anyone building on their code is also contributing back to the community. It's actually a little more complicated than this, and the variants all have slightly different conditions and applicability, but this is the core of the licence.
Which of these types of licence you prefer is up to you and those you develop code with. If you want more information, or help choosing a licence, the Choose An Open-Source Licence or tl;dr Legal sites can help.
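Once you have chosen a licence, it helps to record it in your project's metadata as well as in a `LICENSE` file at the repository root, so packaging tools and users can find it. Below is a sketch of what this might look like in a Python project's `pyproject.toml`; the project name and choice of MIT are purely illustrative:

```toml
# Illustrative fragment of a pyproject.toml recording a chosen licence.
# The full licence text should also live in a LICENSE file at the
# repository root.
[project]
name = "inflammation-analysis"
license = {text = "MIT"}
```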
Preparing for Release
In a (hopefully) highly unlikely and thoroughly unrecommended scenario, your project leader has informed you of the need to release your software within the next half hour, so it can be assessed for use by another team. You’ll need to consider finishing the README, choosing a licence, and fixing any remaining problems you are aware of in your codebase. Ensure you prioritise and work on the most pressing issues first!
Time: 20 mins
Merging into main
Once you've done these updates, commit your changes, and if you're doing this work on a feature branch also ensure you merge it into `develop`, e.g.:
$ git checkout develop
$ git merge my-feature-branch
Finally, once we've fully tested our software and are confident it works as expected on `develop`, we can merge our `develop` branch into `main`:
$ git checkout main
$ git merge develop
$ git push
Tagging a Release in GitHub
There are many ways in which Git and GitHub can help us make a software release from our code. One of these is via tagging, where we attach a human-readable label to a specific commit. Let’s see what tags we currently have in our repository:
$ git tag
Since we haven’t tagged any commits yet, there’s unsurprisingly no output. We can create a new tag on the last commit we did by doing:
$ git tag -a v1.0.0 -m "Version 1.0.0"
So we can now do:
$ git tag
v1.0.0
And also, for more information:
$ git show v1.0.0
You should see something like this:
tag v1.0.0
Tagger: <Name> <email>
Date: Fri Dec 10 10:22:36 2021 +0000
Version 1.0.0
commit 2df4bfcbfc1429c12f92cecba751fb2d7c1a4e28 (HEAD -> main, tag: v1.0.0, origin/main, origin/develop, origin/HEAD, develop)
Author: <Name> <email>
Date: Fri Dec 10 10:21:24 2021 +0000
Finalising README.
diff --git a/README.md b/README.md
index 4818abb..5b8e7fd 100644
--- a/README.md
+++ b/README.md
@@ -22,4 +22,33 @@ Flimflam requires the following Python packages:
The following optional packages are required to run Flimflam's unit tests:
- [pytest](https://docs.pytest.org/en/stable/) - Flimflam's unit tests are written using pytest
-- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
\ No newline at end of file
+- [pytest-cov](https://pypi.org/project/pytest-cov/) - Adds test coverage stats to unit testing
+
+## Installation
+- Clone the repo ``git clone repo``
+- Install via ``pip install -e .``
+- Check everything runs by running ``pytest`` in the root directory
+- Hurray 😊
+
+## Contributing
+- Create an issue [here](https://github.com/Onoddil/python-intermediate-inflammation/issues)
+ - What works, what doesn't? You tell me
+- Randomly edit some code and see if it improves things, then submit a [pull request](https://github.com/Onoddil/python-intermediate-inflammation/pulls)
+- Just yell at me while I edit the code, pair programmer style!
+
+## Getting Help
+- Nice try
+
+## Credits
+- Directed by Michael Bay
+
+## Citation
+Please cite [J. F. W. Herschel, 1829, MmRAS, 3, 177](https://ui.adsabs.harvard.edu/abs/1829MmRAS...3..177H/abstract) if you used this work in your day-to-day life.
+Please cite [C. Herschel, 1787, RSPT, 77, 1](https://ui.adsabs.harvard.edu/abs/1787RSPT...77....1H/abstract) if you actually use this for scientific work.
+
+## License
+This source code is protected under international copyright law. All rights
+reserved and protected by the copyright holders.
+This file is confidential and only available to authorized individuals with the
+permission of the copyright holders. If you encounter this file and do not have
+permission, please contact the copyright holders and delete this file.
\ No newline at end of file
Now that we've added a tag, we need this reflected in our GitHub repository. You can push this tag to your remote by doing:
$ git push origin v1.0.0
What is a Version Number Anyway?
Software version numbers are everywhere, and there are many different schemes for assigning them. A popular one to consider is Semantic Versioning, where a given version number uses the format MAJOR.MINOR.PATCH. You increment the:
- MAJOR version when you make incompatible API changes
- MINOR version when you add functionality in a backwards compatible manner
- PATCH version when you make backwards compatible bug fixes
You can also add a hyphen followed by characters to denote a pre-release version, e.g. 1.0.0-alpha1 (the first alpha release of version 1.0.0) or 1.2.3-beta4 (the fourth beta release of version 1.2.3).
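The ordering rules above can be sketched in a few lines of Python. This is a simplified illustration only - for real projects you would use a dedicated library (such as `packaging`) rather than hand-rolling version comparison - but it captures the key rule that a pre-release sorts *before* its corresponding final release:

```python
# Simplified sketch of comparing Semantic Versioning strings.
# Not a full implementation of the SemVer spec - for illustration only.

def parse_semver(version):
    """Split a 'MAJOR.MINOR.PATCH[-prerelease]' string into a comparable tuple."""
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    # A pre-release sorts *before* the corresponding final release,
    # so insert a flag: 0 for pre-releases, 1 for final releases.
    return (major, minor, patch, 0 if prerelease else 1, prerelease)

print(parse_semver("1.0.0-alpha1") < parse_semver("1.0.0"))  # True
print(parse_semver("1.0.0") < parse_semver("1.1.0"))         # True
```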
We can now use the more memorable tag to refer to this specific commit. Plus, once we've pushed this back up to GitHub, it appears as a specific release within our code repository which can be downloaded in compressed `.zip` or `.tar.gz` formats. Note that these downloads just contain the state of the repository at that commit, and not its entire history.
Using features like tagging allows us to highlight commits that are particularly important, which is very useful for reproducibility purposes. We can (and should) refer to specific commits in academic papers that make use of results from the software, but tagging with a specific version number makes that just a little bit easier for humans.
Conforming to Data Policy and Regulation
We may also wish to make data available to either be used with the software or as generated results. This may be via GitHub or some other means. An important aspect to remember with sharing data on such systems is that they may reside in other countries, and we must be careful depending on the nature of the data.
We need to ensure that we are still conforming to the relevant policies and guidelines regarding how we manage research data, which may include funding council, institutional, national, and even international policies and laws. Within Europe, for example, there's the need to conform to regulations such as GDPR. It's a very good idea to make yourself aware of these aspects.
Key Points
The reuse battle is won before it is fought. Select and use good practices consistently throughout development and not just at the end.
Assessing Software for Suitability and Improvement
Overview
Teaching: 15 min
Exercises: 30 min
Questions
What makes good code actually good?
What should we look for when selecting software to reuse?
Objectives
Explain why a critical mindset is important when selecting software
Register a new issue with our code on our repository
Describe some different types of issues we can have with software
Conduct an assessment of software against suitability criteria
Describe what should be included in software issue reports and register them
Introduction
What we’ve been looking at so far enables us to adopt a more proactive and diligent attitude when developing our own software. But we should also adopt this attitude when selecting and making use of third-party software we wish to use. With pressing deadlines it’s very easy to reach for a piece of software that appears to do what you want without considering properly whether it’s a good fit for your project first. A chain is only as strong as its weakest link, and our software may inherit weaknesses in any dependent software or create other problems.
Overall, when adopting software to use it’s important to consider not only whether it has the functionality you want, but a broader range of qualities that are important for your project.
Using Issues to Record Problems With Software
As a piece of software is used, bugs and other issues will inevitably come to light - nothing is perfect! If you work on your code with collaborators, or have non-developer users, it can be helpful to have a single shared record of all the problems people have found with the code, not only to keep track of them for you to work on later, but to avoid the annoyance of people emailing you to report a bug that you already know about!
GitHub provides a framework (as does GitLab!) for managing bug reports, feature requests, and lists of future work - Issues.
Go back to the home page for your `python-intermediate-inflammation` repository, and click on the Issues tab.
You should see a page listing the open issues on your repository, currently none.
Let's go through the process of creating a new issue. Start by clicking the `New issue` button.
When you create an issue, you can add a range of details to it. Issues can be assigned to a specific developer, for example - this can be a helpful way to know who, if anyone, is currently working to fix an issue (or a way to assign responsibility to someone to deal with it!).
They can also be assigned a label. The labels available for issues can be customised, and given a colour, allowing you to see at a glance from the Issues page the state of your code. The default labels include:
- Bug
- Documentation
- Enhancement
- Help Wanted
- Question
The Enhancement label can be used to create issues that request new features, or if they are created by a developer, indicate planned new features. As well as highlighting problems, the Bug label can make code much more usable by allowing users to find out if anyone has had the same problem before, and also how to fix (or work around) it on their end. Enabling users to solve their own problems can save you a lot of time and stress!
In general, a good bug report should contain only one bug, specific details of the environment in which the issue appeared (operating system or browser, version of the software and its dependencies), and sufficiently clear and concise steps that allow a developer to reproduce the bug themselves. They should also be clear on what the bug reporter considers factual (“I did this and this happened”) and speculation (“I think it was caused by this”). If an error report was generated from the software itself, it’s a very good idea to include that in the bug report.
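On GitHub, you can nudge bug reporters towards this structure by adding an issue template to your repository (GitHub reads templates from the `.github/ISSUE_TEMPLATE` directory). The sketch below shows one possible layout; the exact fields are illustrative and should be adapted to your project:

```markdown
## Describe the bug
A clear and concise description of what the bug is.

## Environment
- OS: [e.g. Ubuntu 20.04]
- Python version: [e.g. 3.9.7]
- Software version: [e.g. v1.0.0]

## To reproduce
Steps to reproduce the behaviour:
1. ...

## Expected behaviour
What you expected to happen (facts first; put any speculation
about the cause in a separate section).

## Error output
Paste any error message or traceback produced by the software.
```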
The Enhancement label is a great way to communicate your future priorities to your collaborators, and also your future self - it's far too easy to leave a software project for a few months to write a paper, then come back and find you've forgotten the improvements you were going to make. If you have other users for your code, they can use the label to request new features, or changes to the way the code operates. It's generally worth paying attention to these suggestions, especially if you spend more time developing than running the code. It can be very easy to end up with quirky behaviour because of off-the-cuff choices made during development. Extra pairs of eyes can point out ways the code can be made more accessible, and the easier a code is to use, the more widely it will be adopted and the greater its impact will be.
Wontfix
One interesting label is Wontfix, which indicates that an issue simply won’t be worked on for whatever reason. Maybe the bug it reports is outside of the use case of the software, or the feature it requests simply isn’t a priority.
The Lock issue and Pin issue buttons allow you to block future comments on an issue, and pin it to the top of the issues page. This can make it clear you’ve thought about an issue and dismissed it!
Having open, publicly-visible lists of the limitations and problems with your code is incredibly helpful. Even if some issues end up languishing unfixed for years, letting users know about them can save them a huge amount of work attempting to fix what turns out to be an unfixable problem on their end. It can also help you see at a glance what state your code is in, making it easier to prioritise future work!
Our First Issue!
Thinking back to the previous exercise on what makes good code, think with a critical eye of an aspect of the code you have developed so far that needs improvement. It could be a bug, for example, or a documentation issue with your README, or an enhancement. Enter the details of the issue with a suitable label and select `Submit new issue`.
Time: 5 mins
Mentions
As lots of bugs will have similar roots, GitHub lets you reference one issue from another. Whilst writing the description of an issue, or commenting on one, if you type # you should see a list of the issues and pull requests on the repository. They are coloured green if they’re open, or white if they’re closed. Continue typing the issue number, and the list will narrow, then you can hit Return to select the entry and link the two. You can also navigate the list with the ↑ and ↓ arrow keys.
If you realise that several of your bugs have common roots, or that one Enhancement can’t be implemented before you’ve finished another, you can use the mention system to indicate which. This is a simple way to add much more information to your issues.
You can also use the mention system to link GitHub accounts. Instead of #, typing @ will bring up a list of accounts linked to the repository. Users will receive notifications when somebody else references them which you can use to notify people when you want to check a detail with them, or let them know something has been fixed (much easier than writing out all the same information again in an email!).
You Are A User
This section focuses a lot on how issues can help communicate the current state of the code to others. As a sole developer, and possibly also the only user of the code too, you might be tempted to not bother with recording issues and features as you don’t need to communicate the information to anyone else.
Unfortunately, human memory isn’t infallible! After spending six months writing your thesis, or a year working on a different sub-topic, it’s inevitable you’ll forget some of the plans you had and problems you faced. Not documenting these things can lead to you having to re-learn things you already put the effort into discovering before.
Assessing Software for Suitability
Decide on your Group’s Repository!
You all have your code repositories you have been working on throughout the course so far. For the upcoming exercise, groups will exchange repositories and review the code of the repository they inherit, and provide feedback.
Time: 5 mins
- Decide as a team on one of your repositories that will represent your group. You can do this any way you wish.
- Add the URL of the repository to the section in the Google Doc labelled ‘Decide on your Group’s Repository’ for this day, next to your team name in the empty table cell
Conduct Assessment on Third-Party Software
The scenario: It is envisaged that a piece of software developed by another team will be adopted and used for the long term in a number of future projects. You have been tasked with conducting an assessment of this software to identify any issues that need resolving prior to working with it, and will provide feedback to the developing team to fix these issues.
Time: 20 mins
- As a team, briefly decide who will assess which aspect of the repository, e.g. its docs, tests, codebase, etc.
- Obtain the URL for the repository you will assess from the Google Doc, in the section labelled ‘Decide on your Group’s Repository’ - see the last column which indicates from which team you should get their repository URL
- Conduct the assessment and register any issues you find on the other team’s software repository
- Be meticulous in your assessment and register as many issues as you can!
Supporting Your Software - How and How Much?
Within your collaborations and projects, what should you do to support other users? Here are some key aspects to consider:
- Provide contact information: so users know what to do and how to get in contact if they run into problems
- Manage your support: an issue tracker - like the one in GitHub - is essential to track and manage issues
- Manage expectations: let users know the level of support you offer, in terms of when they can expect responses to queries, the scope of support (e.g. which platforms, types of releases, etc.), the types of support (e.g. bug resolution, helping develop tailored solutions), and expectations for support in the future (e.g. when project funding runs out)
All of this requires effort, and you can’t do everything. It’s therefore important to agree and be clear on how the software will be supported from the outset, whether it’s within the context of a single laboratory, project, or other collaboration, or across an entire community.
Key Points
It's as important to have a critical attitude when adopting software as it is when developing it.
We should use issues to keep track of software problems and other requests for change - even if we are the only developer and user.
As a team, agree on who will support the software you make available to others, and to what extent.
Software Improvement Through Feedback
Overview
Teaching: 5 min
Exercises: 45 min
Questions
How should we handle feedback on our software?
How, and to what extent, should we provide support to our users?
Objectives
Prioritise and work on externally registered issues
Respond to submitted issue reports and provide feedback
Explain the importance of software support and choosing a suitable level of support
Introduction
When a software project has been around for even just a short amount of time, you’ll likely discover many aspects that can be improved. These can come from issues that have been registered via collaborators or users, but also those you’re aware of internally, which should also be registered as issues. When starting a new software project, you’ll also have to determine how you’ll handle all the requirements. But which ones should you work on first, which are the most important and why, and how should you organise all this work?
Software has a fundamental role to play in doing science, but unfortunately software development is often given short shrift in academia when it comes to prioritising effort. There are also many other draws on our time in addition to the research, development, and writing of publications that we do, which makes it all the more important to prioritise our time for development effectively.
In this lesson we’ll be looking at prioritising work we need to do and what we can use from the agile perspective of project management to help us do this in our software projects.
Estimation as a Foundation for Prioritisation
For simplicity, we’ll refer to our issues as requirements, since that’s essentially what they are - new requirements for our software to fulfil.
But before we can prioritise our requirements, there are some things we need to find out.
Firstly, we need to know:
- The period of time we have to resolve these requirements - e.g. before the next software release, pivotal demonstration, or other deadlines requiring their completion. This is known as a timebox. This might be a week or two, but for agile, this should not be longer than a month. Longer deadlines with more complex requirements may be split into a number of timeboxes.
- How much overall effort we have available - i.e. who will be involved and how much of their time we will have during this period
We also need estimates for how long each requirement will take to resolve, since we cannot meaningfully prioritise requirements without knowing what the effort tradeoffs will be. Even if we know how important each requirement is, how would we even know if completing the project is possible? Or if we don’t know how long it will take to deliver those requirements we deem to be critical to the success of a project, how can we know if we can include other less important ones?
Ideally, estimation should be done by the people likely to do the actual work (i.e. the Research Software Engineers, researchers, or developers), although this is often not the reality. It shouldn't be done by project managers or PIs, who are not best placed to estimate; moreover, those doing the work are the ones effectively committing to these figures.
Why is it so Difficult to Estimate?
Estimation is a very valuable skill to learn, and one that is often difficult. Lack of experience in estimation can play a part, but a number of psychological causes can also contribute. One of these is the Dunning-Kruger effect, a type of cognitive bias in which people tend to overestimate their abilities; in opposition to this is imposter syndrome, where due to a lack of confidence people underestimate their abilities. The key message here is to be honest about what you can do, and to find out as much information as is reasonably appropriate before arriving at an estimate.
More experience in estimation will also help to reduce these effects. So keep estimating!
An effective way of helping to make your estimates more accurate is to do it as a team. Other members can ask prudent questions that may not have been considered, and bring in other sanity checks and their own development experience. Just talking things through can help uncover other complexities and pitfalls, and raise crucial questions to clarify ambiguities.
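When collecting anonymous estimates from a team (as in the exercise below), it can help to look at both a central figure and the spread before discussing. The numbers in this sketch are made up for illustration:

```python
# Simple sketch of combining anonymous team estimates (in minutes).
# A wide spread between the lowest and highest estimate is a cue to
# discuss assumptions before settling on a final figure.
from statistics import median

estimates = [15, 20, 20, 60]  # one member sees a complication the others missed

print(f"median: {median(estimates)} min")                # a starting point for discussion
print(f"spread: {max(estimates) - min(estimates)} min")  # large -> talk it through
```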
Estimate!
As a team go through the issues that your partner team has registered with your software repository, and quickly estimate how long each issue will take to resolve in minutes. Do this by blind consensus first, each anonymously submitting an estimate, and then briefly discuss your rationale and decide on a final estimate. Make sure these are honest estimates, and you are able to complete them in the allotted time!
Time: 15 mins
Using MoSCoW to Prioritise Work
Now we have our estimates, we can decide how important each requirement is to the success of the project. This should be decided by the project stakeholders: those (or their representatives) who have a stake in the success of the project and either affect or are affected by it, e.g. Principal Investigators, researchers, Research Software Engineers, collaborators, etc.
To prioritise these requirements we can use a method called MoSCoW, a way to reach a common understanding with stakeholders on the importance of successfully delivering each requirement for a timebox. MoSCoW is an acronym that stands for Must have, Should have, Could have, and Won’t have. Each requirement is discussed by the stakeholder group and falls into one of these categories:
- Must Have (MH) - these requirements are critical to the current timebox for it to succeed. Even the inability to deliver just one of these would cause the project to be considered a failure.
- Should Have (SH) - these are important requirements but not necessary for delivery in the timebox. They may be as important as Must Haves, but there may be other ways to achieve them or perhaps they can be held back for a future development timebox.
- Could Have (CH) - these are desirable but not necessary, and each of these will be included in this timebox if it can be achieved.
- Won’t Have (WH) - these are agreed to be out of scope for this timebox, perhaps because they are the least important or not critical for this phase of development.
In typical use, the ratio of requirements to aim for across the MH/SH/CH categories is 60%/20%/20%. Importantly, the division is by the requirement estimates, not by the number of requirements, so 60% means 60% of the overall estimated effort goes to Must Haves.
Why is this important? Because it gives you a unique degree of control over your project. It awards you 40% flexibility in allocating your effort depending on what's critical and how things progress. This effectively forces a tradeoff between the effort available and critical objectives, maintaining a significant safety margin. The idea is that as a project progresses, even if it becomes clear that you are only able to deliver the Must Haves, you have still delivered a successful project.
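Because the 60/20/20 split is measured by estimated effort rather than by issue count, it's worth actually adding up the estimates per category. The short sketch below illustrates that calculation; the issue titles and estimates are made up for illustration:

```python
# Rough sketch of checking a MoSCoW split by *estimated effort*,
# not by number of requirements. Issues below are illustrative.

issues = [
    # (title, category, estimated effort in minutes)
    ("Fix off-by-one in patient data", "MH", 30),
    ("Add missing docstrings",         "MH", 60),
    ("Improve README install steps",   "SH", 20),
    ("Add CSV export",                 "CH", 30),
]

def effort_shares(issues):
    """Return each category's share of the total estimated effort."""
    total = sum(effort for _, _, effort in issues)
    shares = {}
    for _, category, effort in issues:
        shares[category] = shares.get(category, 0) + effort
    return {category: effort / total for category, effort in shares.items()}

shares = effort_shares(issues)
for category in ("MH", "SH", "CH"):
    print(f"{category}: {shares.get(category, 0):.0%}")
```

Here the Must Haves come out at 64% of the estimated effort, slightly over the 60% target, so you might discuss demoting one of them to a Should Have.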
Once we've decided on those we'll work on (i.e. not Won't Haves), we can optionally assign them to a GitHub issue milestone to organise them. A milestone is a collection of issues to be worked on in a given period (or timebox). We can create a new one by selecting `Issues` on our repository, then `Milestones` to display any existing milestones, then `New milestone`. We add in a title, a completion date (i.e. the end of this timebox), and any description for the milestone. Once created, we can view our issues and assign them to our milestone from the `Issues` page.
Prioritise!
Put your stakeholder hats on, and as a team apply MoSCoW to the repository issues to determine how you will prioritise effort to resolve them in the allotted time. Try to stick to the 60/20/20 rule, and assign all issues you’ll be working on (i.e. not Won’t Haves) to a new milestone, e.g. version 1.1
Time: 10 mins
Using Sprints to Organise and Work on Requirements
A sprint is an activity applied to a timebox, where development is undertaken on the agreed prioritised work for the period. In a typical sprint, there are daily meetings called scrum meetings which check on how work is progressing and serve to highlight any blockers and challenges to meeting the sprint goal.
Conduct a Mini-Mini-Sprint
For the remaining time in this lesson, assign repository issues to team members and work on resolving them as per your MoSCoW breakdown. Once an issue has been resolved, notable progress made, or an impasse has been reached, provide concise feedback on the repository issue. Be sure to add the other team members to the chosen repository so they have access to it. You can grant `Write` access to others on a GitHub repository via the `Settings` tab for a repository, then selecting `Manage access`, where you can invite other GitHub users to your repository with specific permissions.
Time: however long is left
Depending on how many issues were registered on your repository, it's likely you won't have resolved all of them in this first milestone. Of course, in reality, a sprint would run over a much longer period of time. In any event, as development progresses into future sprints, any unresolved issues can be reconsidered and prioritised for another milestone, and so on. This process of receiving new requirements, prioritising them, and working on them is naturally continuous. The benefit is that at key stages you repeatedly re-evaluate what is important and needs to be worked on, which helps to ensure real, concrete progress against project goals and requirements - which, particularly in academia, may change over time.
Key Points
Prioritisation is a key tool in academia where research goals can change and software development is often given short shrift.
In order to prioritise things to do we must first estimate the effort required to do them.
For accurate effort estimation, it should be done by the people who will actually do the work.
Aim to reduce cognitive biases in effort estimation by being honest about your abilities.
Ask other team members - or do estimation as a team - to help make accurate estimates.
MoSCoW is a useful tool for prioritising work to help ensure projects deliver successfully.
Aim for a 60%/20%/20% ratio of Must Haves/Should Haves/Could Haves for project requirements.
Survey