Version Control with Git

What is Version Control

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • What is version control and why should I use it?

Objectives
  • Understand the benefits of an automated version control system.

  • Understand the basics of how automated version control systems work.

What is Version Control?

Version control (or VC for short) can also be called revision control or source control. The simplest description is that VC is a tool that tracks changes to files. It’s like turning on “Track Changes” in Word or Google Docs, but for code. So why would you want to do that?

1. A More Efficient Backup

Manual version control

We’ve all been in this situation before - having multiple nearly-identical versions of the same file with no meaningful explanation of what the differences are, just incremental changes in filename (thesis.doc, thesis_final.doc, thesis_final2.doc…).

If we’re just dealing with text documents, some word processors let us deal with this a little better, like Microsoft Word’s “Track Changes” or Google Docs’ version history. However, research isn’t just Word docs, it’s code and data and diagrams too, and a single paper or project can involve a whole constellation of files, all of which need backing up!

Using version control means we don’t keep dozens of different versions of our files hanging about taking up space, and when we store a revision, we store an explanation of what changed.

2. Reproducibility

When you use version control, at any point in the future, you can retrieve the correct versions of your documents, scripts or code. So, for example, a year after publication, you can get hold of the precise combination of scripts and data that you used to assemble a paper.

Version control makes reproducibility simpler. Without using version control it’s very hard to say that your research is truly reproducible…

3. To Aid Collaboration

As well as maintaining a revision history, VC tools also help multiple authors collaborate on the same file or set of files.

Professional software developers use VC to work in large teams and to keep track of what they’ve done. If you know what changes have been made to each file, you can easily combine multiple people’s changes to a single file. You can also track down where and when (and by who!) bugs in the code were introduced.

Every large software development project relies on VC, and most programmers use it for their small jobs as well.

VC is not just for software: papers, small data sets - anything that changes over time or needs to be shared can, and probably should, be stored in a version control system.

We’ll look at both the backup and collaboration scenarios, but first it’s useful to understand what’s going on under the hood.

How do Version Control Tools Work?

Changes are tracked sequentially

Version control systems start by storing the base version of the file that you save and then store just the changes you made at each step on the way. You can think of it like storing Lego bricks and the instructions for putting them together - if you start with the first piece, then add each other in turn, you end up with your final document.

Different versions can be saved

Once you think of changes as separate from the document itself, you can then think about taking the same document and adding different changes to it, getting different versions of the document. For example, two users can make independent sets of changes based on the same document.

Multiple versions can be merged

If there aren’t conflicts, you can even try to combine two different sets of changes together onto the same base document, a process called merging.
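As a rough illustration of these three ideas in Git itself, the sketch below builds a base document, lets two authors change different parts of it independently, and then merges both sets of changes. The file name, the branch names (alice and bob) and the identity are placeholders invented for this example, and the branching commands go beyond what this lesson has introduced so far, so treat it as a preview rather than something to memorise.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init --quiet
git config user.name "Demo User"            # placeholder identity, this demo only
git config user.email "demo@example.com"

# The base version of the document.
printf 'Title\n\nIntroduction\n\nMethods\n\nResults\n\nConclusion\n' > paper.txt
git add paper.txt
git commit --quiet -m "Add base document"
base=$(git rev-parse --abbrev-ref HEAD)     # default branch name (main or master)

# First author: changes the title on their own branch.
git checkout --quiet -b alice
printf 'A Better Title\n\nIntroduction\n\nMethods\n\nResults\n\nConclusion\n' > paper.txt
git commit --quiet -am "Improve title"

# Second author: starts from the same base and changes the conclusion.
git checkout --quiet "$base"
git checkout --quiet -b bob
printf 'Title\n\nIntroduction\n\nMethods\n\nResults\n\nA Stronger Conclusion\n' > paper.txt
git commit --quiet -am "Strengthen conclusion"

# Merge both sets of changes back onto the base document.
git checkout --quiet "$base"
git merge --quiet -m "Merge title change" alice
git merge --quiet -m "Merge conclusion change" bob
cat paper.txt
```

Because the two authors touched different parts of the file, the merged paper.txt should contain both the new title and the new conclusion.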

Version Control Alternatives

Git is overwhelmingly the most popular version control system in academia, and beyond. It’s a distributed version control system, where every developer in a team has their own full copy of a repository, and can synchronise between them.

It’s partly become such a success thanks to sites like GitHub and GitLab, which make it easy to collaborate on a Git repository, and provide all kinds of extra tools to manage software projects. Plus, GitHub offers free upgraded membership to academics, students and educators - you can apply here.

If you’re working on old projects, or ones with very specific needs, you might use Mercurial, another distributed system, or possibly Subversion, a centralised system where there’s a single copy of the repository that everyone connects to.

Because Git is so popular, and making a GitHub account is so easy, we’re going to teach you how to use them.

Graphical User Interfaces

We’re going to teach you how to use Git on the command line, as it’s the same on every single platform (Mac, Linux & Windows) - and it’s the only way to use it on high-performance clusters like Iridis. This isn’t the only way to use it, however. There are many different graphical user interfaces for Git, like:

  • SourceTree
  • GitKraken
  • GitHub Desktop

Fundamentally, though, these are all just ‘wrappers’ around the command line version of Git. If you understand what they’re doing under the hood, you can easily switch between versions. You can, for example, manage your code on Iridis using command-line git and GitHub Desktop on your desktop workstation.

Git GUI Integrations

Most code editors and Integrated Development Environments (or IDEs) integrate Git into their UI, so you can easily see the state of your files and work with your repository. Examples include:

  • VS Code
  • PyCharm & CLion
  • RStudio/Posit

Others include MATLAB, Atom, Sublime Text and Notepad++. The only common IDE with poor Git support is Spyder!

Key Points

  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.


Setting Up Git

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How do I get set up to use Git?

  • How do I set up my account on GitHub?

Objectives
  • Configure git the first time it is used on a computer

  • Understand the meaning of the --global configuration flag

  • Add an SSH key to a GitHub account

Prerequisites

In this lesson we use Git from the Bash Shell. Some previous experience with the shell is expected, but isn’t mandatory.

Get Started

The slides for this material are located here.

Linux and Mac users should open a terminal; Windows users should go to the Start Menu and open Git Bash from the Git group.

We’ll start by exploring how version control can be used to keep track of what one person did and when.

Setting Up Git

The first time we use Git on a new machine, we need to configure it. We’re going to set some global options, so when Git starts tracking changes to files it records who made them and how to contact them.

$ git config --global user.name "Firstname Surname"
$ git config --global user.email "fsurname@university.ac.uk"

(Please use your own name and the email address you used to sign up to GitHub!)

We’re going to set Nano, a simple, minimal command-line text editor to be the default for when you need to edit messages.

$ git config --global core.editor "nano -w"

If you’re already comfortable with another command-line editor, feel free to select that!

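Git commands are written git action, where action is what we actually want it to do. In this case, we’re telling Git our name and email address, which text editor we prefer, and that we want these settings to apply globally.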

The three commands above only need to be run once: the flag --global tells Git to use the settings for every project on this machine.

You can check your settings at any time:

$ git config --list
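To see what --global changes, it can help to compare it with a setting made without the flag. The sketch below works in a throwaway repository and only writes a repository-level setting, so it should leave your real global configuration alone; the email address is a made-up placeholder.

```shell
# A scratch repository to experiment in.
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Without --global, the setting is written to this repository's .git/config only.
git config user.email "project-address@example.org"   # hypothetical address

# --local reads this repository's settings; --global reads the per-user ones.
git config --local --get user.email
git config --global --get user.email || echo "(no global email set)"

# Inside this repository, the local value takes priority over the global one.
git config --get user.email
```

This is occasionally useful in practice, for example if one project should record a different email address from the rest of your work.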

Git Help and Manual

If you forget a git command, you can access the list of commands by using -h and access the Git manual by using --help:

$ git config -h
$ git config --help

While viewing the manual, remember the : is a prompt waiting for commands and you can press Q to exit the manual.

Setting Up GitHub

In order to make sure all our work is backed up online, as well as making it easy to share with collaborators, we’re going to link our version control content to GitHub. You’ll need to create an account there. As your GitHub username will appear in the URLs of your projects there, it’s best to use a short, clear version of your name if you can.

Other Platforms

There are other repository hosting sites like GitHub - Southampton has its own instance of GitLab that’s only accessible to Southampton user accounts. We’ll use GitHub today, as it’s the easiest one to use if you want to share your code with collaborators from outside the University - getting them access to the Southampton GitLab can be a pain! Both GitHub and GitLab have the same features, though some menu names will be different!

Creating an SSH Key

We’ll need to set up SSH access to GitHub from your computer. This is how GitHub checks your identity when you try to access it - and is more secure than a password. To set up SSH access, we generate a pair of keys - one public, one private. We want to add the public key to GitHub, whilst the private one stays on our computer.

More Detail

There are full guides in the GitHub documentation for how to Make an SSH Key and Add an SSH key. We’re going to simplify them for today.

If you already have your own SSH key, feel free to skip to Add an SSH Key.

We can run a simple command to generate a new SSH key. It’ll ask you for some settings, but you should just hit enter to use the defaults for everything:

$ ssh-keygen -t ed25519
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/smangham/.ssh/id_ed25519): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in id_ed25519
Your public key has been saved in id_ed25519.pub
The key fingerprint is:
SHA256:tm2lRVXqWdkwiu+fvOF8WxRaf6peAqZHKSaWDO8jjjs user-name@computer-name
The key's randomart image is:
+--[ED25519 256]--+
|              +..|
|           . o +o|
|    .     . o .+o|
|     + .   + .ooo|
|      * S = +.o +|
|     o + B *   o.|
|    . o o = o + .|
|  Eo . . o   O oo|
|  oo.      .o B+.|
+----[SHA256]-----+
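If you would rather not answer the prompts, the same key pair can be generated in one line. This is only a sketch: it writes to a scratch directory rather than ~/.ssh, and the file name id_ed25519_demo and the comment laptop-demo-key are placeholders chosen for this example.

```shell
# Write the demo key pair to a scratch directory, not ~/.ssh.
keydir=$(mktemp -d)

# -C sets a comment, -f the output file, -N "" an empty passphrase,
# and -q hides the banner and randomart.
ssh-keygen -q -t ed25519 -C "laptop-demo-key" -f "$keydir/id_ed25519_demo" -N ""

# Two files are produced: the private key and the matching .pub public key.
ls "$keydir"

# Print the fingerprint of the public key, which identifies it without
# revealing the private half.
ssh-keygen -l -f "$keydir/id_ed25519_demo.pub"
```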

Add an SSH Key

Now we’ve generated a key, we can add this to GitHub and register the key there. First, visit GitHub, and make sure you’ve signed in to your account. Once you’re signed in, go to GitHub > Settings > SSH and GPG keys > Add new, and you should see this:

Add New SSH Key

We need to fill in the details. Give the key a title like “Laptop SSH key”, and then paste your public key into the key box - we can find it in our ~/.ssh folder:

$ ls ~/.ssh
id_ed25519  id_ed25519.pub  known_hosts

You want to copy the contents of the .pub file, which you can display with:

$ cat ~/.ssh/id_ed25519.pub
ssh-ed25519 <SNIPPED FOR SECURITY> user-name@computer-name

Make sure you copy the .pub file and not the private key! Your private key lives on your machine and is never shared with anyone else. Then click Add key, and you’re done!

Checkpoint

Before moving on, make sure you’ve:

  • Set your Git config settings.
  • Registered your SSH key on GitHub.

Key Points

  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • GitHub needs an SSH key to allow access


Creating a Repository

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I create a version control repository?

  • Where does Git store information?

Objectives
  • Create a repository from a template.

  • Clone and use a Git repository.

  • Describe the purpose of the .git directory.

Creating a Repository

Now let’s create a new repository for us to work on.

For convenience, we’re going to work with some pre-existing template code that’s already stored in a repository. The first thing we need to do is create our own copy of that template, which we can do on GitHub.

Go to our template repository and select Use this template:

Use Template

We should get prompted to give details for what we’d like our copy of the template to be called. As this demo code is for analysing climate data, we’ll name our copy of it climate-analysis. We also want it to be public, so anyone can see and copy our code:

Repository Details

Public or Private?

GitHub will allow you to create private repositories, so only people you specify can access the code, but it’s always best to keep your code public - especially if you’re going to use it in a paper! Code that generates or analyses data is a fundamental part of your method, and if you don’t include your full method in papers your work can’t be reproduced, and reproducibility is key to the scientific process. Always keep your repositories public unless you’ve got a strong reason, like embargoes imposed by industrial partners.

A major advantage of this is if you leave academia, or you switch institution and forget to update the email on your GitHub account before you lose your old one, your work won’t be lost forever!

After a brief wait, GitHub will have created a remote repository - a copy of the files and their history stored on GitHub’s servers.

Cloning the Repository

Next, we’ll download a copy of the repository to our local machine, using the SSH key we registered earlier:

$ git clone git@github.com:yourname/climate-analysis.git

After you enter the git clone command, you should see:

Cloning into 'climate-analysis'...
The authenticity of host 'github.com (140.82.121.4)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
ECDSA key fingerprint is MD5:7b:99:81:1e:4c:91:a5:0d:5a:2e:2e:80:13:3f:24:ca.
Are you sure you want to continue connecting (yes/no)? yes

Then, when you’re prompted, continue the connection with yes and it will finish downloading:

remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 0), reused 3 (delta 0), pack-reused 0
Receiving objects: 100% (4/4), done.

Now, if we use ls to list the contents of the directory, we should see we have a new directory, called climate-analysis. This is a local repository containing the code from our remote repository. It’s linked up automatically - making it easy for us to download updates to the remote repository, or to send our changes back up to it.
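You can ask Git what a clone is linked to with git remote -v. The sketch below avoids needing a GitHub account or network access by cloning from a local stand-in repository instead; the directory name template and the one-line Python file are invented for the example.

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the repository on GitHub: a local repository with one commit.
git init --quiet template
git -C template config user.name "Demo User"          # placeholder identity
git -C template config user.email "demo@example.com"
echo "print('hello')" > template/climate_analysis.py
git -C template add climate_analysis.py
git -C template commit --quiet -m "Initial commit"

# Clone it, just as we would clone from a GitHub address.
git clone --quiet template climate-analysis
cd climate-analysis

# The clone remembers where it came from under the name "origin".
git remote -v
```

In your real climate-analysis directory, the same command should list the git@github.com address you cloned from.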

Other Ways To Clone

You can also clone a repository using HTTPS, like:

$ git clone https://github.com/yourname/yourrepo

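However, GitHub no longer accepts your account password over HTTPS. To send updates back to GitHub from a repository cloned this way, you’d need to set up a personal access token rather than the SSH key we registered earlier, so we’ll stick with SSH.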

If you cloned a repository using HTTPS and want to switch it to SSH, you can use:

$ git remote set-url origin git@github.com:yourname/yourrepo
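A quick way to confirm the switch worked is to print the remote's address before and after. This sketch uses a scratch repository and the same placeholder address as above; recording or changing an address does not contact GitHub, so it runs offline.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Pretend this repository was cloned over HTTPS.
git remote add origin https://github.com/yourname/yourrepo
git remote get-url origin

# Point the same remote at the SSH address instead.
git remote set-url origin git@github.com:yourname/yourrepo
git remote get-url origin
```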

Creating Repositories Locally

We’ve shown you how to create a repository on GitHub then download it via git clone, but you don’t have to do it that way.

If you want, you can create a repository locally by entering any directory and using git init. This turns any directory into a git repository, one stored entirely locally on your computer. After you’ve used git init to turn a directory into a repository, you can use the other commands we introduce in this section to add files to it.

We still want to make sure our local repository is linked to a remote repository on GitHub though! To do that, you can make an empty repository on GitHub and name it. Once you’ve got that, you can then connect your local repository to it using git remote add origin git@github.com:yourname/repositoryname.

git remote add tells your local repository to link up to a remote one, and origin git@github.com:yourname/repositoryname tells it that the remote is at git@github.com:yourname/repositoryname, and can be referred to as origin. You can link a local repository to many remote repositories if you want, but by convention the main one is called origin.
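Put together, the local-first route might look like the sketch below. The addresses are placeholders (nothing is contacted until you push or fetch), and the second remote, called backup here, is purely hypothetical and only included to show that each remote needs its own name.

```shell
project=$(mktemp -d)
cd "$project"

# Turn an ordinary directory into a repository and make a first commit.
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "# My Project" > README.md
git add README.md
git commit --quiet -m "Add readme"

# Link it to a remote: the name comes first, then the address.
git remote add origin git@github.com:yourname/repositoryname

# A second remote must have a different name.
git remote add backup git@gitlab.example.org:yourname/repositoryname

git remote -v
```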

Exploring a Repository

Now, let’s change to our code directory and look at the files we just downloaded.

$ cd ~/climate-analysis
$ ls
climate_analysis.py  temp_conversion.py

These are some Python files for analysing climate data - you’ll recognise them if you’ve done some of our earlier lessons. Don’t worry, you don’t need to know Python to follow along.

You’ll notice that even though this directory is a version control repository, nothing actually looks special about it. But, if we add the -a flag to show everything, we can see that there’s a hidden directory called .git:

$ ls -a
.  ..  climate_analysis.py  .git  temp_conversion.py

Git stores information about the project in here. If we ever delete it, we will lose the project’s history.
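If you are curious, you can peek inside without changing anything. A minimal sketch in a scratch repository (the exact files listed may vary a little between Git versions):

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# The hidden directory only shows up with -a.
ls -a

# Inside it are Git's own files: settings, the HEAD pointer, stored objects, and so on.
ls .git

# Git can also report where the current repository's data lives.
git rev-parse --git-dir
```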

Check Status

We can check that everything is set up correctly by asking Git to tell us the status of our project with the status command:

$ git status
# On branch main
nothing to commit, working tree clean

A branch is an independent line of development. We have only one, and the default name is main.

Our local repository is connected to a remote repository (called origin by default), and is currently up-to-date; we haven’t made any changes to the code yet.

Git works on commits - snapshots of the current state of the repository. “nothing to commit, working tree clean” means that the directory currently looks exactly the same as the last snapshot we took of it, with no changes or edits.

Branch names

In this workshop, we have a default branch called main. In older versions of Git, if you create a new repository on the command line, it’ll have a default branch called master, and a lot of examples online will show master instead of main. Don’t worry - branches work the same, regardless of what they’re called!
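If a repository you created locally ends up on master and you would like it to match the examples here, the branch can be renamed. A small sketch in a scratch repository, with a placeholder file and identity:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "notes" > notes.txt
git add notes.txt
git commit --quiet -m "First commit"

# Show the current branch: main or master, depending on Git's version and settings.
git rev-parse --abbrev-ref HEAD

# Rename the current branch to main (this also succeeds if it is already called main).
git branch -M main
git rev-parse --abbrev-ref HEAD
```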

Checkpoint

Before moving on, make sure you’ve:

  • Registered your SSH key on GitHub.
  • Cloned your repository to your local machine.

Key Points

  • git clone creates a local copy of a repository from a URL.

  • Git stores all of its repository data in the .git directory.


Tracking Changes

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How do I track the changes I make to files using Git?

Objectives
  • Go through the modify-add-commit cycle for one or more files.

  • Describe where changes are stored at each stage in the modify-add-commit cycle.

Tracking Changes

We’ve got a repository now containing a few pre-existing files - so let’s add one more. You might remember seeing GitHub suggest we added a README.md to let people know what our code is about, so let’s do that on the command line. We’ll use the text editor nano, as:

$ nano README.md

Then type an example description:

# Climate Analysis Toolkit

This is a set of python scripts designed to analyse climate datafiles.

We can save our file using Control-O (Control and O at the same time), then Enter, and quit out of nano using Control-X. Our description is a bit brief, but it’s enough for now! Let’s try git status again:

$ git status
# On branch main
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	README.md
nothing added to commit but untracked files present (use "git add" to track)

Now, whilst our current snapshot of the repository is up-to-date, we’ve added a new file that we’re not tracking yet. We can tell Git to track the file we’ve just created using git add:

$ git add README.md

and then check that the right thing happened:

$ git status
# On branch main
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	new file:   README.md
#

Git now knows that it’s supposed to keep track of README.md, just like climate_analysis.py and temp_conversion.py, but it hasn’t recorded that as a commit yet. We don’t yet have a snapshot of the repository with all the existing files and README.md.

Initial Commit

To get it to do that, we need to run one more command:

$ git commit -m "Added a basic readme file."

We use the -m flag (for “message”) to record a short, descriptive comment that will help us remember later on what we did and why.

If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured at the start) so that we can write a longer message.

Good commit messages start with a brief (<50 characters) summary of changes made in the commit, NOT “Bug Fixes” or “Changes”!

If you want to go into more detail, add a blank line between the summary line and your additional notes.

[main fa90884] Added a basic readme file.
 1 file changed, 3 insertions(+)
 create mode 100644 README.md

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a revision and its short identifier is fa90884. (Your revision will have a different identifier.)
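One way to write a summary line plus more detailed notes without opening an editor is to pass -m twice; Git treats each one as a separate paragraph. The sketch below tries this in a scratch repository and then prints the commit's short and full identifiers. The message text and identity are just examples.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "# Climate Analysis Toolkit" > README.md
git add README.md

# The first -m is the summary line; the second becomes the body,
# separated from the summary by a blank line.
git commit --quiet -m "Add a basic readme file" \
                   -m "Describes what the scripts are for so new users know where to start."

# Show the full message, then the short and full identifiers of the commit.
git log -1 --format=%B
git log -1 --format='%h %H'
```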

If we run git status now:

$ git status
# On branch main
# Your branch is ahead of 'origin/main' by 1 commit.
#   (use "git push" to publish your local commits)
#
nothing to commit, working directory clean

it tells us our local repository is up-to-date, although now we have edits to it that the remote version of it doesn’t (we’ll get to that later!).

Add and Commit

Git has a special staging area where it keeps track of things that have been added to the current change set but not yet committed. git add puts things in this area, and git commit then copies them to long-term storage (as a commit).

What’s the Point of the Staging Area?

Why do we have this two-stage process, where we add files to the staging area, then create a commit from them?

Among other reasons, it allows you to easily bundle together a lot of changes in one go. If you changed the name of a variable used in multiple files (e.g. from t to temperature), you’d need to change it in all your files in one go in order for it to make sense. If you stored a copy of each file one-by-one you’d end up with a lot of versions of the code that didn’t work - variables with different names everywhere. The staging area lets you bundle together all those small changes that don’t work in isolation into one big change that’s coherent.

Git does give you shortcuts to reduce add -> commit to a single step, but when you’re starting out it’s always better to make sure you know what’s going in to each commit!
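The variable-renaming scenario can be sketched as follows. The file names (analysis.py, plotting.py, scratch.txt) are invented for this example; the point is that only the two files that belong together are staged, so the unfinished file stays out of the commit.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "t = 1" > analysis.py
echo "t = 2" > plotting.py
git add analysis.py plotting.py
git commit --quiet -m "Add scripts"

# Rename the variable in both files, and also start an unrelated, unfinished note.
echo "temperature = 1" > analysis.py
echo "temperature = 2" > plotting.py
echo "half-finished idea" > scratch.txt

# Stage only the two files that belong together, then check what is staged.
git add analysis.py plotting.py
git status --short

# The commit contains the rename in both files; scratch.txt is left untracked.
git commit --quiet -m "Rename t to temperature"
git status --short
```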

Review the Log

If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

$ git log
commit fa90884ca03dcefb97e415a374ac1aacaaa94c91 (HEAD -> main)
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 16 15:22:29 2022 +0000

    Added a basic readme file.

commit 499b6d18b36a25d3f5ab9be1b708ea48fef1dd65 (origin/main, origin/HEAD)
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 16 14:19:13 2022 +0000

    Initial commit

git log lists all revisions committed to a repository in reverse chronological order (most recent at the top).

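The listing for each revision includes the revision’s full identifier (which starts with the same characters as the short identifier printed by git commit earlier), its author, when it was created, and the commit message.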

Compatibility Notice

If you don’t see information on the remote branches, try git log --decorate. This ensures output will indicate, for each commit revision, whether it is up-to-date with its remote repository, if one exists. Older versions of git don’t show this information by default.
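Once a project has more than a handful of commits, the full log gets long. Two standard options that help are --oneline, which prints one line per commit, and -N, which limits the output to the N most recent commits. A sketch in a scratch repository, with placeholder commits:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

# Make three small commits to have some history to look at.
for msg in "Initial commit" "Add a basic readme file" "Add Docstring"; do
    echo "$msg" >> notes.txt
    git add notes.txt
    git commit --quiet -m "$msg"
done

# One line per commit: short identifier, any branch labels, and the summary.
git log --oneline --decorate

# Only the two most recent commits.
git log --oneline -2
```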

Modifying a file

Now suppose we modify an existing file, for example by adding a Docstring to the top of one of the files:

$ nano climate_analysis.py
""" Climate Analysis Tools """

When we run git status now, it tells us that a file it already knows about has been modified:

$ git status
# On branch main
# Your branch is ahead of 'origin/main' by 1 commit.
#   (use "git push" to publish your local commits)
#
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   climate_analysis.py
#
no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: “no changes added to commit”.

So, while we have changed this file, we haven’t told Git we will want to save those changes (which we do with git add), much less actually saved them (which we do with git commit).

It’s important to remember that Git only stores changes when you make a commit.

Review Changes and Commit

It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently committed version:

$ git diff
diff --git a/climate_analysis.py b/climate_analysis.py
index 277d6c7..d5b442d 100644
--- a/climate_analysis.py
+++ b/climate_analysis.py
@@ -1,3 +1,4 @@
+""" Climate Analysis Tools """
 import sys
 import temp_conversion
 import signal

The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other.

The key things to note are:

  1. Line 1: The files that are being compared (a/ and b/ are labels, not paths).
  2. Line 2: The two hex strings are parts of the hashes of the files being compared.
  3. Line 5: The lines that have changed. (It’s complex)
  4. Below that, the changes - note the ‘+’ marker, which shows an addition.
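The line beginning @@ is less mysterious than it looks. In @@ -1,3 +1,4 @@, the first pair describes the old file (starting at line 1, 3 lines shown) and the second pair describes the new file (starting at line 1, 4 lines shown). The sketch below reproduces that header with a cut-down, three-line stand-in for climate_analysis.py, which is an assumption made to keep the example short.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

# A three-line stand-in for the real script.
printf 'import sys\nimport temp_conversion\nimport signal\n' > climate_analysis.py
git add climate_analysis.py
git commit --quiet -m "Initial commit"

# Add one new line at the top, making the file four lines long.
printf '""" Climate Analysis Tools """\nimport sys\nimport temp_conversion\nimport signal\n' > climate_analysis.py

# Show just the hunk header from the diff.
git diff | grep '^@@'
```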

What About Jupyter Notebooks?

Git works best with plain text files containing just code (or data). If you’re using something like a Jupyter Notebook, which contains a mix of code, data and outputs, git diff can be unhelpfully messy.

Fortunately, though, the nbdime Python package includes an add-on that provides helpful, graphical git diff outputs for Jupyter Notebooks.

If you have large chunks of code in your notebooks, then once you’re confident they’re correct it’s best to split them out into .py files and import them back in. It makes them work better with Git, and also makes them easy to reuse - so you don’t keep copy-pasting them between files!

What If I’ve Already Added?

If you’ve already used git add, git diff won’t show anything. However, if you use git diff --staged it’ll show added changes.
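The difference is easy to see by running both forms before and after git add. A sketch in a scratch repository, using a placeholder file:

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"
echo "line one" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

# Make a change but do not stage it yet.
echo "line two" >> notes.txt
git diff --stat            # the unstaged change is listed

# Stage it, then compare the two forms of git diff.
git add notes.txt
git diff --stat            # prints nothing: no unstaged changes remain
git diff --staged --stat   # the same change, now reported as staged
```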

After reviewing our change, it’s time to commit it:

$ git commit -m "Add Docstring"
# On branch main
# Your branch is ahead of 'origin/main' by 1 commit.
#   (use "git push" to publish your local commits)
#
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   climate_analysis.py
#
no changes added to commit (use "git add" and/or "git commit -a")

Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:

$ git add climate_analysis.py
$ git commit -m "Add Docstring"
[main 55d3f56] Add Docstring
 1 file changed, 1 insertion(+)

Git insists that we add files to the set we want to commit before actually committing anything, because we may not want to commit everything at once.

For example, suppose we have fixed a bug in some existing code, but have also added new code that’s not ready to share.

One More Change

We want to remind ourselves of some changes we need to make to a file. Using nano, add a line to the end of the climate_analysis.py file saying something like:

# TODO: Add rainfall processing code

Then check your edits, and commit them to your repository with the message “Added rainfall processing placeholder”. When you’re done, git status should show nothing to commit, working directory clean.

Solution

Edit the file using nano, remembering to use Control-O to write out, Enter to confirm the filename, then Control-X to quit:

$ nano climate_analysis.py

Now we’ve edited the file, we can check the changes:

$ git diff
diff --git a/climate_analysis.py b/climate_analysis.py
index d5b442d..6f8ed8a 100644
--- a/climate_analysis.py
+++ b/climate_analysis.py
@@ -26,3 +26,5 @@ for line in climate_data:
            kelvin = temp_conversion.fahr_to_kelvin(fahr)

            print(str(celsius)+", "+str(kelvin))
+
+# TODO: Add rainfall processing code

Now we can add the changes to our staging area, then commit them to our repository:

$ git add climate_analysis.py
$ git commit -m "Added rainfall processing placeholder"

Now we’ve got the basic loop of using Git sorted - we make changes, add them, then create a new commit with a descriptive message.

Key Points

  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log --decorate lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.


Exploring History

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • How can I review my changes?

  • How can I recover old versions of files?

Objectives
  • Identify and use Git revision numbers.

  • Compare files with previous versions of themselves.

  • Restore old versions of files.

Exploring History

We’ve seen that git log gives us some information on what commits were made when, but let’s look a bit deeper at the specifics:

$ git log
commit f15ad111042cee7492f40ad6ff0ec18588fce753 (HEAD -> main)
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 30 17:15:47 2022 +0100

    Add rainfall processing placeholder

commit 6aeaf44173344939e9994d7ccb5512fc5b26c211
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 30 17:14:14 2022 +0100

    Add Docstring

commit 503f02b5f51d5622121e204494dfabc9b2ae7410
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 30 17:12:02 2022 +0100

    Added a basic readme file

commit 499b6d18b36a25d3f5ab9be1b708ea48fef1dd65 (origin/main, origin/HEAD)
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 16 14:19:13 2022 +0000

    Initial commit

We can see commits identified by long IDs, but also HEAD at the top of the log. HEAD is the name used to refer to the most recent end of the chain of commits to our local repository.

Relative History

What if somehow we’ve introduced a bug, and we want to see what’s changed between our latest version of the code and the copy that was working last commit, or a few commits ago? Which lines did we edit, and what did we add?

We can use git diff again to look for differences between files, but this time compare against the versions saved in older commits. We refer to previous commits using the ~ notation: HEAD~1 (pronounced “head minus one”) means “the previous commit”, while HEAD~123 goes back 123 commits from the latest one.

$ git diff HEAD~1 climate_analysis.py
diff --git a/climate_analysis.py b/climate_analysis.py
index d5b442d..c463f71 100644
--- a/climate_analysis.py
+++ b/climate_analysis.py
@@ -26,3 +26,5 @@ for line in climate_data:
             kelvin = temp_conversion.fahr_to_kelvin(fahr)

             print(str(celsius)+", "+str(kelvin))
+
+# TODO: Add rainfall processing code

So we see the difference between the file as it is now, and as it was in the commit before the latest one.

$ git diff HEAD~2 climate_analysis.py
diff --git a/climate_analysis.py b/climate_analysis.py
index 277d6c7..c463f71 100644
--- a/climate_analysis.py
+++ b/climate_analysis.py
@@ -1,3 +1,4 @@
+""" Climate Analysis Tools """
 import sys
 import temp_conversion
 import signal
@@ -25,3 +26,5 @@ for line in climate_data:
             kelvin = temp_conversion.fahr_to_kelvin(fahr)

             print(str(celsius)+", "+str(kelvin))
+
+# TODO: Add rainfall processing code

And here we see the state before the last two commits, HEAD minus 2.
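If you are ever unsure which commit a relative reference points at, you can ask Git to print that commit's message. The sketch below does this in a scratch repository with three placeholder commits named first, second and third.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Demo User"            # placeholder identity
git config user.email "demo@example.com"

for n in first second third; do
    echo "$n change" >> notes.txt
    git add notes.txt
    git commit --quiet -m "$n commit"
done

# Summary line of the latest commit, the one before it, and the one before that.
git log -1 --format=%s HEAD
git log -1 --format=%s HEAD~1
git log -1 --format=%s HEAD~2

# What has changed in notes.txt since two commits ago?
git diff HEAD~2 notes.txt
```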

Absolute History

What about if we want to compare our version of the code to the version from last month, or from the version we used to make a paper last year? Calculating the number of commits just isn’t realistic - it’ll change all the time as we keep working on the code. Instead, we can refer to specific revisions using those long strings of digits and letters that git log displays.

These are unique IDs for the commits, and “unique” really does mean unique: every commit to any set of files on any machine has its own 40-character identifier (a SHA-1 hash computed from the commit’s contents: the snapshot of the files, the author, date and message, and the ID of the parent commit).
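As a rough illustration of where these 40-character IDs come from, the Python sketch below hashes the contents of a single file the way Git does for file contents: a short header, then the bytes, run through SHA-1. The function name git_blob_id is made up for this example. Commit IDs come from the same kind of hashing applied to the commit's own record, which this sketch does not attempt to reproduce.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git gives to a file with exactly these contents."""
    # Git prefixes the contents with "blob <size in bytes>" and a zero byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
print(empty_id, len(empty_id))

# Changing even one character gives a completely different ID.
print(git_blob_id(b"import sys\n"))
print(git_blob_id(b"import sys \n"))
```

If you have a file on disk, git hash-object followed by the file name reports the same kind of ID, so you can compare the two.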

If we scroll down to the bottom of the git log output, we can see the ID for our first commit. In the example above, it’s 499b6d18b36a25d3f5ab9be1b708ea48fef1dd65 (but yours will be different!). However, 40 characters just isn’t very practical to type out, so you can use just as many characters from the start of an ID as you need to pick out a single, unique commit. By default, Git suggests you use the first seven characters.
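The idea of using "just enough" of an ID can be sketched in a few lines of Python. This is an illustration of the matching rule rather than Git's own code: resolve_prefix is a made-up name, and the list holds two full-length IDs that appear in this lesson's example output.

```python
def resolve_prefix(prefix, commit_ids):
    """Return the one full ID that starts with `prefix`, or complain."""
    matches = [cid for cid in commit_ids if cid.startswith(prefix)]
    if not matches:
        raise LookupError(f"no commit ID starts with {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{prefix!r} is ambiguous: {len(matches)} commits match")
    return matches[0]


known_ids = [
    "499b6d18b36a25d3f5ab9be1b708ea48fef1dd65",
    "493dd81b5d5b34211ccff4b5d0daf8efb3147755",
]

# Seven characters are plenty to pick out one commit here.
print(resolve_prefix("499b6d1", known_ids))

# Two characters are not: both IDs start with "49".
try:
    resolve_prefix("49", known_ids)
except LookupError as error:
    print(error)
```

The bigger the repository, the more characters you may need before a prefix matches only one commit.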

Exploring Absolute History

Using unique commit IDs, get a summary of all the changes you’ve made to the climate_analysis.py file since the initial commit.

Solution

We can use git log to get a list of all the commits we’ve made:

$ git log
[snipped for space]
commit 499b6d18b36a25d3f5ab9be1b708ea48fef1dd65 (origin/main, origin/HEAD)
Author: Sam Mangham <mangham@gmail.com>
Date:   Wed Mar 16 14:19:13 2022 +0000

    Initial commit

Then, we can take the first 7 characters of the commit ID of the initial commit, and use them with git diff:

$ git diff 499b6d1 climate_analysis.py
diff --git a/climate_analysis.py b/climate_analysis.py
index 277d6c7..6f8ed8a 100644
--- a/climate_analysis.py
+++ b/climate_analysis.py
@@ -1,3 +1,4 @@
+""" Climate Analysis Tools """
 import sys
 import temp_conversion
 import signal
@@ -25,3 +26,5 @@ for line in climate_data:
             kelvin = temp_conversion.fahr_to_kelvin(fahr)
 
             print(str(celsius)+", "+str(kelvin))
+
+# TODO: Add rainfall processing code

Being able to reference specific commits absolutely is particularly handy, as it lets you exactly identify specific versions of the code. For example, you can identify the version of the code you used to write your first paper, and the different, newer version you used to write your second paper.

Differencing

Other Ways To Reference Commits

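Git has some more advanced ways of referencing past commits. In place of HEAD~1 you can use the shorthand HEAD~. There is also the HEAD@{1} notation, which looks at your local record of where HEAD has pointed (the ‘reflog’) rather than at the commit history itself, so it often, but not always, gives the same answer as HEAD~1. The same notation lets you use text to ask more advanced questions, like git diff HEAD@{"yesterday"} or git diff HEAD@{"3 months ago"}!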

You can also ‘tag’ a commit with a name, using git tag to create tags with IDs and descriptions. If you’ve got a version of the code you’ve used in a paper, for example, tagging it is a good idea:

$ git tag -a v1.0 -m "Version 1.0, used in paper Mangham2024."

Restoring Files

All right: we can save changes to files and see what we’ve changed. But what if we need to restore older versions of things?

Let’s suppose we accidentally overwrite or delete our file:

$ rm climate_analysis.py
$ ls
README.md
temp_conversion.py

Whoops!

git status now tells us that the file has been deleted, but that the deletion hasn’t been staged:

$ git status
# On branch main
# Your branch is ahead of 'origin/main' by 3 commits.
#   (use "git push" to publish your local commits)
#
# Changes not staged for commit:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git restore <file>..." to discard changes in working directory)
#
#	deleted:    climate_analysis.py
#
no changes added to commit (use "git add" and/or "git commit -a")

Following the helpful hint in that output, we can put things back the way they were by using git restore:

$ git restore climate_analysis.py
$ cat climate_analysis.py
[SNIPPED - but changes rolled back]

By default, restore replaces the file with the version of it in the staging area. If you haven’t used git add, that should be the same as the version in the last commit. But what if we already used git add on our incorrect version of a file, or we broke the file more than one commit ago?

We can use git checkout, e.g.:

$ git checkout <HEAD or commit ID> climate_analysis.py

Compatibility Notice

Older versions of Git don’t include the git restore command. Fortunately, for this job it behaves like git checkout --, so if git restore doesn’t work, try git checkout -- climate_analysis.py. checkout has a lot of functions, and newer versions of Git simplify things by giving them new names.

Double Whoops

What if you accidentally did git rm climate_analysis.py? That command tells Git to delete the file and remove it from the repository - so it will record that the file has been deleted, then stop tracking further changes. Even if you re-make the file, it won’t be tracked until you use git add on it again.

The file still exists in the history, though, so if you want to undo this you can use git checkout HEAD climate_analysis.py to get the file back and start tracking it again. Since you can retrieve any file that existed in a previous commit, even if you removed it from later ones, it’s important not to commit files containing passwords or sensitive information! To avoid this, you can use a .gitignore file to stop you adding sensitive files in the first place. You can fully delete files from a repository’s history with tools like the BFG, but doing so can be high risk.

Restoring Files

The fact that files can be reverted one by one tends to change the way people organize their work.

Consider a situation where all your code is in one file, and you fixed a bug in one section but accidentally introduced one elsewhere.

You can’t just roll back to fix one bug without un-fixing the other. However, if each section is in its own file, you can just roll back the section you broke!

Key Points

  • git diff displays differences between commits.

  • git checkout recovers old versions of files.


Remote Repositories

Overview

Teaching: 25 min
Exercises: 0 min
Questions
  • How do I work with a remote repository?

Objectives
  • Understand git push and git pull

  • Encounter and resolve a conflict

We’ve learned how to use a local repository to store our code and view changes:

Local Repository Commands

Now, however, we’d like to share the changes we’ve made to our code with others, as well as making sure we have an off-site backup in case things go wrong. We need to upload our changes in our local repository to a remote repository.

Why Have an Off-site Backup?

You might wonder why having an off-site backup (i.e. a copy not stored at your University) is so important. In 2005, a fire destroyed a building at the University of Southampton. Some people’s entire PhD projects were wiped out in the blaze. To ensure your PhD only involves a normal level of suffering, please make sure you have off-site backups of as much of your work as possible!

Mountbatten Fire

To do that, we’ll use the remote repository we set up on GitHub at the start of the workshop. It’s another repository, just like the local repository on our computer, and Git makes it easy to send data to it and receive data from it. Multiple local repositories can connect to the same remote repository, allowing you to collaborate with colleagues easily.

Remote Repositories

So we’re finally going to address all those “Your branch is ahead of ‘origin/main’ by 3 commits” messages we got from git status! However, GitHub doesn’t let just anyone push to your repository - you need to prove you’re the owner (or have been given access). Fortunately, we already set up an SSH key earlier.

Now we can synchronise our code to the remote repository, with git push:

$ git push
Counting objects: 11, done.
Delta compression using up to 32 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.11 KiB | 0 bytes/s, done.
Total 9 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 1 local object.
To git@github.com:smangham/climate-analysis
   70bf8f3..501e88f  main -> main

And we’re done! This bit was easy because, when we used git clone earlier, it set up our local repository to track the remote repository. The main -> main line shows we’re sending our local branch called main to the remote repository as a branch called main.

What is a Branch, Though?

Branches allow you to have alternate versions of the code ‘branching off’ from another branch (e.g. main). You can try out new features in these branches without disrupting your main version of the code, then merge them in once you’ve finished. We have a Stretch Episode that gives you a brief introduction to them!

If we go back to the repository on GitHub, we can refresh the page and see our updates to the code:

Updated remote repository

Conveniently, the contents of README.md are shown on the main page, with formatting. You can also add links, tables and more. Your code should always have a descriptive README.md file, so anyone visiting the repo can easily get started with it.

How often should I push?

Every day. You can never predict when your hard disk will fail or your building will be destroyed!

In case of fire, git commit, git push, leave building (Credit: Mitch Altman, CC BY-SA 2.0)

Collaborating on a Remote Repository

Now we know how to push our work from our local repository to a remote one, we need to know the reverse - how to pull updates to the code that someone else has made.

We want to invite other people to collaborate on our code, so we’ll update the README.md with a request for potential collaborators to email us at our University email address.

nano README.md
cat README.md
# Climate Analysis Toolkit

This is a set of python scripts designed to analyse climate datafiles.

If you're interested in collaborating, email me at s.w.mangham@soton.ac.uk.
git commit -am "Added collaboration info"
[main 39a2c8f] Added collaboration info
 1 file changed, 2 insertions(+)

In this case, we use git commit -am, where the -a means ‘commit all the modified files that Git is already tracking’ (that is, files we’ve previously used git add on), and the -m means ‘and here’s the commit message’ as usual. It’s a handy shortcut.

But don’t push to GitHub just yet! We’re going to set up a small conflict, of the kind you might see when working with a remote repository. What happens if you change a file at the same time as one of your collaborators does, and you both commit those changes? How does Git know which version of the file is ‘correct’?

Pretending to be an existing collaborator, we’ll go and add some installation instructions by editing our README.md file directly on GitHub. Editing on GitHub isn’t common, but it can be useful if you want to quickly make some small changes to a single file. We edit it as:

GitHub edit button

And just expand it a little, making more use of GitHub’s markdown formatting:

GitHub editing Readme

Then commit the changes directly to our main branch with a descriptive commit message:

GitHub committing edit

Updated remote repository

Push Conflicts

Great. Now let’s go back to the terminal and try pushing our local changes to the remote repository. This is going to cause problems, just as we expected:

git push
To git@github.com:smangham/climate-analysis
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'git@github.com:smangham/climate-analysis'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first merge the remote changes (e.g.,
hint: 'git pull') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Git helpfully tells us that actually, there are commits present in the remote repository that we don’t have in our local repository.

Merge Conflicts

We’ll need to pull those commits into our local repository before we can push our own updates back!

git pull
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From github.com:smangham/climate-analysis
   501e88f..023f8f6  main       -> origin/main
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.

Compatibility Notice

Older versions of git default to merging when the local and remote ‘histories’ have diverged. Newer versions don’t pick a strategy for you, and give you a message like:

hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint:   git config pull.rebase false  # merge
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
fatal: Need to specify how to reconcile divergent branches.

We want to default to merging. Fast-forward-only and rebase are more advanced options you’d typically only see used in large teams in industry. So, as git suggests, we can fix our problem with:

git config --global pull.rebase false
git pull

Now git pull will merge by default, and we’ll see the merge conflict shown above.

We have created a conflict! Both we and our remote collaborator edited README.md. Let’s take a look at the file:

cat README.md
# Climate Analysis Toolkit

This is a set of python scripts designed to analyse climate datafiles.

<<<<<<< HEAD
If you're interested in collaborating, email me at s.w.mangham@soton.ac.uk.
=======
To install a copy of the toolkit, open a terminal and run:

    git clone git@github.com:smangham/climate-analysis.git


**This code is currently in development and not all features will work**
>>>>>>> 493dd81b5d5b34211ccff4b5d0daf8efb3147755

Git has tried to auto-merge the files, but unfortunately failed. It can handle most conflicts by itself, but if two commits edit the exact same part of a file it will need you to help it.

We can see the two different edits we made to the end of the README.md file, in a block defined by <<<, === and >>>. The top block is labelled HEAD (the changes in our latest local commit), whilst the bottom block is labelled with the commit ID of the commit we made on GitHub.
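It’s easy to miss a marker when tidying up a longer file by hand. As a small safety net, here is a hedged Python sketch that scans some text for leftover conflict markers before you commit. The function name find_conflict_markers is made up for this example, and the check is deliberately simple: it only looks for the three marker styles shown above at the start of a line.

```python
def find_conflict_markers(text):
    """Return (line number, kind) pairs for any conflict markers left in `text`."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("<<<<<<< "):
            found.append((number, "start"))
        elif line == "=======":
            found.append((number, "separator"))
        elif line.startswith(">>>>>>> "):
            found.append((number, "end"))
    return found


conflicted = """\
# Climate Analysis Toolkit

<<<<<<< HEAD
If you're interested in collaborating, email me at s.w.mangham@soton.ac.uk.
=======
To install a copy of the toolkit, open a terminal and run:
>>>>>>> 493dd81b5d5b34211ccff4b5d0daf8efb3147755
"""

for number, kind in find_conflict_markers(conflicted):
    print(f"line {number}: conflict {kind} marker")
```

An empty result means none of these markers remain. Note that a line consisting of exactly seven = signs would also be reported, even if it was meant as a Markdown heading underline.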

We can easily fix this using nano, by deleting all the markers and keeping the text we want:

nano README.md
cat README.md
# Climate Analysis Toolkit

This is a set of python scripts designed to analyse climate datafiles.

If you're interested in collaborating, email me at s.w.mangham@soton.ac.uk.

To install a copy of the toolkit, open a terminal and run:

    git clone git@github.com:smangham/climate-analysis.git


**This code is currently in development and not all features will work**

Now we’ve got a fixed and finished README.md file, we can commit our changes, and push them up to our remote repository:

git commit -am "Fixed merge conflict"
[main 6f4df16] Fixed merge conflict
git push
Counting objects: 10, done.
Delta compression using up to 32 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 774 bytes | 0 bytes/s, done.
Total 6 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 1 local object.
To git@github.com:smangham/climate-analysis
   023f8f6..09f5151  main -> main

Now back on GitHub we can see that our README.md shows the text from both commits, and our conflict is resolved:

Resolved conflict on GitHub

Now we can develop our research code collaboratively with others.

Conflict Mitigation

If you’ve got multiple different people working on a code at once, then the branches we mentioned earlier can really help reduce conflicts. Each collaborator can work on their own branch, and only merge them back in once everything is finished - dramatically reducing the number of conflicts!

Remote Repository Commands

Key Points

  • Git can easily synchronise your local repository with a remote one

  • GitHub needs an SSH key to allow access

  • Git can resolve ‘conflicting’ modifications to text files


Branching

Overview

Teaching: 20 min
Exercises: 0 min
Questions
  • What are branches?

Objectives
  • Understand why you would use a branch

  • Understand git branch and git merge

Optional Episode

If you don’t want to do this section, just head straight to the survey!

We’ve seen branches mentioned a lot so far - mostly main. So what are they?

A branch is a parallel version of a repository. It can branch off from a commit, contain its own set of extra commits and edits to files, then easily merge back into the branch it came off (or even another!). We can visualise this flow of splitting and merging branches like this:

Git Feature-branch workflow

Why Use Branches?

If you’re a user of a code, and don’t plan to do any development, you might never have to interact with branches. You’ll download the main branch, containing the most recent, stable version of the code, and just use that. Likewise, if you create a new repository for a small code with only a single developer, then as long as you aren’t sharing the code or its outputs you can just do all your work on the main branch like we’ve been doing.

However, if you plan on making changes to an existing code, collaborating with others, or sharing your code or its outputs, then you’ll definitely want to use branches - as they make your life a lot easier.

Sharing Your Code: main and dev branches

As mentioned, if you’re using an existing code written by somebody else, you’ll typically just download the main branch and use that. What if, though, the author(s) of the code want to continue working on it without the potential users downloading half-finished or untested code? They could keep all their changes local and only commit and push once a new feature has been completed and rigorously tested, but that’s not particularly sustainable for large features. It could potentially take months to add a new feature (a long time to go without a backup!), and you might want to share the work-in-progress version with others to test.

The traditional way to do this is to create a development branch (dev or develop) coming off the main branch (main or master). The main branch contains tested, finished code that can be shared with others, whilst the development branch contains work-in-progress code. Typically you merge your development branch into your main branch when your work on it has been tested and is ready to share - for example, when you release a paper using it. Then you can continue working on your development branch and sharing your development code with other members of your group.

Making Changes to an Existing Code: Feature branches

Once you have a working code, particularly one that’s being shared, you’ll inevitably want to add new features. You could add them directly to your development branch - however, what happens if, mid-way through, you need to pause the feature and switch to something else as you wait for simulations to finish, new data to arrive, or similar? Instead of ending up with a mess of multiple half-finished modifications, which are impossible to evaluate independently of each other, you can create a new feature branch coming off your development branch for each new feature. You work on each new feature or bugfix in its own feature branch, and merge it back into your development branch once it’s tested and complete. Then, as before, once you’re ready to publish a paper using your new functionality you merge it all back into the main branch.

Collaborating With Others: Feature branches

Feature branches also make collaborating with others far easier! Instead of stepping on each other’s toes by making conflicting edits to the same files, you can simply each work on your own branch. GitHub offers features to help manage collaborations too, by limiting who can merge their work into a branch without approval, allowing you to set up workflows where newer team members run their changes past those with experience.

Merging

We’ve mentioned merges repeatedly; as Git tracks the changes made to each file in each commit, it can easily determine whether or not the changes made in two branches conflict with each other. It can intelligently merge together two modified versions of a file where their changes don’t overlap, and highlight sections where they do for you to resolve, showing both versions of the code.

Merges between branches use the same conflict resolution we saw earlier - new files are added seamlessly, whilst modified files use smart conflict resolution and might need your intervention if there’s a clash!
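The decision rule behind this kind of merge can be sketched in Python. This is a toy, not Git’s algorithm: it assumes all three versions of the file have the same number of lines and compares them line by line against their common starting point, whereas Git works on whole blocks of changes and copes with inserted and deleted lines. The name merge_lines is made up for this example.

```python
def merge_lines(base, ours, theirs):
    """Toy three-way merge of three equally long lists of lines.

    Returns (merged, conflicts): `merged` has None where a person must
    decide, and `conflicts` lists the positions of those lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this toy only handles edits that keep the line count")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (or neither touched the line)
            merged.append(o)
        elif o == b:      # only their side changed the line
            merged.append(t)
        elif t == b:      # only our side changed the line
            merged.append(o)
        else:             # both sides changed it, in different ways
            merged.append(None)
            conflicts.append(index)
    return merged, conflicts


base = ["import sys", "x = 1", "print(x)"]
ours = ["import sys", "x = 2", "print(x)"]
theirs = ["import sys", "x = 1", "print(x, 'done')"]
print(merge_lines(base, ours, theirs))

# If both sides edit the same line differently, it is flagged instead.
print(merge_lines(base, ours, ["import sys", "x = 3", "print(x)"]))
```

The key point is the third input: without the common starting version, there would be no way to tell which side actually made a change.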

The Basics

We can use the git branch command to list the branches in our local repository, and let us know which we’re on:

git branch
* main

At the moment, we only have one - main - and the asterisk tells us it’s the one we’re currently on. We can check this by creating a new branch using git branch new_branch_name, and listing them again:

git branch dev
git branch
  dev
* main

Now we’ve got a dev branch set up!

Working with a dev branch

We’ll try a quick example of using the main and dev branches to have a work-in-progress version of the code that we only share when we’ve completed and tested it.

We can switch to our new branch with:

git switch dev
Switched to branch 'dev'

Compatibility Notice

Older versions of Git don’t have git switch - instead, you have to use git checkout dev. As we’ve already seen, checkout has a lot of functions, and newer versions of Git simplify things by giving them new names.

Any commits we make on this branch will exist only on this branch - when you use git switch main to switch back to your main branch, they won’t show up in your git log results!
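One way to picture this is that a branch is just a name pointing at its latest commit, and committing only moves the name you’re currently on. The Python sketch below is a toy model under that assumption; the ToyRepo class, its methods and the commit messages are all made up for illustration, and commit messages double as commit labels to keep it short.

```python
class ToyRepo:
    """A toy model: each branch is a name pointing at its latest commit."""

    def __init__(self):
        self.parents = {}               # commit label -> parent label (None for the first)
        self.branches = {"main": None}  # branch name -> latest commit on that branch
        self.current = "main"

    def commit(self, message):
        self.parents[message] = self.branches[self.current]
        self.branches[self.current] = message  # only the current branch moves
        return message

    def branch(self, name):
        # A new branch starts out pointing at the same commit we are on now.
        self.branches[name] = self.branches[self.current]

    def switch(self, name):
        if name not in self.branches:
            raise ValueError(f"no branch called {name!r}")
        self.current = name

    def log(self):
        history, commit = [], self.branches[self.current]
        while commit is not None:
            history.append(commit)
            commit = self.parents[commit]
        return history


repo = ToyRepo()
repo.commit("Initial commit")
repo.branch("dev")
repo.switch("dev")
repo.commit("Try out a new feature")
dev_log = repo.log()
repo.switch("main")
main_log = repo.log()

print("dev: ", dev_log)
print("main:", main_log)
```

The commit made on dev shows up in the dev history but not in the main one, because main is still pointing at the earlier commit.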

We’ll give it a try. In one of our earlier edits to climate_analysis.py, we mentioned we wanted to process rainfall measurements in our climate data. Let’s imagine these are historic values, in imperial measurements, that we’ll need to convert. We’ll make a new file, and write a simple function to handle it:

nano rainfall_conversion.py
cat rainfall_conversion.py
def inches_to_mm(inches):
    mm = inches * 25.4
    return mm
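
Before committing new code, it’s worth a quick sanity check. The sketch below repeats the function so that it runs on its own, then checks it against values we can work out by hand (one inch is 25.4 mm). The checks are only a suggestion, not part of the lesson’s repository.

```python
import math


def inches_to_mm(inches):
    mm = inches * 25.4
    return mm


# Compare against hand-calculated values, allowing for floating-point rounding.
assert math.isclose(inches_to_mm(1), 25.4)
assert math.isclose(inches_to_mm(10), 254.0)
assert inches_to_mm(0) == 0
print("inches_to_mm gives the expected values")
```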

Now we’ve made the file, we want to commit it to our dev branch. Make sure you’re on the dev branch with git switch dev if you haven’t already, and then add it like we added our changes before:

git add rainfall_conversion.py
git commit -m "Add rainfall module"
[dev b402781] Add rainfall module
 1 file changed, 4 insertions(+)
 create mode 100644 rainfall_conversion.py

So we’ve successfully made a new file, and committed it to our repository, on the dev branch. Let’s take a look at the directory now using ls:

ls
README.md              climate_analysis.py    rainfall_conversion.py temp_conversion.py

We can see that the rainfall_conversion.py file is all present and correct. But we told git that we made it on the dev branch. What happens if we switch back to main with git switch?

git switch main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
ls
README.md           climate_analysis.py temp_conversion.py

The rainfall_conversion.py file isn’t present, as the commit that created it was made on the dev branch. It still exists, and if we use git switch dev it’ll re-appear. However, whilst we’re on main, it’s tidied away into our hidden .git directory.

This doesn’t just work on new files. If you edit an existing file on dev, then when you switch back to main you’ll see the old version.

Remote Branches

Now we’ve made changes to our dev branch, we want to send them up to GitHub, to make sure that we don’t lose any of our development work! Let’s switch back to dev with git switch:

git switch dev
Switched to branch 'dev'

And use git push to synchronise our branch with GitHub, just like we did earlier. However, this time we’ll get an error:

git push
fatal: The current branch dev has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream origin dev

To have this happen automatically for branches without a tracking
upstream, see 'push.autoSetupRemote' in 'git help config'.

When we used git clone it linked up our main branch with the main branch on our GitHub repository automatically. Our dev branch is new, though, and git doesn’t yet know where it should be pushing it to. Fortunately, git has told us what we need to do to tell it (git is good about this!).

The origin argument to git push tells it which remote repository we’re pushing to (we can see a list of them, and their web addresses, with git remote -v). The dev argument tells it to push the current branch to the remote repository as a branch called dev. The --set-upstream flag tells it that we’re setting this behaviour as the default for this branch.

We’ll use -u, the short form of --set-upstream:

git push -u origin dev
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 415 bytes | 415.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
remote: 
remote: Create a pull request for 'dev' on GitHub by visiting:
remote:      https://github.com/smangham/climate-analysis/pull/new/dev
remote: 
To github.com:smangham/climate-analysis.git
 * [new branch]      dev -> dev
branch 'dev' set up to track 'origin/dev'.

Now we’ve got it up on GitHub successfully! Let’s go check on the site:

Switching branch on GitHub

It defaults to showing the main branch, but lets us know there’s been a recent push to a different branch. We can check out what the other branch looks like by clicking on the drop-down on the left and selecting dev:

Viewing dev branch on GitHub

We can see the rainfall_conversion.py file has been uploaded! This makes it easy for us to share work-in-progress versions of our code that others can easily look at.

Linking Remotes

It’s always worth double-checking before you run git push origin dev for the first time - if you’re accidentally still on the main branch, you can end up pushing it to GitHub as a new branch called dev, and having two copies!

To avoid this, we can set the ‘upstream’ for a branch when we make it, using:

git branch --track branchname origin/branchname

But this functionality isn’t available on older versions of git. Alternatively, if your git is new enough to suggest it, you can make it automatically link branches to their remote equivalents with:

git config --global push.autoSetupRemote true

Downloading Branches

It’s easy to share a branch with a collaborator so they can test out a different version of the code. If they clone the repository, like we did back at the start, it defaults to main but they can download the other branches and try them out too, using:

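git clone git@github.com:yourname/climate-analysis.git
cd climate-analysis
git fetch
git switch dev

Here, git fetch makes sure we have up-to-date copies of all the branches on the remote repository, not just the main one, and git switch dev then sets up a local dev branch that tracks the remote one.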

Merging Branches

If we’re happy with the way our work on the dev branch has gone, and we’ve tested it, we can merge the content back in!

Let’s switch back to our main branch:

git switch main
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

Now, to merge the changes from our dev branch into the current (main) branch, we just need to do:

git merge dev
Updating fd30d36..b402781
Fast-forward
 rainfall_conversion.py | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 rainfall_conversion.py

Now, let’s push our updated main branch to GitHub:

git push
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:smangham/climate-analysis.git
   fd30d36..b402781  main -> main

And we can see on GitHub that the two branches are up-to-date:

Main up-to-date on GitHub

Pull Requests

When we looked at GitHub earlier, we saw a banner letting us know we could compare our branches and make a Pull Request:

Main up-to-date on GitHub

A Pull Request is another way of merging branches, that works better when you’re part of a team. There’s an interface for discussing the changes you’ve made with your colleagues, requesting others peer-review your code, and it shows all your changes in detail:

Pull request on GitHub

Then, once you’ve taken a proper look and you’re happy with your changes, you can merge the branches through the GitHub web interface. If you’re working as part of a team, it’s better to make a Pull Request than to use git merge.

Key Points

  • Branches are parallel versions of a repository

  • You can easily switch between branches, and merge their changes

  • Branches help with code sharing and collaboration


Ignoring Things

Overview

Teaching: 5 min
Exercises: 0 min
Questions
  • How can I tell Git to ignore files I don’t want to track?

Objectives
  • Use a .gitignore file to ignore specific files and explain why this is useful.

Optional Episode

If you don’t want to do this section, just head straight to the survey!

What if we have files that we do not want Git to track for us, like backup files created by our editor or intermediate files created during data analysis? Let’s switch to our dev branch, and create a few dummy files:

$ git switch dev
$ mkdir results
$ touch a.dat b.dat c.dat results/a.out results/b.out

and see what Git says:

$ git status
# On branch dev
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	a.dat
#	b.dat
#	c.dat
#	results/
nothing added to commit but untracked files present (use "git add" to track)

Putting these files under version control would be a waste of disk space. What’s worse, having them all listed could distract us from changes that actually matter, so let’s tell Git to ignore them.

We do this by creating a file in the root directory of our project called .gitignore.

$ nano .gitignore
$ cat .gitignore
*.dat
results/

These patterns tell Git to ignore any file whose name ends in .dat and everything in the results directory. (If any of these files were already being tracked, Git would continue to track them.)
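To see how these two patterns behave, here is a rough Python approximation using the standard library’s fnmatchcase. It is a sketch of the matching idea only: the function name is_ignored is made up, and real .gitignore rules cover more than this, such as negating a pattern with ! or anchoring it with a leading /.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, patterns):
    """Approximate whether `path` matches any simple .gitignore pattern."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        if pattern.endswith("/"):
            # A trailing slash means "directories only": check the folders
            # that contain the file, not the file name itself.
            name = pattern.rstrip("/")
            if any(fnmatchcase(part, name) for part in parts[:-1]):
                return True
        elif any(fnmatchcase(part, pattern) for part in parts):
            # A plain pattern can match a name at any level of the path.
            return True
    return False


patterns = ["*.dat", "results/"]
for path in ["a.dat", "results/a.out", "climate_analysis.py", ".gitignore"]:
    print(path, is_ignored(path, patterns))
```

With these patterns, the .dat files and everything under results/ are skipped, while the Python scripts and the .gitignore file itself are still visible to Git.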

Once we have created this file, the output of git status is much cleaner:

$ git status
# On branch dev
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	.gitignore
nothing added to commit but untracked files present (use "git add" to track)

The only thing Git notices now is the newly-created .gitignore file. You might think we wouldn’t want to track it, but everyone we’re sharing our repository with will probably want to ignore the same things that we’re ignoring. Let’s add and commit .gitignore:

$ git add .gitignore
$ git commit -m "Add the ignore file"
$ git status
# On branch dev
nothing to commit, working directory clean

As a bonus, using .gitignore helps us avoid accidentally adding files to the repository that we don’t want.

$ git add a.dat
The following paths are ignored by one of your .gitignore files:
a.dat
Use -f if you really want to add them.
fatal: no files added

If we really want to override our ignore settings, we can use git add -f to force Git to add something. We can also always see the status of ignored files if we want:

$ git status --ignored
# On branch dev
# Ignored files:
#  (use "git add -f <file>..." to include in what will be committed)
#
#        a.dat
#        b.dat
#        c.dat
#        results/

nothing to commit, working directory clean

Force adding can be useful for adding a .gitkeep file. You can’t add empty directories to a repository - they have to have some files within them. But if your code expects there to be a results/ directory to output to, for example, this can be a problem: users will run your code, have it fail at the missing directory, and have to create it themselves.

Instead, we can create an empty .gitkeep file using touch in the results/ directory, and force-add it. As its name starts with a ., it’s a hidden file and won’t appear in ls (only ls -a), but it will ensure that the directory structure is kept as part of your repository.
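If you’d rather create the placeholder from Python than with touch, a minimal sketch using pathlib might look like the one below. The function name add_gitkeep is made up, and the example works inside a temporary directory so that it doesn’t touch your real project.

```python
import tempfile
from pathlib import Path


def add_gitkeep(directory):
    """Create `directory` if needed and put an empty .gitkeep file inside it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    placeholder = directory / ".gitkeep"
    placeholder.touch()  # creates an empty file, like the touch command
    return placeholder


with tempfile.TemporaryDirectory() as project_root:
    placeholder = add_gitkeep(Path(project_root) / "results")
    print(placeholder.name, placeholder.is_file())
```

The file still needs to be force-added with git add -f, as described above, because everything in results/ is ignored.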

Key Points

  • The .gitignore file tells Git what files to ignore.


Survey

Post-Lesson Survey