What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

Creating a Repository
  • git init initializes a repository.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log lists the commits made to the local repository.

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.

  • git remote add origin links a local repository to a remote one and names it ‘origin’.

  • git push copies changes from a local repository to a remote repository.

  • git pull copies changes from a remote repository to a local repository.

  • Branches are versions of a repository that can contain different commits.

  • Pull requests on GitHub can be used to merge different branches together.

  • git clone copies a remote repository to create a local repository with a remote called origin automatically set up.

  • Conflicts occur when different commits change the same lines of the same file.

  • The version control system does not allow changes to overwrite each other, but highlights conflicts so that they can be resolved.

  • git checkout -b creates a new branch and checks it out at the same time.

  • git push -u links a local branch with an ‘upstream’ branch on a remote repository.

  • git pull can pull changes from one branch into another locally.

Ignoring Things
  • The .gitignore file tells Git what files to ignore.

Python Basics
  • Start the Python interpreter by typing python in the shell.

  • Variables are named memory locations that are used to access data.

Arrays, Lists etc
  • A list is an ordered collection of items of any type.

  • Values in the list can be accessed using their index in square brackets e.g. my_list[ix]

  • Lists can be manipulated in place using their methods e.g. my_list.reverse()

  • Ranges of values in a list can be obtained via slicing e.g. my_list[start:stop]
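
  The list operations above can be sketched as follows. The list contents and variable names are illustrative assumptions, not taken from the lesson.

```python
# Hypothetical example list; the values are illustrative only.
my_list = [10, 20, 30, 40, 50]

first = my_list[0]       # indexing starts at 0
last = my_list[-1]       # negative indices count from the end
middle = my_list[1:4]    # slice: start is included, stop is excluded

my_list.reverse()        # reverses the list in place and returns None

print(first, last, middle, my_list)
```

  Note that reverse() changes my_list itself, whereas slicing creates a new list and leaves the original untouched.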

Repeating actions using loops
  • We can use the ‘for in’ syntax to loop over collections or generators.
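
  A minimal sketch of looping over a collection and over a generator. The data values and the countdown generator are hypothetical, written only for this illustration.

```python
# Looping over a collection (a list of hypothetical readings).
temperatures = [36.6, 37.1, 38.4]
total = 0.0
for temp in temperatures:
    total += temp


def countdown(start):
    """Generator that yields start, start - 1, ..., 1 one value at a time."""
    while start > 0:
        yield start
        start -= 1


# Looping over a generator uses exactly the same 'for in' syntax.
ticks = []
for tick in countdown(3):
    ticks.append(tick)

print(total, ticks)
```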

Processing data files
  • The Python function open lets us read ('r') from or write ('w') to files by creating a file handle.

  • We can use string operations such as line.split(',') to process data in files.
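
  As a rough sketch of writing and then reading a comma-separated file: the file name, its contents, and the use of a temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Hypothetical file, created in a temporary directory for this example.
data_path = os.path.join(tempfile.mkdtemp(), "patients.csv")

# 'w' opens the file for writing (replacing any existing content).
with open(data_path, "w") as handle:
    handle.write("alice,36.6\n")
    handle.write("bob,38.4\n")

# 'r' opens the file for reading; iterating over the handle yields lines.
rows = []
with open(data_path, "r") as handle:
    for line in handle:
        name, temp = line.strip().split(",")  # split each line on commas
        rows.append((name, float(temp)))

print(rows)
```

  Using with ensures the file is closed when the block ends, even if an error occurs partway through.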

Making choices
  • We can use logical operations to change the behavior of our code when it meets certain conditions.

  • Using if, elif, and else we can check conditions and add a branch that runs if none of the conditions are met.

  • We can combine conditions using and and or to make more complicated logical statements.
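
  The points above might look like this in practice. The readings and the thresholds are invented for illustration and carry no clinical meaning.

```python
# Hypothetical temperature readings in degrees Celsius.
readings = [34.0, 35.5, 36.6, 38.4]

labels = []
for temp in readings:
    if temp < 35.0 or temp > 41.0:        # 'or': either condition is enough
        labels.append("check reading")
    elif temp >= 36.0 and temp <= 37.5:   # 'and': both conditions must hold
        labels.append("normal")
    elif temp > 37.5:
        labels.append("fever")
    else:                                 # runs only if nothing above matched
        labels.append("low")

print(labels)
```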

Modularising your code using functions
  • A function is created using the def keyword.

  • Functions take parameters that are specified in the function definition and use the return keyword to specify their output.

  • We can use a module to keep our functions separate from the main body of our code to improve code readability.
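
  A small sketch of defining and calling functions. The function names are hypothetical and chosen only for this example.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


print(mean([1, 2, 3]))
print(fahrenheit_to_celsius(212))
```

  If these definitions were saved in a file such as temperature_tools.py (a hypothetical name), the main script could load them with import temperature_tools and call temperature_tools.mean(...).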

Handling Errors
  • Python has built-in error names that hint at the type of problem to look for.

  • We can use the traceback to find which part of the code raised an error.
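
  One way to inspect an error name and its traceback is sketched below. The use of try/except and the standard traceback module is an assumption of this example, and parse_temperature is a hypothetical helper.

```python
import traceback


def parse_temperature(text):
    """Convert text to a float; raises ValueError if text is not a number."""
    return float(text)


try:
    parse_temperature("not a number")
except ValueError as error:
    # The built-in error name hints at the kind of problem.
    error_name = type(error).__name__
    # The traceback text records which function and line raised the error.
    trace_text = traceback.format_exc()

print(error_name)
print(trace_text)
```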

Command-Line Programs
  • Python uses the sys library to access command-line arguments. sys.argv is a list of command-line arguments.

  • The output of a Python program can be used in a pipeline; however, because of the way Python responds when a pipe is closed early (for example by head), we need to use the signal library to make sure piped output is handled correctly.
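
  A minimal sketch of a command-line program, assuming a Unix-like system. The describe_args helper is hypothetical, and resetting SIGPIPE to its default action is one common approach, not necessarily the one used in the lesson.

```python
import signal
import sys


def describe_args(argv):
    """Split an argv-style list into the script name and its arguments."""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments


if __name__ == "__main__":
    # SIGPIPE exists only on Unix-like systems; restoring its default
    # action lets the program stop quietly when the reader of a pipe
    # (for example head) exits early.
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)

    name, args = describe_args(sys.argv)
    for arg in args:
        print(arg)
```

  Saved as, say, show_args.py (a hypothetical name), this could be run as python show_args.py a b c | head -n 1.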

Reading and analysing Patient data using libraries
  • Python has many libraries that add to the core language to improve functionality in specific use cases.

  • NumPy is a numerical Python library that makes working with vectors, matrices, or large data tables easier.

  • NumPy can be used to load datasets directly from CSV files without using Python's built-in file handling.

Data Visualisation
  • We can use matplotlib to create and manipulate a wide variety of plots in Python.

  • Once a plot has been made we can use matplotlib’s function savefig to output it in formats appropriate for publication.

Python Style Guide
  • PEP 8 provides a guide for styling your Python code.

Why Python?
Introducing the Shell
  • The shell lets you define repeatable workflows.

  • The shell is available on systems where graphical interfaces are not.

Files and Directories
  • The file system is responsible for managing information on the disk.

  • Information is stored in files, which are stored in directories (folders).

  • Directories can also store other directories, which then form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory; ls on its own lists the current working directory.

  • pwd prints the user’s current working directory.

  • / on its own is the root directory of the whole file system.

  • Most commands take options (flags) that begin with a -.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • .. means ‘the directory above the current one’; . on its own means ‘the current directory’.

Creating Things
  • Command line text editors let you edit files in the terminal.

  • You can open up files with either command-line or graphical text editors.

  • nano [path] creates a new text file at the location [path], or edits an existing one.

  • cat [path] prints the contents of a file.

  • rmdir [path] deletes an (empty) directory.

  • rm [path] deletes a file; rm -r [path] deletes a directory (and contents!).

  • mv [old_path] [new_path] moves a file or directory from [old_path] to [new_path].

  • mv can be used to rename files, e.g. mv a.txt b.txt.

  • Using . as the destination in mv moves a file without renaming it, e.g. mv a/file.txt b/. moves file.txt into directory b.

  • cp [original_path] [copy_path] creates a copy of a file at a new location.

Pipes and Filters
  • wc counts lines, words, and characters in its inputs.

  • cat displays the contents of its inputs.

  • sort sorts its inputs.

  • head displays the first 10 lines of its input by default.

  • tail displays the last 10 lines of its input by default.

  • command > [file] redirects a command’s output to a file (overwriting any existing content).

  • command >> [file] appends a command’s output to a file.

  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.

  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).

Shell Scripts
  • Save commands in files (usually called shell scripts) for re-use.

  • bash [filename] runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as they complicate variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and ![number] to repeat a command by number.

Finding Things
  • find finds files with specific properties that match patterns.

  • grep selects lines in files that match patterns.

  • --help is an option supported by many commands and programs run from Bash; it displays information on how to use them.

  • man [command] displays the manual page for a given command.

  • $([command]) inserts a command’s output in place.

Additional Exercises
  • date prints the current date in a specified format.

  • Scripts can save the output of a command to a variable using $(command).

  • basename removes directories from a path to a file, leaving only the name.

  • cut lets you select specific columns from files, with -d setting the column separator (e.g. -d',' for commas) and -f selecting the columns you want.


absolute path
A path that refers to a particular location in a file system. Absolute paths are usually written with respect to the file system’s root directory, and begin with either “/” (on Unix) or “\” (on Microsoft Windows). See also: relative path.
argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
command shell
See shell
command-line interface
An interface based on typing commands, usually at a REPL. See also: graphical user interface.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
current working directory
The directory that relative paths are calculated from; equivalently, the place where files referenced by name only are searched for. Every process has a current working directory. The current working directory is usually referred to using the shorthand notation . (pronounced “dot”).
file system
A set of files, directories, and I/O devices (such as keyboards and screens). A file system may be spread across many physical devices, or many file systems may be stored on a single physical device; the operating system manages access.
filename extension
The portion of a file’s name that comes after the final “.” character. By convention this identifies the file’s type: .txt means “text file”, .png means “Portable Network Graphics file”, and so on. These conventions are not enforced by most operating systems: it is perfectly possible to name an MP3 sound file homepage.html. Since many applications use filename extensions to identify the MIME type of the file, misnaming files may cause those applications to fail.
filter
A program that transforms a stream of data. Many Unix command-line tools are written as filters: they read data from standard input, process it, and write the result to standard output.
flag
A terse way to specify an option or setting to a command-line program. By convention Unix applications use a dash followed by a single letter, such as -v, or two dashes followed by a word, such as --verbose, while DOS applications use a slash, such as /V. Depending on the application, a flag may be followed by a single argument, as in -o /tmp/output.txt.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
graphical user interface
A user interface based on visual elements such as windows, icons, and menus, usually controlled by using a mouse. See also: command-line interface.
home directory
The default directory associated with an account on a computer system. By convention, all of a user’s files are stored in or below her home directory.
loop
A set of instructions to be executed multiple times. Consists of a loop body and (usually) a condition for exiting the loop. See also: for loop, while loop.
loop body
The set of statements or commands that are repeated inside a for loop or while loop.
MIME type
MIME (Multipurpose Internet Mail Extensions) types describe different file types for exchange on the Internet, for example images, audio, and documents.
operating system
Software that manages interactions between users, hardware, and software processes. Common examples are Linux, OS X, and Windows.
orthogonal
To have meanings or behaviors that are independent of each other. If a set of concepts or tools are orthogonal, they can be combined in any way.
parameter
A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
parent directory
The directory that “contains” the one in question. Every directory in a file system except the root directory has a parent. A directory’s parent is usually referred to using the shorthand notation .. (pronounced “dot dot”).
path
A description that specifies the location of a file or directory within a file system. See also: absolute path, relative path.
pipe
A connection from the output of one program to the input of another. When two or more programs are connected in this way, they are called a “pipeline”.
process
A running instance of a program, containing code, variable values, open files and network connections, and so on. Processes are the “actors” that the operating system manages; it typically runs each process for a few milliseconds at a time to give the impression that they are executing simultaneously.
prompt
A character or characters displayed by a REPL to show that it is waiting for its next command.
quoting
(in the shell): Using quotation marks of various kinds to prevent the shell from interpreting special characters. For example, to pass the string *.txt to a program, it is usually necessary to write it as '*.txt' (with single quotes) so that the shell will not try to expand the * wildcard.
read-evaluate-print loop
(REPL): A command-line interface that reads a command from the user, executes it, prints the result, and waits for another command.
redirect
To send a command’s output to a file rather than to the screen or another command, or equivalently to read a command’s input from a file.
regular expression
A pattern that specifies a set of character strings. REs are most often used to find sequences of characters in strings.
relative path
A path that specifies the location of a file or directory with respect to the current working directory. Any path that does not begin with a separator character (“/” or “\”) is a relative path. See also: absolute path.
root directory
The top-most directory in a file system. Its name is “/” on Unix (including Linux and Mac OS X) and “\” on Microsoft Windows.
shell
A command-line interface such as Bash (the Bourne-Again Shell) or the Microsoft Windows DOS shell that allows a user to interact with the operating system.
shell script
A set of shell commands stored in a file for re-use. A shell script is a program executed by the shell; the name “script” is used for historical reasons.
standard input
A process’s default input stream. In interactive command-line applications, it is typically connected to the keyboard; in a pipe, it receives data from the standard output of the preceding process.
standard output
A process’s default output stream. In interactive command-line applications, data sent to standard output is displayed on the screen; in a pipe, it is passed to the standard input of the next process.
sub-directory
A directory contained within another directory.
tab completion
A feature provided by many interactive systems in which pressing the Tab key triggers automatic completion of the current word or command.
variable
A name in a program that is associated with a value or a collection of values.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.
wildcard
A character used in pattern matching. In the Unix shell, the wildcard * matches zero or more characters, so that *.txt matches all files whose names end in .txt.