Reference

Key Points

Lesson Schedule
What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

Creating a Repository
  • git init initializes a repository.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log lists the commits made to the local repository.

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.

Collaborating
  • git remote add origin links a local repository to a remote one and names it ‘origin’.

  • git push copies changes from a local repository to a remote repository.

  • git pull copies changes from a remote repository to a local repository.

  • Branches are versions of a repository that can contain different commits.

  • Pull requests on GitHub can be used to merge different branches together.

  • git clone copies a remote repository to create a local repository with a remote called origin automatically set up.

Conflicts
  • Conflicts occur when different commits change the same lines of the same file.

  • The version control system does not allow changes to overwrite each other, but highlights conflicts so that they can be resolved.

  • git checkout -b creates a new branch and checks it out at the same time.

  • git push -u links a local branch with an ‘upstream’ branch on a remote repository.

  • git pull can pull changes from one branch into another locally.

Ignoring Things
  • The .gitignore file tells Git what files to ignore.

Introduction
  • OpenRefine is a powerful and free, open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps you take in working with your data, and will leave your original data intact.

Opening and Exploring Data
  • Faceting can identify errors or outliers in data.

Transforming Data
  • Clustering can identify outliers in data and help us fix errors in bulk.

  • GREL (General Refine Expression Language) is a powerful tool for transforming data.

Filtering and Sorting Data
  • OpenRefine provides various ways to sort and filter data without affecting the raw data.

Exporting Data Cleaning Steps
  • All changes are being tracked in OpenRefine (apart from individual cell changes and sorting!), and this information can be used for scripts for future analyses or reproducing an analysis.

  • Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Exporting and Saving Data
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Further Resources on OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.

Introduction
  • Well-made software is easier to expand and reuse.

  • You need to produce reproducible research.

  • You are a user of your own code.

Issues
  • Issues are a way of recording bugs or feature requests.

  • Issues can be categorised by type.

  • Issues can reference other issues, and be referenced by commits.

Project Management
  • Projects are broken down into self-contained tasks.

  • Tasks are represented as cards on a board.

  • Cards are arranged to show their status.

  • Issues can be added to project boards and labelled.

  • Project boards can show the priority of their tasks.

  • Forks are copies of entire repositories that can be synced up with the original.

Release Management
  • Releases are stable versions of the code.

  • Zenodo can automatically generate DOIs for releases.

  • Software licenses can restrict what others can do with your code.

Writing Sustainable Code
  • Always assume that someone else will read your code at a later date, including yourself.

  • Rename variables and functions to add context to make your code more readable.

  • Add comments to explain why something was done in a certain way if not obvious.

  • Don’t add comments that just restate what code clearly already does.

  • Use docstrings contained within """ at the start of functions and files to explain their behaviour and input/output parameters.
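
The points above can be sketched in a few lines. This is only an illustration: the function name mean_temperature, its parameter name, and the choice to reject empty input are assumptions made for the example, not part of the lesson.

```python
def mean_temperature(readings_celsius):
    """Return the arithmetic mean of a list of temperature readings.

    Parameters:
        readings_celsius: a non-empty list of numbers, in degrees Celsius.

    Returns:
        The mean reading as a float, in degrees Celsius.
    """
    # An empty list would otherwise fail with ZeroDivisionError, which
    # says little about the real problem, so it is rejected explicitly.
    if not readings_celsius:
        raise ValueError("readings_celsius must not be empty")
    return sum(readings_celsius) / len(readings_celsius)


print(mean_temperature([18.0, 20.0, 22.0]))
```

The descriptive names carry the units, the docstring covers inputs and outputs, and the single comment explains why the check exists rather than restating what the code does.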

Managing a Mini-Project
  • Problems with code and documentation can be tracked as issues.

  • Issues can be managed on a project board.

  • Issues can be fixed using the feature-branch workflow.

  • Stable versions of the code can be published as releases.

Python Basics
  • Start the Python interpreter by typing python in the shell.

  • Variables are named memory locations that are used to access data.

Arrays, Lists etc
  • A list is an ordered collection of items of any type.

  • Values in the list can be accessed using their index in square brackets e.g. my_list[ix]

  • Lists can be manipulated in place using their methods e.g. my_list.reverse()

  • Ranges of values in a list can be obtained via slicing e.g. my_list[start:stop]
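
A short sketch of the list operations above. The contents of my_list are made up for the example.

```python
my_list = ["a", "b", "c", "d", "e"]
mixed = [1, "two", 3.0]       # items in one list may have different types

first_item = my_list[0]       # indexing starts at 0
last_item = my_list[-1]       # negative indices count from the end
middle = my_list[1:4]         # slice: start is included, stop is excluded

my_list.reverse()             # reverses the list in place and returns None
print(first_item, last_item, middle, my_list)
```

Note that slicing returns a new list, so middle is unaffected by the later call to reverse().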

Repeating actions using loops
  • We can use the ‘for in’ syntax to loop over collections or generators.
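
As a minimal sketch, the same 'for in' form works for a list and for a generator expression. The variable names and values are illustrative only.

```python
weights = [2.5, 3.0, 4.5]

total = 0.0
for weight in weights:                    # loop over a collection
    total += weight

squares = []
for value in (n * n for n in range(4)):   # loop over a generator expression
    squares.append(value)

print(total, squares)
```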

Processing data files
  • The Python function open lets us read ('r') or write ('w') files by creating a file handle.

  • We can use string operations such as line.split(',') to process data in files.
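
The sketch below writes a small comma-separated file and reads it back line by line. The file name patients.csv and its contents are hypothetical, and a temporary directory is used so that nothing in your project is touched.

```python
import os
import tempfile

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "patients.csv")

# Mode "w" opens the file for writing (overwriting any existing content).
with open(path, "w") as handle:
    handle.write("id,weight\n")
    handle.write("p1,70.5\n")
    handle.write("p2,82.0\n")

# Mode "r" opens the file for reading; iterating yields one line at a time.
rows = []
with open(path, "r") as handle:
    for line in handle:
        fields = line.strip().split(",")   # drop the newline, split on commas
        rows.append(fields)

print(rows)
```

Using with closes the file automatically, even if an error occurs inside the block. Every field comes back as a string, so numeric columns need converting with float() or int() before use.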

Making choices
  • We can use logical operations to change the behavior of our code when it meets certain conditions.

  • Using if, elif, and else we can check conditions and add a branch that runs if none of the conditions are met.

  • We can combine conditions using and and or to make more complicated logical statements.
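
A minimal sketch of if, elif and else combined with and and or. The function name and the temperature thresholds are invented for illustration.

```python
def describe_temperature(celsius):
    if celsius < 0:
        return "freezing"
    elif celsius >= 0 and celsius < 15:     # both conditions must hold
        return "cold"
    elif celsius > 35 or celsius == 35:     # either condition is enough
        return "hot"
    else:                                   # runs if nothing above matched
        return "mild"


print(describe_temperature(20))
```

Conditions are checked from top to bottom and only the first matching branch runs.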

Modularising your code using functions
  • A function is created using the def keyword.

  • Functions take variables that are specified in the function definition and use the return keyword to specify their output.

  • We can use a module to keep our functions separate from the main body of our code to improve code readability.
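
For example, a small function might look like the sketch below. The name fahrenheit_to_celsius is an assumption made for the example, as is the module name mentioned in the comment.

```python
def fahrenheit_to_celsius(fahrenheit):
    # The parameter is named in the definition; return gives the output.
    return (fahrenheit - 32) * 5 / 9


# If this function were saved in a file such as conversions.py (a
# hypothetical name), the main script could load it with:
#     from conversions import fahrenheit_to_celsius
print(fahrenheit_to_celsius(212))
```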

Handling Errors
  • Python has built-in error names that give a hint about the type of problem you are looking for.

  • We can use the traceback to find which bit of the code threw an error.
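
One way to see both ideas together is to trigger an error on purpose and inspect it. In this sketch the function safe_divide is hypothetical; ZeroDivisionError and ValueError are built-in error names, and traceback is part of the standard library.

```python
import traceback


def safe_divide(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # The error name says what went wrong: division by zero.
        return None


try:
    int("not a number")
except ValueError as error:
    error_name = type(error).__name__        # the built-in error name
    trace_text = traceback.format_exc()      # the traceback as a string

print(error_name)
print(trace_text)
```

The last line of the traceback names the error, and the lines above it show where in the code it was raised.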

Command-Line Programs
  • Python uses the sys library to access command-line arguments; sys.argv is a list of those arguments.

  • The output of a Python program can be used in a pipeline. However, because of the way Python handles pipes, we need the signal library to make sure piped output is handled correctly.
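
A minimal sketch of a command-line program along these lines. The function names main and use_default_sigpipe are assumptions for the example. Restoring the default SIGPIPE behaviour is one common way to let a script exit quietly when a downstream command such as head stops reading early; SIGPIPE does not exist on Windows, so the call is guarded.

```python
import signal
import sys


def use_default_sigpipe():
    # Let the operating system end the program quietly when the reader
    # of a pipe closes it, instead of Python raising BrokenPipeError.
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)


def main(argv):
    # argv[0] is the script name; the remaining items are the arguments.
    arguments = argv[1:]
    for argument in arguments:
        print(argument)
    return len(arguments)


if __name__ == "__main__":
    use_default_sigpipe()
    main(sys.argv)
```

Saved under a hypothetical name such as echo_args.py, this could be run as python echo_args.py a b c | head -1.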

Reading and analysing Patient data using libraries
  • Python has many libraries that add to the core language to improve functionality in specific use cases.

  • Numpy is a numerical Python library that makes working with vectors, matrices, or large data tables easier.

  • Numpy can be used to load datasets directly from CSV files, bypassing Python's built-in file handling.
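
As a rough sketch, numpy.loadtxt reads comma-separated numbers straight into an array. The data here is made up and supplied through io.StringIO so the example runs without a file on disk; with real data you would pass a file name instead. The sketch assumes Numpy is installed.

```python
import io

import numpy

# Two rows of three values, standing in for a CSV file.
csv_text = io.StringIO("1,2,3\n4,5,6\n")
data = numpy.loadtxt(csv_text, delimiter=",")

print(data.shape)           # rows and columns
print(data.mean())          # mean of all values
print(data.max(axis=1))     # maximum of each row
```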

Data Visualisation
  • We can use matplotlib to create and manipulate a wide variety of plots in Python.

  • Once a plot has been made we can use matplotlib’s function savefig to output it in formats appropriate for publication.

Python Style Guide
  • PEP 8 provides a guide for styling your Python code.

Introducing the Shell
  • The shell lets you define repeatable workflows.

  • The shell is available on systems where graphical interfaces are not.

Files and Directories
  • The file system is responsible for managing information on the disk.

  • Information is stored in files, which are stored in directories (folders).

  • Directories can also store other directories, which then form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory; ls on its own lists the current working directory.

  • pwd prints the user’s current working directory.

  • / on its own is the root directory of the whole file system.

  • Most commands take options (flags) that begin with a -.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • .. means ‘the directory above the current one’; . on its own means ‘the current directory’.

Creating Things
  • Command line text editors let you edit files in the terminal.

  • You can open up files with either command-line or graphical text editors.

  • nano [path] creates a new text file at the location [path], or edits an existing one.

  • cat [path] prints the contents of a file.

  • rmdir [path] deletes an (empty) directory.

  • rm [path] deletes a file, rm -r [path] deletes a directory (and contents!).

  • mv [old_path] [new_path] moves a file or directory from [old_path] to [new_path].

  • mv can be used to rename files, e.g. mv a.txt b.txt.

  • Using . as the destination name in mv moves a file without renaming it, e.g. mv a/file.txt b/. moves file.txt into b.

  • cp [original_path] [copy_path] creates a copy of a file at a new location.

Pipes and Filters
  • wc counts lines, words, and characters in its inputs.

  • cat displays the contents of its inputs.

  • sort sorts its inputs.

  • head displays the first lines of its input (10 by default).

  • tail displays the last lines of its input (10 by default).

  • command > [file] redirects a command’s output to a file (overwriting any existing content).

  • command >> [file] appends a command’s output to a file.

  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.

  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).

Shell Scripts
  • Save commands in files (usually called shell scripts) for re-use.

  • bash [filename] runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

Loops
  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and ![number] to repeat a command by number.

Finding Things
  • find finds files with specific properties that match patterns.

  • grep selects lines in files that match patterns.

  • --help is an option supported by many bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs.

  • man [command] displays the manual page for a given command.

  • $([command]) inserts a command’s output in place.

Additional Exercises
  • date prints the current date in a specified format.

  • Scripts can save the output of a command to a variable using $(command).

  • basename removes directories from a path to a file, leaving only the name.

  • cut selects specific columns from files: -d sets the column separator (e.g. -d',') and -f selects the columns you want.

Introduction
  • Good data organisation is the foundation of any research project.

Organising data in spreadsheets
  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data in a plain text file.

  • Organise your data according to tidy data principles.

  • Record metadata in a separate plain text file (such as README.txt) in your project root folder or folder with data.

Common spreadsheet errors
  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs.

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Do not use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

Dates as data
  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.
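
Outside a spreadsheet, the same splitting idea can be sketched in Python. The date value is invented for the example, and date.fromisoformat assumes the date is written as YEAR-MONTH-DAY.

```python
from datetime import date

recorded = date.fromisoformat("2015-11-03")

# Store each component separately, as three plain numbers.
year, month, day = recorded.year, recorded.month, recorded.day

# The components can be recombined into an unambiguous date later.
rebuilt = date(year, month, day)
print(year, month, day, rebuilt.isoformat())
```

Three integer columns cannot be silently reinterpreted in a different date format, which is the main risk with a single date cell.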

Quality assurance and control
  • Always copy your original spreadsheet file and work with a copy so you do not affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.

  • Use sorting to check for invalid data.

  • Use conditional formatting (cautiously) to check for invalid data.

Exporting data
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
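
To show what the two formats look like, the sketch below writes the same small table as CSV and as TSV using Python's csv module. The column names and values are made up, and io.StringIO stands in for a file on disk.

```python
import csv
import io

rows = [["mouse_id", "weight_g"], ["m1", "21.4"], ["m2", "19.8"]]

# CSV: values separated by commas.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, lineterminator="\n").writerows(rows)

# TSV: the same writer, with a tab as the delimiter.
tsv_buffer = io.StringIO()
csv.writer(tsv_buffer, delimiter="\t", lineterminator="\n").writerows(rows)

print(csv_buffer.getvalue())
print(tsv_buffer.getvalue())
```

Both outputs are plain text, so any editor or analysis program can read them without knowing about the spreadsheet application that produced the data.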

Reference

Glossary

cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
metadata
data which describes other data
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)