Reference

Key Points

Lesson Schedule
What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

  • GitHub needs an SSH key to allow access.

Creating a Repository
  • git clone creates a local copy of a repository from a URL.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log --decorate lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.

Remote Repositories
  • Git can easily synchronise your local repository with a remote one.

  • GitHub needs an SSH key to allow access.

  • Git can resolve ‘conflicting’ modifications to text files.

Branching
  • Branches are parallel versions of a repository.

  • You can easily switch between branches, and merge their changes.

  • Branches help with code sharing and collaboration.

Ignoring Things
  • The .gitignore file tells Git what files to ignore.

Introduction
  • OpenRefine is a powerful and free, open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps you take in working with your data, and will leave your original data intact.

Opening and Exploring Data
  • Faceting can identify errors or outliers in data.

Transforming Data
  • Clustering can identify outliers in data and help us fix errors in bulk.

  • GREL (General Refine Expression Language) is a powerful tool for transforming data.

Filtering and Sorting Data
  • OpenRefine provides various ways to sort and filter data without affecting the raw data.

Exporting Data Cleaning Steps
  • All changes are being tracked in OpenRefine (apart from individual cell changes and sorting!), and this information can be used for scripts for future analyses or reproducing an analysis.

  • Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Exporting and Saving Data
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Further Resources on OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine

Introduction
  • Well-made software is easier to expand and reuse

  • You need to produce reproducible research.

  • You are a user of your own code.

Issues
  • Issues are a way of recording bugs or feature requests.

  • Issues can be categorised by type.

  • Issues can reference other issues, and be referenced by commits.

Project Management
  • Projects are broken down into self-contained tasks.

  • Tasks are represented as cards on a board.

  • Cards are arranged to show their status.

  • Issues can be added to project boards and labelled.

  • Project boards can show the priority of their tasks.

  • Forks are copies of entire repositories that can be synced up with the original.

Release Management
  • Releases are stable versions of the code.

  • Zenodo can automatically generate DOIs for releases.

  • Software licenses can restrict what others can do with your code.

Writing Sustainable Code
  • Always assume that someone else will read your code at a later date, including yourself.

  • Rename variables and functions to add context to make your code more readable.

  • Add comments to explain why something was done in a certain way if not obvious.

  • Don’t add comments that just restate what code clearly already does.

  • Use docstrings at the start of functions and files to explain their behaviour and input/output parameters.

Managing a Mini-Project
  • Problems with code and documentation can be tracked as issues.

  • Issues can be managed on a project board.

  • Issues can be fixed using the feature-branch workflow.

  • Stable versions of the code can be published as releases.

Python Basics
  • Start the Python interpreter by typing python in the shell.

  • Variables are named memory locations that are used to access data.

Arrays, Lists etc
  • A list is an ordered collection of items of any type.

  • Values in the list can be accessed using their index in square brackets e.g. my_list[ix]

  • Lists can be manipulated in place using their methods, e.g. my_list.reverse()

  • Ranges of values in a list can be obtained via slicing, e.g. my_list[start:stop]
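
A minimal sketch of the list operations above. The list name my_list and its values are invented for illustration only.

```python
# Hypothetical example list; the values are made up for illustration.
my_list = [10, "twenty", 30.0, 40, 50]

first_item = my_list[0]    # indexing starts at 0
last_item = my_list[-1]    # negative indices count back from the end
middle = my_list[1:3]      # slice: items at index 1 and 2 (the stop index is excluded)

my_list.reverse()          # reverses the list in place and returns None
```

Note that slicing produces a new list, whereas reverse() changes the existing one.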

Repeating actions using loops
  • We can use the ‘for in’ syntax to loop over collections or generators.
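
As a rough illustration of the ‘for in’ syntax, the sketch below loops once over a list and once over a generator expression. The variable names and values are assumptions made for the example.

```python
# Hypothetical readings; the values are made up for illustration.
temperatures = [36.6, 37.2, 38.5]

total = 0.0
for temp in temperatures:                  # loop over a collection
    total += temp
mean_temp = total / len(temperatures)

squares = []
for square in (n * n for n in range(4)):   # loop over a generator expression
    squares.append(square)
```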

Processing data files
  • The Python function open lets us read ('r') or write ('w') files by creating a file handle.

  • We can use string operations such as line.split(',') to process data in files.
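
The two points above can be sketched as follows. The file name and its contents are hypothetical, and the file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Hypothetical file name and contents, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "patients.csv")

with open(path, "w") as handle:             # "w" opens the file for writing
    handle.write("alice,34,120\n")
    handle.write("bob,41,135\n")

rows = []
with open(path, "r") as handle:             # "r" opens the file for reading
    for line in handle:
        fields = line.strip().split(",")    # drop the newline, then split on commas
        rows.append(fields)
```

The with statement is one common way to make sure the file is closed afterwards. Every field comes back as a string, so numeric columns still need converting with int() or float().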

Making choices
  • We can use logical operations to change the behavior of our code when it meets certain conditions.

  • Using if, elif, and else we can check conditions and add a branch that runs if none of the conditions are met.

  • We can combine conditions using and and or to make more complicated logical statements.
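
A small sketch of if, elif, and else combined with and and or. The function name and the 38.0 threshold are illustrative assumptions, not values taken from the lesson.

```python
def classify_reading(temp_c, has_symptoms):
    """Return a label for a reading; the threshold is illustrative only."""
    if temp_c >= 38.0 and has_symptoms:      # both conditions must hold
        return "fever with symptoms"
    elif temp_c >= 38.0 or has_symptoms:     # at least one condition holds
        return "monitor"
    else:                                    # runs when no condition was met
        return "normal"
```

For example, classify_reading(38.5, False) takes the elif branch, because only one of the two conditions is true.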

Modularising your code using functions
  • A function is created using the def keyword.

  • Functions take variables that are specified in the function definition and use the return keyword to specify their output.

  • We can use a module to keep our functions separate to the main body of our code to improve code readability.
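
A minimal sketch of a function definition. The function name and the module name mentioned afterwards are hypothetical.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    # temp_f is the input declared in the definition; return specifies the output.
    return (temp_f - 32) * 5 / 9


boiling_c = fahrenheit_to_celsius(212)
```

If this function were saved in a file such as conversions.py (an assumed name), the main script could load it with `from conversions import fahrenheit_to_celsius`.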

Handling Errors
  • Python has built-in error names that give a hint as to the type of problem you are looking for.

  • We can use the traceback to find which bit of the code threw an error.
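
The sketch below shows one way to inspect an error name and its traceback from inside a program. It assumes the use of try/except and the standard traceback module, which the key points above do not mention explicitly, and the function name is made up.

```python
import traceback


def describe_error():
    """Trigger a ValueError on purpose and report its name and traceback."""
    try:
        int("not a number")          # raises the built-in ValueError
    except ValueError as err:
        # The error name hints at the kind of problem; the traceback text
        # lists the file, line and function where it was raised.
        return type(err).__name__, traceback.format_exc()


error_name, trace_text = describe_error()
```

Printing trace_text gives the same kind of report Python shows when an error is not caught, with the most recent call listed last.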

Command-Line Programs
  • Python uses the sys library to access command-line arguments. sys.argv is a list of command-line arguments.

  • Python program output can be used in a pipeline. However, because of the way Python works, we need to use the signal library to make sure it handles piped output correctly.
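
A rough sketch of a command-line program along these lines. The main function is a hypothetical helper, and restoring the default SIGPIPE behaviour is one common approach to piping, not necessarily the exact one used in the lesson.

```python
import signal
import sys


def main(argv):
    """Print each command-line argument on its own line and return how many there were."""
    arguments = argv[1:]            # argv[0] is the name of the script itself
    for arg in arguments:
        print(arg)
    return len(arguments)


if __name__ == "__main__":
    # Restore the default SIGPIPE behaviour so that a pipeline such as
    # `python script.py a b c | head -n 1` ends quietly.
    # SIGPIPE does not exist on Windows, hence the hasattr check.
    if hasattr(signal, "SIGPIPE"):
        signal.signal(signal.SIGPIPE, signal.SIG_DFL)
    main(sys.argv)
```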

Reading and analysing Patient data using libraries
  • Python has many libraries that add to the core language to improve functionality in specific use cases.

  • NumPy is a numerical Python library that makes working with vectors, matrices, or large data tables easier.

  • NumPy can be used to load datasets directly from CSV files, bypassing Python’s built-in file handling.

Data Visualisation
  • We can use matplotlib to create and manipulate a wide variety of plots in Python.

  • Once a plot has been made we can use matplotlib’s function savefig to output it in formats appropriate for publication.

Python Style Guide
  • PEP 8 provides a guide for styling your Python code.

Day 1: Starting with Data
  • Although R has a steeper learning curve than some other data analysis software, R has many advantages - R is interdisciplinary, extensible, great for data wrangling and reproducibility, and produces high quality graphics.

  • Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more).

  • Functions automate sets of commands, many are predefined but it’s also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value.

  • A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters.

  • Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test).

  • Missing data are represented in vectors as NA. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.

  • CSV files can be read in using read.csv().

  • Data frames are a data structure for most tabular data, and what we use for statistics and plotting.

  • It is possible to subset dataframes by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers.

  • Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.

Day 2: Manipulating Data
  • dplyr is a package for making tabular data manipulation easier and tidyr reshapes data so that it is in a convenient format for plotting or analysis. They are both part of the tidyverse package.

  • A subset of columns from a dataframe can be selected using select().

  • To choose rows based on a specific criterion, use filter().

  • Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

  • To create new columns based on the values in existing columns, use mutate().

  • Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions.

  • Dates can be formatted using the package ‘lubridate’.

  • To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package.

  • Export data from a dataframe to a csv file using write_csv().

Day 3: Visualising Data
  • ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.

  • Define an aesthetic mapping (using the aes function), by selecting the variables to be plotted and specifying how to present them in the graph.

  • Add ‘geoms’ – graphical representations of the data in the plot using geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot.

  • Faceting splits one plot into multiple plots based on a factor from the dataset.

  • Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.

  • The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange().

  • Use ggsave() to save a plot and edit the arguments (height, width, dpi) to change the dimension and resolution.

Introducing the Shell
  • The shell lets you define repeatable workflows.

  • The shell is available on systems where graphical interfaces are not.

Files and Directories
  • The file system is responsible for managing information on the disk.

  • Information is stored in files, which are stored in directories (folders).

  • Directories can also store other directories, which then form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory; ls on its own lists the current working directory.

  • pwd prints the user’s current working directory.

  • / on its own is the root directory of the whole file system.

  • Most commands take options (flags) that begin with a -.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • . on its own means ‘the current directory’; .. means ‘the directory above the current one’.

  • --help is an option supported by many Bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs.

  • man [command] displays the manual page for a given command.

Creating Things
  • Command line text editors let you edit files in the terminal.

  • You can open up files with either command-line or graphical text editors.

  • nano [path] creates a new text file at the location [path], or edits an existing one.

  • cat [path] prints the contents of a file.

  • rmdir [path] deletes an (empty) directory.

  • rm [path] deletes a file, rm -r [path] deletes a directory (and contents!).

  • mv [old_path] [new_path] moves a file or directory from [old_path] to [new_path].

  • mv can be used to rename files, e.g. mv a.txt b.txt.

  • Using . in mv can move a file without renaming it, e.g. mv a/file.txt b/. moves file.txt into b under the same name.

  • cp [original_path] [copy_path] creates a copy of a file at a new location.

Wildcards, Pipes and Filters
  • wc counts lines, words, and characters in its inputs.

  • * matches zero or more characters in a filename, so *.txt matches all files ending in .txt.

  • ? matches any single character in a filename, so ?.txt matches a.txt but not any.txt.

  • cat displays the contents of its inputs.

  • sort sorts its inputs.

  • head displays the first 10 lines of its input.

  • tail displays the last 10 lines of its input.

  • command > [file] redirects a command’s output to a file (overwriting any existing content).

  • command >> [file] appends a command’s output to a file.

  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.

  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
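
The wildcard rules above can also be tried outside the shell. The sketch below uses Python's standard fnmatch module, which follows similar * and ? rules, as a stand-in for shell expansion; the file names are invented, and details such as hidden files are handled differently by a real shell.

```python
from fnmatch import fnmatchcase

# Hypothetical file names used only to exercise the patterns.
names = ["a.txt", "any.txt", "notes.md"]

# * matches zero or more characters.
star_matches = [name for name in names if fnmatchcase(name, "*.txt")]

# ? matches exactly one character.
question_matches = [name for name in names if fnmatchcase(name, "?.txt")]
```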

Finding Things
  • find finds files with specific properties that match patterns.

  • grep selects lines in files that match patterns.

  • $([command]) inserts a command’s output in place.

Shell Scripts
  • Save commands in files (usually called shell scripts) for re-use.

  • bash [filename] runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and ![number] to repeat a command by number.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.

Loops
  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

Additional Exercises
  • date prints the current date in a specified format.

  • Scripts can save the output of a command to a variable using $(command)

  • basename removes directories from a path to a file, leaving only the name

  • cut lets you select specific columns from files, with -d',' letting you select the column separator, and -f letting you select the columns you want.

Introduction
  • Good data organisation is the foundation of any research project.

Organising data in spreadsheets
  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data in a plain text file.

  • Organise your data according to tidy data principles.

  • Record metadata in a separate plain text file (such as README.txt) in your project root folder or folder with data.

Common spreadsheet errors
  • Include only one piece of information in a cell.

  • Avoid using multiple tables or spreading data across multiple tabs within one spreadsheet.

  • Record zeros as zeros.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

  • Use an appropriate null value to record missing data.

  • Record units in column headers.

  • Place comments in a separate column.

  • Do not use formatting to convey information.

Dates as data
  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.
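
As an illustration of splitting a date into its component values, here is a minimal sketch using Python's standard datetime module. The date itself is a made-up example.

```python
from datetime import date

# Hypothetical recorded date in ISO 8601 form (YYYY-MM-DD).
recorded = date.fromisoformat("2015-07-23")

# Keep year, month and day as separate integer values (e.g. separate columns).
year, month, day = recorded.year, recorded.month, recorded.day
```

Storing the three values separately avoids depending on how a particular program chooses to display or reinterpret a date.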

Quality assurance and control
  • Always copy your original spreadsheet file and work with a copy so you do not affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.

Exporting data
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
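
To show what these formats look like, the sketch below writes the same small table as CSV and as TSV with Python's standard csv module. The column names, values, and the to_delimited helper are assumptions made for the example.

```python
import csv
import io

# Hypothetical table: a header row followed by two observations.
rows = [["mouse_id", "weight_g"], ["m01", "21.5"], ["m02", "19.8"]]


def to_delimited(table, delimiter):
    """Return the table as plain text with the given column separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")    # comma-separated values
tsv_text = to_delimited(rows, "\t")   # tab-separated values
```

Both results are plain text, which is why most analysis programs can read them consistently.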

Reference

Glossary

cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
metadata
data which describes other data
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)