Reference

Key Points

Lesson Schedule
What is Version Control
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

Setting Up Git
  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

Creating a Repository
  • git init initializes a repository.

  • Git stores all of its repository data in the .git directory.

Tracking Changes
  • git status shows the status of a repository.

  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

  • git add puts files in the staging area.

  • git commit saves the staged content as a new commit in the local repository.

  • Write commit messages that accurately describe your changes.

  • git log lists the commits made to the local repository.
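
The tracking cycle described above can be sketched as follows. This is a minimal illustration, assuming Git is installed; the scratch directory, the file name notes.txt and the "Example User" identity are hypothetical.

```shell
# Work in a throwaway directory so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"

git init -q                              # creates the hidden .git directory
git config user.name "Example User"      # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "Buy milk" > notes.txt              # a change in the working directory
git status --short                       # notes.txt is listed as untracked (??)

git add notes.txt                        # put the file in the staging area
git commit -q -m "Add shopping note"     # record the staged content as a commit

git log --oneline                        # one line per commit
```

The identity is set without --global here so that the sketch leaves the machine-wide configuration alone.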

Exploring History
  • git diff displays differences between commits.

  • git checkout recovers old versions of files.

Collaborating
  • git remote add origin links a local repository to a remote one and names it ‘origin’.

  • git push copies changes from a local repository to a remote repository.

  • git pull copies changes from a remote repository to a local repository.

  • Branches are versions of a repository that can contain different commits.

  • Pull requests on GitHub can be used to merge different branches together.

  • git clone copies a remote repository to create a local repository with a remote called origin automatically set up.
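
The remote commands can be tried without a GitHub account by letting a bare repository on the local disk stand in for the remote. This is only a sketch under that assumption; the directory names project, remote.git and copy are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"

git init -q project
git -C project config user.name "Example User"      # hypothetical identity
git -C project config user.email "user@example.com"
echo "hello" > project/readme.txt
git -C project add readme.txt
git -C project commit -q -m "Add readme"

git init -q --bare remote.git                        # stand-in for a hosted repository
git -C project remote add origin "$work/remote.git"  # link it under the name 'origin'
git -C project push -q -u origin HEAD                # copy local commits to the remote

git clone -q "$work/remote.git" copy                 # new local repo with origin set up
git -C copy remote                                   # prints: origin

echo "more" >> project/readme.txt
git -C project commit -q -am "Extend readme"
git -C project push -q                               # publish the new commit
git -C copy pull -q                                  # bring it into the clone
```

With a hosted service, the path "$work/remote.git" would be replaced by the repository URL.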

Conflicts
  • Conflicts occur when different commits change the same lines of the same file.

  • The version control system does not allow changes to overwrite each other, but highlights conflicts so that they can be resolved.

  • git checkout -b creates a new branch and checks it out at the same time.

  • git push -u links a local branch with an ‘upstream’ branch on a remote repository.

  • git pull can pull changes from one branch into another locally.
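
A conflict can be provoked deliberately to see how Git highlights it. The sketch below assumes Git is installed and uses git merge to combine two local branches; the branch name feature, the file notes.txt and the identity are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"      # hypothetical identity
git config user.email "user@example.com"

echo "colour: red" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
base="$(git rev-parse --abbrev-ref HEAD)"   # name of the starting branch

git checkout -q -b feature                  # create a branch and switch to it
echo "colour: blue" > notes.txt
git commit -q -am "Use blue"

git checkout -q "$base"
echo "colour: green" > notes.txt            # same line changed on the other branch
git commit -q -am "Use green"

# The merge stops instead of overwriting either change.
git merge feature || echo "Merge stopped: resolve the conflict, then commit"
cat notes.txt                               # both versions, between conflict markers
```

To resolve the conflict, edit notes.txt so that it contains the intended line and no markers, then run git add and git commit.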

Ignoring Things
  • The .gitignore file tells Git what files to ignore.
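
A small sketch of an ignore rule in action, assuming Git is installed; the pattern *.log and the file names are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo "*.log" > .gitignore        # ignore every file whose name ends in .log
touch results.csv debug.log

git status --short               # lists .gitignore and results.csv, but not debug.log
```

The .gitignore file itself is usually committed so that collaborators share the same rules.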

Survey
Introduction
  • OpenRefine is a powerful and free, open source tool that can be used for data cleaning.

  • OpenRefine will automatically track any steps you take in working with your data, and will leave your original data intact.

Opening and Exploring Data
  • Faceting can identify errors or outliers in data.

Transforming Data
  • Clustering can identify outliers in data and help us fix errors in bulk.

  • GREL (General Refine Expression Language) is a powerful tool for transforming data.

Filtering and Sorting Data
  • OpenRefine provides various ways to sort and filter data without affecting the raw data.

Exporting Data Cleaning Steps
  • OpenRefine tracks all changes (apart from individual cell changes and sorting), and this record can be used as a script for future analyses or for reproducing an analysis.

  • Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Exporting and Saving Data
  • Cleaned data or entire projects can be exported from OpenRefine.

  • Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Further Resources on OpenRefine
  • Other examples and resources online are good for learning more about OpenRefine.

Survey
Introduction
  • Well-made software is easier to expand and reuse.

  • You need to produce reproducible research.

  • You are a user of your own code.

Issues
  • Issues are a way of recording bugs or feature requests.

  • Issues can be categorised by type.

  • Issues can reference other issues, and be referenced by commits.

Project Management
  • Projects are broken down into self-contained tasks.

  • Tasks are represented as cards on a board.

  • Cards are arranged to show their status.

  • Issues can be added to project boards and labelled.

  • Project boards can show the priority of their tasks.

  • Forks are copies of entire repositories that can be synced up with the original.

Release Management
  • Releases are stable versions of the code.

  • Zenodo can automatically generate DOIs for releases.

  • Software licenses can restrict what others can do with your code.

Writing Sustainable Code
  • Always assume that someone else will read your code at a later date, including yourself.

  • Rename variables and functions to add context and make your code more readable.

  • Add comments to explain why something was done in a certain way if not obvious.

  • Don’t add comments that just restate what code clearly already does.

  • Use docstrings contained within """ at the start of functions and files to explain their behaviour and input/output parameters.

Managing a Mini-Project
  • Problems with code and documentation can be tracked as issues.

  • Issues can be managed on a project board.

  • Issues can be fixed using the feature-branch workflow.

  • Stable versions of the code can be published as releases.

Survey
Python Basics
Arrays, Lists etc
Repeating actions using loops
Processing data files
Making choices
Modularising your code using functions
Handling Errors
Command-Line Programs
Reading and analysing Patient data using libraries
Data Visualisation
Python Style Guide
Survey
Challenges
Why Python?
Day 1: Starting with Data
Day 2: Manipulating Data
Day 3: Visualising Data
Introducing the Shell
  • The shell lets you define repeatable workflows.

  • The shell is available on systems where graphical interfaces are not.

Files and Directories
  • The file system is responsible for managing information on the disk.

  • Information is stored in files, which are stored in directories (folders).

  • Directories can also store other directories, which then form a directory tree.

  • cd [path] changes the current working directory.

  • ls [path] prints a listing of a specific file or directory; ls on its own lists the current working directory.

  • pwd prints the user’s current working directory.

  • / on its own is the root directory of the whole file system.

  • Most commands take options (flags) that begin with a -.

  • A relative path specifies a location starting from the current location.

  • An absolute path specifies a location from the root of the file system.

  • Directory names in a path are separated with / on Unix, but \ on Windows.

  • .. means ‘the directory above the current one’; . on its own means ‘the current directory’.
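
The navigation commands above can be tried safely in a scratch directory. This is a minimal sketch; the directory layout (project/data/a.txt) is hypothetical and is created with mkdir and touch just for the demonstration.

```shell
base="$(mktemp -d)"                 # scratch directory for the demonstration
mkdir -p "$base/project/data"
touch "$base/project/data/a.txt"

cd "$base/project"                  # absolute path: starts from the root /
pwd                                 # prints the current working directory
ls                                  # lists the current directory: data
ls -F data                          # -F is an option (flag) that marks entry types
cd data                             # relative path: starts from the current location
cd ..                               # .. is the directory above, so back in project
ls ./data                           # . is the current directory; prints: a.txt
```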

Creating Things
  • Command line text editors let you edit files in the terminal.

  • You can open up files with either command-line or graphical text editors.

  • nano [path] creates a new text file at the location [path], or edits an existing one.

  • cat [path] prints the contents of a file.

  • rmdir [path] deletes an (empty) directory.

  • rm [path] deletes a file; rm -r [path] deletes a directory (and contents!).

  • mv [old_path] [new_path] moves a file or directory from [old_path] to [new_path].

  • mv can be used to rename files, e.g. mv a.txt b.txt.

  • Using . as the last part of the destination in mv moves a file without renaming it, e.g. mv a/file.txt b/. (the trailing . means 'keep the same name inside b').

  • cp [original_path] [copy_path] creates a copy of a file at a new location.
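
The file-management commands above can be strung together as in the sketch below. The directories a and b and the file names are hypothetical, and echo with > is used to create the file so that no interactive editor is needed.

```shell
work="$(mktemp -d)"
cd "$work"
mkdir a b
echo "draft" > a/file.txt        # create a small file without opening an editor

cat a/file.txt                   # prints: draft
cp a/file.txt a/backup.txt       # copy the file to a new location
mv a/file.txt a/notes.txt        # rename in place
mv a/notes.txt b/.               # move into b, keeping the name notes.txt
rm a/backup.txt                  # delete a single file
rmdir a                          # succeeds only because a is now empty
rm -r b                          # deletes b and everything inside it
```

Files removed with rm do not go to a recycle bin, so rm -r deserves particular care.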

Pipes and Filters
  • wc counts lines, words, and characters in its inputs.

  • cat displays the contents of its inputs.

  • sort sorts its inputs.

  • head displays the first lines of its input (10 by default).

  • tail displays the last lines of its input (10 by default).

  • command > [file] redirects a command’s output to a file (overwriting any existing content).

  • command >> [file] appends a command’s output to a file.

  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.

  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
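
A short sketch of redirection and pipelines working together; the file names fruit.txt and last.txt and their contents are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
printf 'pear\napple\nfig\n' > fruit.txt     # > creates or overwrites the file
echo "banana" >> fruit.txt                  # >> appends a fourth line

wc -l fruit.txt                             # line count, followed by the file name
sort fruit.txt | head -n 2                  # prints: apple, then banana
sort fruit.txt | tail -n 1 > last.txt       # pipeline output redirected to a file
cat last.txt                                # prints: pear
```

Each program does one small job; the pipe lets sort feed head or tail without any intermediate file.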

Shell Scripts
  • Save commands in files (usually called shell scripts) for re-use.

  • bash [filename] runs the commands saved in a file.

  • $@ refers to all of a shell script’s command-line arguments.

  • $1, $2, etc., refer to the first command-line argument, the second command-line argument, etc.

  • Place variables in quotes if the values might have spaces in them.

  • Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
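
As an illustration, the sketch below writes a tiny script and runs it on two files, one of which has a space in its name. The script name count_lines.sh and the data files are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
printf 'one\ntwo\n' > "my data.txt"
printf 'three\n' > other.txt

# Save commands in a file; the quoted 'EOF' keeps $1 and $@ literal.
cat > count_lines.sh <<'EOF'
# Usage: bash count_lines.sh file [file ...]
echo "First argument: $1"
wc -l "$@"
EOF

bash count_lines.sh "my data.txt" other.txt
```

Because "$@" is quoted inside the script, "my data.txt" reaches wc as a single argument rather than being split at the space.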

Loops
  • A for loop repeats commands once for every thing in a list.

  • Every for loop needs a variable to refer to the thing it is currently operating on.

  • Use $name to expand a variable (i.e., get its value). ${name} can also be used.

  • Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as they complicate variable expansion.

  • Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

  • Use the up-arrow key to scroll up through previous commands to edit and repeat them.

  • Use Ctrl+R to search through the previously entered commands.

  • Use history to display recent commands, and ![number] to repeat a command by number.
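
A minimal sketch of a for loop over consistently named files; the sample-NN.txt names and the backup- prefix are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
touch sample-01.txt sample-02.txt sample-03.txt

# The wildcard pattern selects the files; name holds one of them per iteration.
for name in sample-*.txt
do
    cp "$name" "backup-${name}"    # ${name} is equivalent to $name here
done

ls                                 # three backup- files plus the three originals
```

The history shortcuts listed above (up-arrow, Ctrl+R, history) only apply to an interactive session, so they are not shown in the sketch.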

Finding Things
  • find finds files with specific properties that match patterns.

  • grep selects lines in files that match patterns.

  • --help is an option supported by many commands and programs that can be run from within Bash; it displays more information on how to use them.

  • man [command] displays the manual page for a given command.

  • $([command]) inserts a command’s output in place.
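
The sketch below shows find, grep and command substitution side by side; the notes directory and its files are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work"
mkdir notes
printf 'alpha\nbeta\n' > notes/a.txt
printf 'gamma\n' > notes/b.txt
touch notes/c.dat

find . -name '*.txt'               # files whose names match the pattern
grep -n 'beta' notes/a.txt         # matching lines with line numbers; prints: 2:beta
wc -l $(find . -name '*.txt')      # $(...) inserts find's output as arguments to wc
```

The unquoted $(find ...) relies on the file names containing no spaces, which holds for this example.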

Additional Exercises
  • date prints the current date in a specified format.

  • Scripts can save the output of a command to a variable using $(command).

  • basename removes directories from a path to a file, leaving only the name.

  • cut lets you select specific columns from files, with -d setting the column separator (e.g. -d ',' for commas) and -f selecting the columns you want.
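
A brief sketch of these commands; the path, the variable name today and the comma-separated sample rows are hypothetical.

```shell
today="$(date +%Y-%m-%d)"                  # save a command's output in a variable
echo "Report generated on $today"

basename /home/user/data/results.csv       # prints: results.csv

# Keep columns 1 and 3 of comma-separated input.
printf 'id,name,score\n1,ana,90\n2,ben,85\n' | cut -d ',' -f 1,3
```

Here cut reads from a pipe; given a file name as a final argument, it reads that file instead.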

Survey
Introduction
  • Good data organisation is the foundation of any research project.

Organising data in spreadsheets
  • Never modify your raw data. Always make a copy before making any changes.

  • Keep track of all of the steps you take to clean your data in a plain text file.

  • Organise your data according to tidy data principles.

  • Record metadata in a separate plain text file (such as README.txt) in your project root folder or folder with data.

Common spreadsheet errors
  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs.

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Do not use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

Dates as data
  • Use extreme caution when working with date data.

  • Splitting dates into their component values can make them easier to handle.

Quality assurance and control
  • Always copy your original spreadsheet file and work with a copy so you do not affect the raw data.

  • Use data validation to prevent accidentally entering invalid data.

  • Use sorting to check for invalid data.

  • Use conditional formatting (cautiously) to check for invalid data.

Exporting data
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.

  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.

Survey
Reference

Glossary

cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
metadata
data which describes other data
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)