Reference – TBC Workshop Title

Key Points

Lesson Schedule
What is Version Control	Version control is like an unlimited ‘undo’. Version control also allows many people to work in parallel.
Setting Up Git	Use `git config` with the `--global` option to configure a user name, email address, editor, and other preferences once per machine. GitHub needs an SSH key to allow access.
Creating a Repository	`git clone` creates a local copy of a repository from a URL. Git stores all of its repository data in the `.git` directory.
Tracking Changes	`git status` shows the status of a repository. Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded). `git add` puts files in the staging area. `git commit` saves the staged content as a new commit in the local repository. Write commit messages that accurately describe your changes. `git log --decorate` lists the commits made to the local repository, along with whether or not they are up-to-date with any remote repository.
Exploring History	`git diff` displays differences between commits. `git checkout` recovers old versions of files.
Remote Repositories	Git can easily synchronise your local repository with a remote one. GitHub needs an SSH key to allow access. Git can resolve ‘conflicting’ modifications to text files.
Branching	Branches are parallel versions of a repository. You can easily switch between branches, and merge their changes. Branches help with code sharing and collaboration.
Ignoring Things	The `.gitignore` file tells Git what files to ignore.
Survey
Reference
Lesson Schedule
Introduction	OpenRefine is a powerful, free, open-source tool for data cleaning. OpenRefine automatically tracks every step you take in working with your data, and leaves your original data intact.
Opening and Exploring Data	Faceting can identify errors or outliers in data.
Transforming Data	Clustering can identify outliers in data and help us fix errors in bulk. GREL (General Refine Expression Language) is a powerful tool for transforming data.
Filtering and Sorting Data	OpenRefine provides various ways to sort and filter data without affecting the raw data.
Exporting Data Cleaning Steps	All changes are being tracked in OpenRefine (apart from individual cell changes and sorting!), and this information can be used for scripts for future analyses or reproducing an analysis. Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.
Exporting and Saving Data	Cleaned data or entire projects can be exported from OpenRefine. Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.
Further Resources on OpenRefine	Other examples and resources online are good for learning more about OpenRefine.
Survey
Reference
Lesson Schedule
Introduction	Well-made software is easier to expand and reuse. You need to produce reproducible research. You are a user of your own code.
Issues	Issues are a way of recording bugs or feature requests. Issues can be categorised by type. Issues can reference other issues, and be referenced by commits.
Project Management	Projects are broken down into self-contained tasks. Tasks are represented as cards on a board. Cards are arranged to show their status. Issues can be added to project boards and labelled. Project boards can show the priority of their tasks. Forks are copies of entire repositories that can be synced up with the original.
Release Management	Releases are stable versions of the code. Zenodo can automatically generate DOIs for releases. Software licenses can restrict what others can do with your code.
Writing Sustainable Code	Always assume that someone else will read your code at a later date, including yourself. Rename variables and functions to add context to make your code more readable. Add comments to explain why something was done in a certain way if not obvious. Don’t add comments that just restate what code clearly already does. Use docstrings at the start of functions and files to explain their behaviour and input/output parameters.
Managing a Mini-Project	Problems with code and documentation can be tracked as issues. Issues can be managed on a project board. Issues can be fixed using the feature-branch workflow. Stable versions of the code can be published as releases.
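The points above on naming, comments and docstrings can be sketched in a few lines. This is only an illustration: Python is assumed as the language, and the function name, its parameter and the rainfall scenario are hypothetical rather than taken from the lesson.

```python
def mean_daily_rainfall(rainfall_mm):
    """Return the mean of a sequence of daily rainfall readings.

    Parameters
    ----------
    rainfall_mm : list of float
        Daily rainfall totals in millimetres.

    Returns
    -------
    float
        The arithmetic mean, or 0.0 for an empty sequence.
    """
    # Return 0.0 instead of raising an error so that callers
    # summarising a period with no readings need no special handling.
    # (This comment explains *why*; the code already shows *what*.)
    if not rainfall_mm:
        return 0.0
    return sum(rainfall_mm) / len(rainfall_mm)


print(mean_daily_rainfall([0.0, 2.5, 5.0]))  # prints 2.5
```

A descriptive name such as `mean_daily_rainfall` carries context that a name like `f` or `calc` would not, and `help(mean_daily_rainfall)` shows the docstring to anyone using the function later.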
Survey
Reference
Lesson Schedule
Python Basics	Start the Python interpreter by typing `python` in the shell. Variables are named memory locations; they are used to access data.
Arrays, Lists etc	A list is an ordered collection of items of any type. Values in the list can be accessed using their index in square brackets, e.g. `my_list[ix]`. Lists can be manipulated in place using their methods, e.g. `my_list.reverse()`. Ranges of values in a list can be obtained via slicing, e.g. `my_list[start:stop]`.
Repeating actions using loops	We can use the `for ... in` syntax to loop over collections or generators.
Processing data files	The Python function `open` lets us read (`'r'`) or write (`'w'`) files by creating a file handle. We can use string operations such as `line.split(',')` to process data in files.
Making choices	We can use logical operations to change the behavior of our code when it meets certain conditions. Using `if`, `elif`, and `else` we can check conditions and add a branch that runs if none of the conditions are met. We can combine conditions using `and` and `or` to make more complicated logical statements.
Modularising your code using functions	A function is created using the `def` keyword. Functions take variables that are specified in the function definition and use the `return` keyword to specify their output. We can use a module to keep our functions separate from the main body of our code to improve code readability.
Handling Errors	Python has built-in error names that hint at the type of problem you are looking for. We can use the traceback to find which bit of the code threw an error.
Command-Line Programs	Python uses the `sys` module to access command-line arguments. `sys.argv` is a list of command-line arguments. Python program output can be used in a pipeline; however, because of the way Python works, we need to use the `signal` module to make sure it handles piping output correctly.
Reading and analysing Patient data using libraries	Python has many libraries that add to the core language to improve functionality in specific use cases. NumPy is a numerical Python library that makes working with vectors, matrices, or large data tables easier. NumPy can load datasets directly from CSV files, bypassing Python's built-in file handling.
Data Visualisation	We can use `matplotlib` to create and manipulate a wide variety of plots in Python. Once a plot has been made we can use matplotlib’s function `savefig` to output it in formats appropriate for publication.
Python Style Guide	PEP 8 provides a guide for styling your Python code.
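Several of the key points above (list indexing and slicing, `for` loops, `open`, `line.split(',')`, `if`/`elif`/`else` and `def`) can be seen together in one short sketch. The data, the file name `data.csv` and the helper `classify` are made up for illustration and do not come from the lesson material.

```python
import os
import tempfile


def classify(value, threshold=10):
    """Label a number relative to a threshold (hypothetical helper)."""
    if value > threshold:
        return "high"
    elif value == threshold:
        return "equal"
    else:
        return "low"


my_list = [3, 10, 25, 7]

first = my_list[0]      # indexing starts at 0
middle = my_list[1:3]   # slice: the items at index 1 and 2
my_list.reverse()       # reverses the list in place

# Loop over the (now reversed) list and classify each value.
labels = []
for value in my_list:
    labels.append(classify(value))

# Write a small comma-separated file, then read it back line by line.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write("alice,42,leeds\n")
    with open(path, "r") as handle:
        for line in handle:
            fields = line.strip().split(",")

print(first, middle, my_list)  # 3 [10, 25] [7, 25, 10, 3]
print(labels)                  # ['low', 'high', 'equal', 'low']
print(fields)                  # ['alice', '42', 'leeds']
```

Using `with` closes each file automatically, which is one common way to avoid leaving file handles open.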
Survey
Challenges
Why Python?
Reference
Lesson Schedule
Day 1: Starting with Data	Although R has a steeper learning curve than some other data analysis software, R has many advantages - R is interdisciplinary, extensible, great for data wrangling and reproducibility, and produces high quality graphics. Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more). Functions automate sets of commands; many are predefined, but it’s also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value. A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters. Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test). Missing data are represented in vectors as NA. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values. CSV files can be read in using read.csv(). Data frames are a data structure for most tabular data, and what we use for statistics and plotting. It is possible to subset dataframes by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers. Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.
Day 2: Manipulating Data	dplyr is a package for making tabular data manipulation easier and tidyr reshapes data so that it is in a convenient format for plotting or analysis. They are both part of the tidyverse package. A subset of columns from a dataframe can be selected using select(). To choose rows based on a specific criterion, use filter(). Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. To create new columns based on the values in existing columns, use mutate(). Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions. Dates can be formatted using the package ‘lubridate’. To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package. Export data from a dataframe to a csv file using write_csv().
Day 3: Visualising Data	ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. Define an aesthetic mapping (using the aes function), by selecting the variables to be plotted and specifying how to present them in the graph. Add ‘geoms’ – graphical representations of the data in the plot using geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot. Faceting splits one plot into multiple plots based on a factor from the dataset. Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort. The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange(). Use ggsave() to save a plot and edit the arguments (height, width, dpi) to change the dimension and resolution.
Survey
Reference
Lesson Schedule
Introducing the Shell	The shell lets you define repeatable workflows. The shell is available on systems where graphical interfaces are not.
Files and Directories	The file system is responsible for managing information on the disk. Information is stored in files, which are stored in directories (folders). Directories can also store other directories, which then form a directory tree. `cd [path]` changes the current working directory. `ls [path]` prints a listing of a specific file or directory; `ls` on its own lists the current working directory. `pwd` prints the user’s current working directory. `/` on its own is the root directory of the whole file system. Most commands take options (flags) that begin with a `-`. A relative path specifies a location starting from the current location. An absolute path specifies a location from the root of the file system. Directory names in a path are separated with `/` on Unix, but `\` on Windows. `.` on its own means ‘the current directory’; `..` means ‘the directory above the current one’. `--help` is an option supported by many Bash commands, and programs that can be run from within Bash, to display more information on how to use these commands or programs. `man [command]` displays the manual page for a given command.
Creating Things	Command line text editors let you edit files in the terminal. You can open up files with either command-line or graphical text editors. `nano [path]` creates a new text file at the location `[path]`, or edits an existing one. `cat [path]` prints the contents of a file. `rmdir [path]` deletes an (empty) directory. `rm [path]` deletes a file, `rm -r [path]` deletes a directory (and contents!). `mv [old_path] [new_path]` moves a file or directory from `[old_path]` to `[new_path]`. `mv` can be used to rename files, e.g. `mv a.txt b.txt`. Using `.` in `mv` can move a file without renaming it, e.g. `mv a/file.txt b/.`. `cp [original_path] [copy_path]` creates a copy of a file at a new location.
Wildcards, Pipes and Filters	`wc` counts lines, words, and characters in its inputs. `*` matches zero or more characters in a filename, so `*.txt` matches all files ending in `.txt`. `?` matches any single character in a filename, so `?.txt` matches `a.txt` but not `any.txt`. `cat` displays the contents of its inputs. `sort` sorts its inputs. `head` displays the first 10 lines of its input. `tail` displays the last 10 lines of its input. `command > [file]` redirects a command’s output to a file (overwriting any existing content). `command >> [file]` appends a command’s output to a file. `[first] | [second]` is a pipeline: the output of the first command is used as the input to the second. The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
Finding Things	`find` finds files with specific properties that match patterns. `grep` selects lines in files that match patterns. `$([command])` inserts a command’s output in place.
Shell Scripts	Save commands in files (usually called shell scripts) for re-use. `bash [filename]` runs the commands saved in a file. `$@` refers to all of a shell script’s command-line arguments. `$1`, `$2`, etc., refer to the first command-line argument, the second command-line argument, etc. Use `Ctrl`+`R` to search through the previously entered commands. Use `history` to display recent commands, and `![number]` to repeat a command by number. Place variables in quotes if the values might have spaces in them. Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
Loops	A `for` loop repeats commands once for every thing in a list. Every `for` loop needs a variable to refer to the thing it is currently operating on. Use `$name` to expand a variable (i.e., get its value). `${name}` can also be used. Do not use spaces, quotes, or wildcard characters such as ‘*’ or ‘?’ in filenames, as it complicates variable expansion. Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
Additional Exercises	`date` prints the current date in a specified format. Scripts can save the output of a command to a variable using `$(command)`. `basename` removes directories from a path to a file, leaving only the name. `cut` lets you select specific columns from files, with `-d','` setting the column separator and `-f` selecting the columns you want.
Survey
Reference
Lesson Schedule
Introduction	Good data organisation is the foundation of any research project.
Organising data in spreadsheets	Never modify your raw data. Always make a copy before making any changes. Keep track of all of the steps you take to clean your data in a plain text file. Organise your data according to tidy data principles. Record metadata in a separate plain text file (such as README.txt) in your project root folder or folder with data.
Common spreadsheet errors	Include only one piece of information in a cell. Avoid using multiple tables or spreading data across multiple tabs within one spreadsheet. Record zeros as zeros. Avoid spaces, numbers and special characters in column headers. Avoid special characters in your data. Use an appropriate null value to record missing data. Record units in column headers. Place comments in a separate column. Do not use formatting to convey information.
Dates as data	Use extreme caution when working with date data. Splitting dates into their component values can make them easier to handle.
Quality assurance and control	Always copy your original spreadsheet file and work with a copy so you do not affect the raw data. Use data validation to prevent accidentally entering invalid data.
Exporting data	Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data. Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
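As a rough sketch of the points above on splitting dates into components, recording units in column headers, and exporting to CSV or TSV, the following uses Python's standard `csv` module. Python is an assumption here (the lesson itself works in a spreadsheet program), and the sample rows, column names and the helper `to_delimited` are hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical observations; the unit is recorded in the key "weight_g".
rows = [
    {"sample": "A1", "collected": date(2021, 3, 9), "weight_g": 21.5},
    {"sample": "A2", "collected": date(2021, 11, 30), "weight_g": 19.0},
]


def to_delimited(rows, delimiter):
    """Return the rows as delimited text, with each date split into parts."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["sample", "year", "month", "day", "weight_g"])
    for row in rows:
        collected = row["collected"]
        writer.writerow([
            row["sample"],
            collected.year,
            collected.month,
            collected.day,
            row["weight_g"],
        ])
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")   # comma separated values
tsv_text = to_delimited(rows, "\t")  # tab separated values
print(csv_text)
```

Storing year, month and day as separate plain numbers sidesteps the ambiguity of formatted dates such as 03/09/21, which different programs may interpret differently.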
Survey
Reference

Glossary

cleaned data: data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting: formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format: a plain text file format in which values are separated by commas
factor: a variable that takes on a limited number of possible values (i.e. categorical data)
metadata: data which describes other data
null value: a value used to record observations missing from a dataset
observation: a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text: unformatted text
quality assurance: any process which checks data for validity during entry
quality control: any process which removes problematic data from a dataset
raw data: data that has not been manipulated and represents actual recorded values
rich text: formatted text (e.g. text that appears bolded, colored or italicized)
string: a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format: a plain text file format in which values are separated by tabs
variable: a category of data being collected on the object being recorded (e.g. a mouse’s weight)