Day 1: Starting with Data
|
R has a steeper learning curve than some other data analysis software, but it has many advantages: it is interdisciplinary, extensible, well suited to data wrangling and reproducible analysis, and produces high-quality graphics.
Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more).
Functions automate sets of commands. Many are predefined, but you can also write your own. Functions usually take one or more inputs (called arguments) and often return a value.
A vector is the most common and basic data structure in R. A vector is composed of a series of values of the same type, such as numbers or characters.
Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test).
Missing data are represented in vectors as NA. Add the argument na.rm = TRUE to functions such as mean() or max() to calculate the result while ignoring the missing values.
CSV files can be read in using read.csv().
Data frames are the standard data structure for tabular data and are what we use for statistics and plotting.
It is possible to subset data frames by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers.
Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.
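The Day 1 points above can be tried out in a short base R session. This is a minimal sketch, not the lesson's own code: the object names (weight_kg, animals, and so on) and the tiny inline dataset are invented for illustration, and the CSV content is passed as text so that no file is needed.

```r
# Assign values to objects, then use them in arithmetic
weight_kg <- 55
weight_lb <- 2.2 * weight_kg

# A user-defined function: one argument in, one value returned
kg_to_lb <- function(kg) {
  2.2 * kg
}

# Vectors: subset by index or with a logical test
weights <- c(50, 60, 65, 82)
first_two <- weights[c(1, 2)]
heavy <- weights[weights > 60]

# Missing data: na.rm = TRUE ignores NA values in the calculation
heights <- c(2, 4, 4, NA, 6)
mean_height <- mean(heights, na.rm = TRUE)

# read.csv() normally takes a file path; here hypothetical CSV content
# is supplied through the text argument to keep the example self-contained
csv_text <- "species,weight\nDM,40\nDO,48\nDM,44\nPF,8"
animals <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Data frame subsetting: rows first, then columns
first_weight <- animals[1, 2]
heavy_species <- animals[animals$weight > 40, "species"]

# Factors: integer codes associated with a fixed set of levels
sex <- factor(c("male", "female", "female", "male"))
sex_levels <- levels(sex)
sex_codes <- as.integer(sex)
```

Printing any of these objects at the console (for example heavy or sex_levels) shows the result of each step.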
|
Day 2: Manipulating Data
|
dplyr is a package that makes tabular data manipulation easier, and tidyr reshapes data into a format convenient for plotting or analysis. Both are part of the tidyverse.
A subset of columns from a data frame can be selected using select().
To choose rows based on a specific criterion, use filter().
Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.
To create new columns based on the values in existing columns, use mutate().
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions.
Dates can be parsed and formatted using the lubridate package.
To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package.
Export data from a data frame to a CSV file using write_csv() from the readr package.
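The Day 2 verbs can be chained into one small pipeline. The sketch below assumes the dplyr, tidyr, lubridate and readr packages are installed. The surveys table and its column names are hypothetical, and the output goes to a temporary directory rather than a real project folder.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)

# Hypothetical survey data, invented for illustration
surveys <- tibble(
  record_id = 1:6,
  date = c("2020-01-15", "2020-01-15", "2020-02-20",
           "2020-02-20", "2021-03-05", "2021-03-05"),
  species = c("DM", "DO", "DM", "DO", "DM", "DO"),
  weight = c(40, 48, 44, NA, 42, 50)
)

# filter() rows, select() columns and mutate() new columns, chained with pipes
surveys_clean <- surveys %>%
  filter(!is.na(weight)) %>%
  select(date, species, weight) %>%
  mutate(weight_kg = weight / 1000,
         survey_year = year(ymd(date)))   # lubridate parses the date text

# Split-apply-combine: group, summarise each group, combine the results
mean_weights <- surveys_clean %>%
  group_by(species, survey_year) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

# Reshape to wide format (one column per species) and back to long format
wide <- mean_weights %>%
  pivot_wider(names_from = species, values_from = mean_weight)
long <- wide %>%
  pivot_longer(cols = -survey_year, names_to = "species",
               values_to = "mean_weight")

# Export the summary to a CSV file in a temporary directory
out_file <- file.path(tempdir(), "mean_weights.csv")
write_csv(mean_weights, out_file)
```

Dropping the missing weight first keeps the NA out of the group means; passing na.rm = TRUE to mean() would be another way to handle it.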
|
Day 3: Visualising Data
|
ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.
Define an aesthetic mapping with the aes() function by selecting the variables to be plotted and specifying how to present them in the graph.
Add ‘geoms’ (graphical representations of the data in the plot): geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot.
Faceting splits one plot into multiple plots based on a factor from the dataset.
Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.
The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange().
Use ggsave() to save a plot, and set its arguments (height, width, dpi) to change the dimensions and resolution.
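The Day 3 steps fit together as shown in the sketch below, which assumes ggplot2 and gridExtra are installed. The plot_data table, its column names and the output file name are all made up for illustration.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical measurements, invented for illustration
plot_data <- data.frame(
  species = rep(c("DM", "DO"), each = 4),
  year = rep(2018:2021, times = 2),
  weight = c(40, 42, 44, 43, 48, 50, 49, 52),
  hindfoot_length = c(35, 36, 36, 37, 35, 36, 37, 37)
)

# Aesthetic mapping plus a point geom gives a scatter plot;
# theme_bw() is one of the pre-loaded themes
scatter <- ggplot(plot_data,
                  aes(x = weight, y = hindfoot_length, color = species)) +
  geom_point() +
  theme_bw()

# A line plot faceted by species, with one element customised via theme()
trend <- ggplot(plot_data, aes(x = year, y = weight)) +
  geom_line() +
  facet_wrap(vars(species)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))

# Combine both plots into a single figure, side by side
combined <- grid.arrange(scatter, trend, ncol = 2)

# Save the combined figure; width and height are in inches by default
out_file <- file.path(tempdir(), "combined_plot.png")
ggsave(filename = out_file, plot = combined, width = 10, height = 5, dpi = 300)
```

Swapping geom_point() for geom_boxplot(), with species on the x axis, would give a boxplot from the same data.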
|