Introductory Data Management with R: Glossary

Key Points

Day 1: Starting with Data
  • Although R has a steeper learning curve than some other data analysis software, it has many advantages: it is interdisciplinary and extensible, great for data wrangling and reproducibility, and produces high-quality graphics.

  • Values can be assigned to objects, which have a number of attributes. Objects can then be used in arithmetic operations (and more).

  • Functions automate sets of commands; many are predefined, but it is also possible to write your own. Functions usually take one or more inputs (called arguments) and often return a value (see the sketches after this list).

  • A vector is the most common and basic data structure in R. A vector is composed of a series of values, which can be either numbers or characters.

  • Vectors can be subset by providing one or several indices in square brackets or by using a logical vector (often the output of a logical test).

  • Missing data are represented in vectors as NA. Many functions take the argument na.rm = TRUE to calculate the result while ignoring the missing values.

  • CSV files can be read in using read.csv().

  • Data frames are a data structure for most tabular data, and what we use for statistics and plotting.

  • It is possible to subset data frames by specifying the coordinates in square brackets. Row numbers come first, followed by column numbers.

  • Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. Factors can only contain a pre-defined set of values, known as levels.
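
A minimal sketch of assignment, arithmetic, and a user-defined function; the object and function names here are illustrative, not taken from the lesson data:

```r
# Assign values to objects, then use them in arithmetic
weight_kg <- 55
weight_lb <- weight_kg * 2.2

# Functions take arguments and usually return a value
round(3.14159, digits = 2)    # 3.14

# It is also possible to write your own function
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}
fahrenheit_to_celsius(212)    # 100
```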
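
A sketch of vector creation, subsetting, and missing values; the values are made up for illustration:

```r
# A numeric vector with one missing value
weights <- c(50, 60, NA, 65, 82)

# Subset with indices, or with a logical test
weights[c(1, 3)]          # first and third values
weights[weights > 60]     # values greater than 60 (the NA is carried along)

# NA marks missing data; na.rm = TRUE ignores it in the calculation
mean(weights)                # NA
mean(weights, na.rm = TRUE)  # 64.25
```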
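
Reading a CSV file into a data frame and subsetting it by coordinates; the file path and the column name are illustrative, assuming a file laid out like the lesson's survey data:

```r
# Read a CSV file into a data frame (illustrative path)
surveys <- read.csv("data/surveys.csv")

# Subset with [row, column]: rows come first, then columns
surveys[1, 2]         # first row, second column
surveys[1:3, ]        # first three rows, all columns
surveys[, "weight"]   # every row of the column named "weight"
```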
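
Factors in a few lines, showing levels and an ordered factor; the values are invented for illustration:

```r
# A factor stores categorical values as integers with labels (levels)
sex <- factor(c("male", "female", "female", "male"))
levels(sex)    # "female" "male"
nlevels(sex)   # 2

# Levels can be given an explicit order
sizes <- factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
```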

Day 2: Manipulating Data
  • dplyr is a package that makes tabular data manipulation easier, and tidyr reshapes data so that it is in a convenient format for plotting or analysis. Both are part of the tidyverse collection of packages. Short sketches of the functions below follow this list.

  • A subset of columns from a dataframe can be selected using select().

  • To choose rows based on a specific criterion, use filter().

  • Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset.

  • To create new columns based on the values in existing columns, use mutate().

  • Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. This can be achieved using the group_by() and summarize() functions.

  • Dates can be parsed and formatted using the lubridate package.

  • To reshape data between wide and long formats, use pivot_wider() and pivot_longer() from the tidyr package.

  • Export data from a data frame to a CSV file using write_csv().
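
A sketch of the core dplyr verbs chained with the pipe; the small surveys data frame and its column names are made up for illustration, not the lesson dataset:

```r
library(dplyr)

# A small illustrative data frame
surveys <- data.frame(
  species_id = c("DM", "NL", "DM", "PF"),
  sex        = c("F", "M", "F", "M"),
  weight     = c(34, 120, 36, 7)
)

# filter() chooses rows, select() chooses columns, mutate() adds columns,
# and the pipe (%>%) sends the output of one step into the next
surveys_small <- surveys %>%
  filter(weight < 40) %>%
  mutate(weight_kg = weight / 1000) %>%
  select(species_id, weight_kg)
```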
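
Split-apply-combine with group_by() and summarize(), continuing with the illustrative surveys data frame from the previous sketch:

```r
# Split by sex, compute a mean within each group, combine the results
surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
```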
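
A sketch of date handling with lubridate; the dates themselves are arbitrary:

```r
library(lubridate)

# ymd() parses character dates into Date objects
my_date <- ymd("2015-01-01")

# Dates can also be built from separate year, month, and day pieces
ymd(paste("2015", "1", "26", sep = "-"))
```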
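
Reshaping between long and wide formats with tidyr; the small table is invented for illustration:

```r
library(tidyr)
library(dplyr)

# A small long-format table
long <- data.frame(
  plot_id = c(1, 1, 2, 2),
  genus   = c("Dipodomys", "Neotoma", "Dipodomys", "Neotoma"),
  count   = c(10, 3, 7, 5)
)

# Long -> wide: one column per genus
wide <- long %>%
  pivot_wider(names_from = genus, values_from = count)

# Wide -> long again
wide %>%
  pivot_longer(cols = -plot_id, names_to = "genus", values_to = "count")
```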
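
Exporting a data frame with write_csv() from the readr package, reusing surveys_small from the first sketch; the output path is illustrative and the folder must already exist:

```r
library(readr)

# Write the filtered data frame to a CSV file
write_csv(surveys_small, file = "data_output/surveys_small.csv")
```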

Day 3: Visualising Data
  • ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.

  • Define an aesthetic mapping (using the aes() function) by selecting the variables to be plotted and specifying how to present them in the graph (see the sketches after this list).

  • Add ‘geoms’, graphical representations of the data in the plot: use geom_point() for a scatter plot, geom_boxplot() for a boxplot, and geom_line() for a line plot.

  • Faceting splits one plot into multiple plots based on a factor from the dataset.

  • Every single component of a ggplot graph can be customized using the generic theme() function. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.

  • The gridExtra package allows us to combine separate ggplots into a single figure using grid.arrange().

  • Use ggsave() to save a plot; set the arguments height, width, and dpi to control its dimensions and resolution.
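
A sketch of building plots with ggplot2; the small surveys data frame and its columns are invented for illustration, not the lesson dataset:

```r
library(ggplot2)

# Illustrative data
surveys <- data.frame(
  weight          = c(34, 120, 36, 7, 48),
  hindfoot_length = c(32, 31, 34, 22, 30),
  sex             = c("F", "M", "F", "M", "F")
)

# aes() maps variables to aesthetics; geoms draw the data
ggplot(surveys, aes(x = weight, y = hindfoot_length)) +
  geom_point()

# The same data with a different geom gives a different plot type
ggplot(surveys, aes(x = sex, y = weight)) +
  geom_boxplot()
```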
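
Faceting and themes, continuing with the illustrative surveys data frame above:

```r
# facet_wrap() splits the plot into one panel per value of sex;
# theme_bw() is one of the pre-loaded themes
ggplot(surveys, aes(x = weight, y = hindfoot_length)) +
  geom_point() +
  facet_wrap(~ sex) +
  theme_bw()
```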
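
Combining plots with gridExtra and saving the result with ggsave(), again using the illustrative surveys data frame; the output path is illustrative and the folder must already exist:

```r
library(gridExtra)

p1 <- ggplot(surveys, aes(x = weight, y = hindfoot_length)) + geom_point()
p2 <- ggplot(surveys, aes(x = sex, y = weight)) + geom_boxplot()

# grid.arrange() draws the two plots side by side and returns the combined figure
combined <- grid.arrange(p1, p2, ncol = 2)

# ggsave() writes a plot to file; width, height, and dpi control size and resolution
ggsave("fig_output/combined_plot.png", combined,
       width = 10, height = 5, dpi = 300)
```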

Survey

Glossary

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item, allowing other pages to hyperlink to specific glossary entries.) This renders as:

key word 1
explanation 1
key word 2
explanation 2