Data manipulation, pre-processing and cleaning might not be the most glamourous component of data analysis, but is nonetheless essential to any statistical (or machine learning!) analysis. Being able to efficiently and neatly process data is thus an important topic, for which we try here to provide a few basic elements — focusing on its implementation with R.
We will focus here on:
The package purrr
is part of the tidyverse
suit (and is consequently automatically downloaded by calling the tidyverse
). purrr
handles tasks similar to
ones performed by apply-family functions in base R through a family of “map” functions (see details below).
It enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors.
map
-functions allow you to replace many for loops with code that is easier
to read. purrr
functions are implemented in C, which can also make it more efficient that manual loops over the data.
For instance:
map()
, map_if()
, map_at()
returns a listmap_lgl()
returns a logical vector,map_int()
returns a integer vector,map_dbl()
returns a double vector,map_chr()
returns a character vector,map_dfr()
, map_dfc()
returns a data.frame
by binding rows or columns respectively.Example: column-wise mean
df <- tibble(a=rnorm(10), b=rnorm(10), c=rnorm(10), d=rnorm(10))
map_dbl(df, mean) # or equivalently: df %>% map_dbl(mean)
The focus here is on the operation being performed, not the book-keeping. The second argument, .f
, can be a functions, a formula,
a character vector, or an integer vector.
map(1:3, ~ rnorm(7, .x))
map
can pass additional parameters to the function:
map_dbl(df, mean, trim = 0.25)
mtcars %>%
split(.$cyl)
mtcars %>%
split(.$cyl) %>%
map_df(dim)
purrr
does some of the same than the apply
family. However, purrr
is more consistent, so you should learn it.
A quick reference of similar base R functions:
lapply is basically identical to map.
sapply is a wrapper around lapply and it tries to simplify the output. Downside: you never know what you’ll get.
vapply: like sapply, but you can supply an additional argument that defines the type.
You can learn more about purrr
here: (http://r4ds.had.co.nz/iteration.html).
There are two main types of missingness: explicit and implicit, as showcased in the following example.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
The way that a dataset is represented can make implicit values explicit.
stocks %>% spread(year, return)
Recall the functions we learned from tidyr
package.
You can used spread()
and gather()
to retain only non-missing recored,
i.e. to turn all explicit missing values into implicit ones.
stocks %>% spread(year, return) %>%
gather(year, return, `2015`:`2016`, na.rm = TRUE)
complete()
takes a set of columns, and finds all unique combinations.
It then ensures the original dataset contains all those values, filling
in explicit NA
s where necessary.
stocks %>% complete(year, qtr)
NA
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
# tribble() constructs a tibble by filling by rows
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
You can fill in these missing values with fill()
treatment %>% fill(person)
Rarely does a data analysis involve only a single table of data.
Collectively, multiple tables of data are called relational data because the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables.
All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.
Example: The nycflights13
package contains a collection of related datasets.
library(nycflights13)
Source: (http://r4ds.had.co.nz/relational-data.html)
A key is a variable (or set of variables) that uniquely identifies an observation.
For example, each plane is uniquely determined by its tailnum, but an
observation in ‘weather’ is identified by five variables: year, month, day
,
hour, and origin
Keys can be used to connect each pair of tables together.
There are two types of keys:
Primary: identifies an observation in its own table. Example:
planes$tailnum
Foreign: identifies an observation in another table. Example:
flights$tailnum
, this is because tailnum
does not enough to identify
a record in flights
dataset.
A variable can be both a primary key and a foreign key.
It’s good practice to verify that chosen keys do indeed uniquely identify each observation.
One way to do that is to count()
the primary keys and look for entries where
n
is greater than one:
planes %>%
count(tailnum) %>%
filter(n > 1)
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
Sometimes a table doesn’t have an explicit primary key, e.g.
in flights
dataset each row is an observation, but no combination of
variables reliably identifies it, (even the flight numbers).
In this case, you can add an extra identifier
column:
flights %>%
count(flight) %>%
filter(n > 1)
or
flights %>%
mutate(flight_id= paste0("F", row_number())) %>%
select(flight_id, year:flight)
There are three families of functions designed to merge relational data:
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations, which treat observations as if they were set elements.
A mutating join allows you to combine variables from two tables, by matching observations by their keys, and then copying across variables from one table to the other. e.g.
flights %>%
select(year:day, hour, origin, dest, tailnum, carrier) %>%
left_join(airlines, by = "carrier")
There are four mutating join functions:
inner_join()
outer joins;
left_join()
right_join()
full_join()
A simple example:
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x %>% inner_join(y, by = "key")
Source: (http://r4ds.had.co.nz/relational-data.html)
What happens when there are duplicate keys?
Source: http://r4ds.had.co.nz/relational-data.html
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
There are two types:
semi_join(x, y)
keeps all observations in x that have a match in y.anti_join(x, y)
drops all observations in x that have a match in y.In filtering joins, only the existence of a match is important.
It doesn’t matter which observation is matched.
Filtering joins never duplicate rows like mutating joins do:
Set operations apply to rows; they expect the x
and y
inputs to have
the same variables, and treat the observations like sets.
intersect(x, y)
: returns only observations in both x and y.
union(x, y)
: returns unique observations in x and y.
setdiff(x, y)
: returns observations in x, but not in y.
All these operations work with a complete row, comparing the values of every
variable.
Examples:
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
After working with a dataset and doing all data manipulation, you might want to save your new data table.
Recall the readr
package. Besides functions for reading data in, readr
has utilities for saving your data to a text file:
write_tsv(mydata, "path/to/filename.tsv") # tab-delimited
write_csv(mydata, "path/to/filename.csv") # comma-delimited
write_delim(mydata, "path/to/filename.csv", delim = " ") # general delimiter
To save your data in other types of files, you need to install and use other packages:
xlsx
package, and follow
this guide.# install.packages(xlsx)
library(xlsx)
write.xlsx(mydata, "path/to/filename.xlsx")
haven
package.# install.packages(haven)
library(haven)
read_sas("mtcars.sas7bdat")
write_sas(mtcars, "mtcars.sas7bdat")
You can also choose to save all objects currently in the workspace
(variables, functions, etc.) into a file e.g. filename.rda
.
The file filename.rda
can be the easily loaded next time you work with R.
You can also save a single object or a subset of specified objects currently in the workspace.
# save the workspace to file
save.image(file = "path/to/filename.rda")
# save specific objects to a file
save(object_list, file = "path/to/filename.rda")
# save just a single object
saveRDS(object, file = "path/to/filename.rds")
# load a workspace into the current session
load("path/to/filename.rda")
# read just the previously saved 1 object
object <- readRDS("path/to/filename.rds")
Based on the R-bootcamp class by Lan Huong Nguyen (Stanford), edited by Claire Donnat.