Exploratory Data Analysis

Based on CME/STATS 195 course at Stanford University by Lan Huong Nguyen

Exploratory data analysis

What is exploratory data analysis (EDA)?

There are no routine statistical questions, only questionable statistical routines. — Sir David Cox

EDA is an iterative process:

Generate questions about your data
Search for answers by visualising, transforming, and modelling data

Use what you learn to refine your questions or generate new ones.

Asking questions

Your goal during EDA is to develop an understanding of your data.

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.¹

Two types of questions will always be useful for making discoveries within your data:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

Some comments about EDA:

It is not a formal process with a strict set of rules.
Explore many ideas: some will pan out, others will be dead ends.
Even if questions are predefined, quality of data still needs to be assessed

Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, which can reveal interesting information.²

Recall the diamonds dataset. Use a bar chart, to examine the distribution of a categorical variable, and a histogram that of a continuous one.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Identifying typical values

Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

diamonds %>% filter(carat < 3) %>%
  ggplot(aes(x = carat)) + geom_histogram(binwidth = 0.01)

Look for anything unexpected!

Identify outliers

Outliers are observations that are unusual – data points that don’t seem to fit the general pattern.

Sometimes outliers are data entry errors; other times outliers suggest important new science.

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

Identifying outliers

Now that we have seen the usual values, we can try to understand them.

diamonds %>% filter(y < 3 | y > 20) %>% 
  select(price, carat, x, y, z) %>% arrange(y)

## # A tibble: 9 x 5
##   price carat     x     y     z
##   <int> <dbl> <dbl> <dbl> <dbl>
## 1  5139  1     0      0    0   
## 2  6381  1.14  0      0    0   
## 3 12800  1.56  0      0    0   
## 4 15686  1.2   0      0    0   
## 5 18034  2.25  0      0    0   
## 6  2130  0.71  0      0    0   
## 7  2130  0.71  0      0    0   
## 8  2075  0.51  5.15  31.8  5.12
## 9 12210  2     8.09  58.9  8.06

The y variable measures the length (in mm) of one of the three dimensions of a diamond.

Therefore, these must be entry errors! Why?

It’s good practice to repeat your analysis with and without the outliers.

Covariation

Covariation is the tendency for the values of two or more variables to vary together in a related way.

ggplot(data = diamonds) + 
  geom_point(aes(x=carat, y=price), alpha=0.1)

Boxplots

Boxplot are used to display visual shorthand for a distribution of a continuous variable broken down by categories.

They mark the distribution’s quartiles.

A categorical and a continuous variable

Use a boxplot or a violin plot to display the covariation between a categorical and a continuous variable.

Violin plots give more information, as they show the entrire estimated distribution.

ggplot(mpg, aes(
  x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() + coord_flip()

ggplot(mpg, aes(
  x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_violin() + coord_flip()

Two categorical variables

To visualise the covariation between categorical variables, you need to count the number of observations for each combination, e.g. using geom_count():

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Another approach is to first, compute the count and then visualise it by coloring with geom_tile() and the fill aesthetic:

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n)) +
  scale_fill_viridis()

Two continuous variables

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) + 
  scale_y_log10() + scale_x_log10()

# install.packages("hexbin")
ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price)) +
  scale_y_log10() + scale_x_log10()

Interactive graphics

The `plotly` package

plotly is a package for visualization and a collaboration platform for data science
Available in R, python, MATLAB, scala.
You can produce interactive graphics including 3D plots (with zooming and rotating).
You can open a ‘plotly’ account to upload ‘plotly’ graphs and view or modify them in a web browser.
Resources: cheatsheet, book

`plotly` integration with `ggplot2`

library(plotly); library(tidyverse) # or library(ggplot2); library(dplyr)
plt <- ggplot(diamonds %>% sample_n(1000), aes(x = carat, y = price)) + 
  geom_point(aes(color = cut))
ggplotly(plt)

plt <- ggplot(diamonds %>% sample_n(1000), aes(x = carat, y = price)) + 
  geom_text(aes(label = clarity), size = 4) + 
  geom_smooth(aes(color = cut, fill = cut)) +
  facet_wrap(~cut)
ggplotly(plt)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3D Scatter plots

theta <- seq(0, 10, 0.2); 
df <- data.frame(U = theta, V =  cos(theta), W = sin(theta)*theta)
plot_ly(df, x = ~V, y = ~W, z = ~U, type = "scatter3d", mode = "markers",
        marker = list(size = 3))

df$cols <- rep_len(c("orange", "blue", "green"), length.out = length(theta))
(plt <- plot_ly(df, x = ~V, y = ~W, z = ~U, color = ~cols,
        type = "scatter3d", mode = "markers+lines",
        marker = list(size = 5), line = list(width = 5)))

Adding layers

dbl.helix <- data.frame(t = rep(seq(0, 2*pi, length.out = 100), 3)) %>%
  mutate(x1 = sin(t), y1 = cos(t), z = (1:length(t))/10,
         x2 = sin(t + pi/2), y2 = cos(t + pi/2))
plot_ly(dbl.helix, x = ~x1, y = ~y1, z = ~z, type = "scatter3d", mode = "lines", 
        color = "green", colors = c("green", "purple"), line = list(width = 5)) %>% 
  add_trace(x = ~x2, y = ~y2, z = ~z+0.2, color = "purple")

Volcano dataset

volcano - a built-in dataset storing topographic information for Maunga Whau (Mt Eden), one of 50 volcanos in Auckland, New Zealand.
It consist of a 87 x 61 matrix with entries corresponding to the mountain’s atlitutes [m] on a 10m by 10m grid.
rows run east to west, and columns south to north

dim(volcano)

## [1] 87 61

volcano[1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100  100  101  101  101
## [2,]  101  101  102  102  102
## [3,]  102  102  103  103  103
## [4,]  103  103  104  104  104
## [5,]  104  104  105  105  105

2D contour plots

plot_ly(z = volcano) %>% add_contour()

3D surface plots

plot_ly(z = volcano) %>% add_surface()