There are no routine statistical questions, only questionable statistical routines. — Sir David Cox
EDA is an iterative process:
Use what you learn to refine your questions or generate new ones.
Your goal during EDA is to develop an understanding of your data.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.
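Two types of questions will always be useful for making discoveries within your data: what type of variation occurs within your variables, and what type of covariation occurs between them?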
Some comments about EDA:
Variation is the tendency of the values of a variable to change from measurement to measurement. Every variable has its own pattern of variation, which can reveal interesting information.
Recall the diamonds dataset. Use a bar chart to examine the distribution of a categorical variable, and a histogram to examine that of a continuous one.
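As a minimal sketch of both plot types, the code below draws a bar chart of cut and a histogram of carat. The choice of these two variables and the bin width of 0.5 are assumptions made for illustration; the object names p_bar and p_hist are hypothetical.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Bar chart: distribution of a categorical variable (cut)
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Histogram: distribution of a continuous variable (carat)
# binwidth = 0.5 is an arbitrary choice for illustration
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

print(p_bar)
print(p_hist)
```

Trying several bin widths is usually worthwhile, since different widths can reveal different patterns.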
Look for anything unexpected!
Outliers are observations that are unusual – data points that don’t seem to fit the general pattern.
Sometimes outliers are data entry errors; other times outliers suggest important new science.
Now that we have seen the unusual values, we can try to understand them.
## # A tibble: 9 x 5
##   price carat     x     y     z
##   <int> <dbl> <dbl> <dbl> <dbl>
## 1  5139  1     0      0    0
## 2  6381  1.14  0      0    0
## 3 12800  1.56  0      0    0
## 4 15686  1.2   0      0    0
## 5 18034  2.25  0      0    0
## 6  2130  0.71  0      0    0
## 7  2130  0.71  0      0    0
## 8  2075  0.51  5.15  31.8  5.12
## 9 12210  2     8.09  58.9  8.06
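A table like the one above can be obtained by filtering on y. The sketch below is one way to do it; the cut-offs (y below 3 or above 20) are an assumption, since the text does not state which thresholds were used, and the name unusual is hypothetical.

```r
library(dplyr)    # filter(), select(), arrange() and the %>% pipe
library(ggplot2)  # provides the diamonds dataset

# Keep only diamonds whose y dimension looks implausible.
# The thresholds 3 and 20 are assumed for illustration.
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, carat, x, y, z) %>%
  arrange(y)

unusual
```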
The y variable measures the length (in mm) of one of the three dimensions of a diamond.
Therefore, these must be entry errors! Why?
It’s good practice to repeat your analysis with and without the outliers.
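One way to compare results with and without the suspect values is to replace them with NA instead of dropping whole rows. This is a sketch only: the thresholds (3 and 20) and the name diamonds2 are assumptions made for illustration.

```r
library(dplyr)    # mutate() and the %>% pipe
library(ggplot2)  # provides the diamonds dataset

# Replace implausible y values with NA; all other columns are kept
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Repeat the same summary with and without the suspect values
summary(diamonds$y)
summary(diamonds2$y)
```

If the two summaries barely differ, the outliers have little effect on the analysis; if they differ a lot, the outliers should not be dropped without justification.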
Covariation is the tendency for the values of two or more variables to vary together in a related way.
Boxplots are a visual shorthand for the distribution of a continuous variable broken down by categories.
They mark the distribution’s quartiles.
Use a boxplot or a violin plot to display the covariation between a categorical and a continuous variable.
Violin plots give more information, as they show the entire estimated distribution.
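The two plot types can be compared side by side on the same pair of variables. The sketch below uses cut (categorical) and price (continuous); this pairing is an assumption chosen for illustration, and the names p_box and p_violin are hypothetical.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Boxplot: median, quartiles and outlying points of price within each cut
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Violin plot: estimated density of price within each cut
p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin()

print(p_box)
print(p_violin)
```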
To visualise the covariation between categorical variables, you need to count the number of observations for each combination, e.g. using geom_count():
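A minimal sketch with geom_count(), assuming cut and color as the two categorical variables (the text does not say which pair was plotted); the name p_count is hypothetical.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# The size of each point reflects how many diamonds share that cut/color pair
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

print(p_count)
```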
Another approach is to first compute the counts and then visualise them with geom_tile() and the fill aesthetic:
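The count-then-tile approach might look like the sketch below. It again assumes color and cut as the two variables; the names cut_color and p_tile are hypothetical.

```r
library(dplyr)    # count() and the %>% pipe
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Step 1: count the observations for each color/cut combination
cut_color <- diamonds %>%
  count(color, cut)

# Step 2: draw one tile per combination, shaded by its count n
p_tile <- ggplot(cut_color, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

print(p_tile)
```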
plotly package
plotly is a package for visualization and a collaboration platform for data science.
plotly integration with ggplot2
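library(dplyr)    # sample_n() and the %>% pipe
library(ggplot2)  # diamonds data and ggplot()
library(plotly)   # ggplotly()
# Plot a random sample of 1000 diamonds, labelled by clarity, one panel per cut
plt <- ggplot(diamonds %>% sample_n(1000), aes(x = carat, y = price)) +
  geom_text(aes(label = clarity), size = 4) +
  geom_smooth(aes(color = cut, fill = cut)) +
  facet_wrap(~cut)
ggplotly(plt)  # convert the static ggplot into an interactive plotly figure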
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Two intertwined helices: strand 2 is strand 1 shifted by a quarter turn
dbl.helix <- data.frame(t = rep(seq(0, 2*pi, length.out = 100), 3)) %>%
  mutate(x1 = sin(t), y1 = cos(t), z = seq_along(t)/10,
         x2 = sin(t + pi/2), y2 = cos(t + pi/2))
# Draw the first strand as a 3D line, then add the second strand as another trace
plot_ly(dbl.helix, x = ~x1, y = ~y1, z = ~z, type = "scatter3d", mode = "lines",
        color = "green", colors = c("green", "purple"), line = list(width = 5)) %>%
  add_trace(x = ~x2, y = ~y2, z = ~z+0.2, color = "purple")
volcano - a built-in dataset storing topographic information for Maunga Whau (Mt Eden), one of about 50 volcanos in Auckland, New Zealand.
## [1] 87 61
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  100  100  101  101  101
## [2,]  101  101  102  102  102
## [3,]  102  102  103  103  103
## [4,]  103  103  104  104  104
## [5,]  104  104  105  105  105
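The output above is consistent with inspecting the matrix's dimensions and its top-left corner. The calls below are a guess at what produced it, not necessarily the exact code used; the names vol_dim and vol_corner are hypothetical.

```r
# volcano ships with base R (package datasets), so no library() call is needed

# Number of rows and columns of the elevation matrix
vol_dim <- dim(volcano)
vol_dim

# Top-left 5 x 5 block of elevation values
vol_corner <- volcano[1:5, 1:5]
vol_corner
```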