P1D2: Visualization and the Tidyverse
Above all else, show the data Tufte
Visualization facilitates every step of data science
“The simple graph has brought more information to the data analyst’s mind than any other device.” John Tukey
“The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey
The grammar of graphics
R for data science
A powerful programming language is more than just a means for instructing a computer to perform tasks. The language also serves as a framework within which we organize our ideas about processes. Thus, when we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Every powerful language has three mechanisms for accomplishing this:
- primitive expressions, which represent the simplest entities the language is concerned with,
- means of combination, by which compound elements are built from simpler ones, and
- means of abstraction, by which compound elements can be named and manipulated as units. Harold Abelson, Structure and Interpretation of Computer Programs
The pipe %>%
You can read it as a series of imperative statements: group, then summarize, then filter. As the reading suggests, a good way to pronounce %>% when reading code is “then”.
- Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.
- You can use the pipe to rewrite multiple operations so that you can read left-to-right, top-to-bottom.
- We’ll use piping frequently from now on because it considerably improves the readability of code.
You can read more about pipes in R for Data Science.
Which mechanism is the %>%
?
dplyr
Which mechanism is the library(dplyr)
?
The standard table mungers
filter()
- filter your data to a smaller set of important rows.arrange()
- Organize the row order of my dataselect()
- select specific columns to keep or removemutate()
- add new mutated (changed) variables as columns to my data.
The new table creators
group_by()
- divide your data into groups. Often used withsummarise()
summarise()
- build summaries of the columns specified
Practice reading code
With your table, write this code out in an English paragraph.
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
Practice using dplyr
Use filter()
, arrange()
, select()
, mutate()
, group_by()
, and summarise()
. With library(tidyverse)
tackle the following challenges.
- Arrange the
iris
data bySepal.Length
and display the first six rows.- Select the
Species
andPetal.Width
columns and put them into a new data set calledtestdat
.- Create a new table that has the mean for each variable by Species.
- Read about the
?summarise_all()
function and get a new table with the means and standard deviations for each Species.- Look at the examples in the
summarise_all()
help file and see if you can find other use cases for thesummarise_
ormutate_
functions.
The grammar of ggplot2
ggplot2 and iris data
Use the iris
data to show a faceted visualization with a color
, shape
, and size
layer or geometry.
Debugging code and asking questions
What is wrong?
After the first line, look at each command and find the error.
library(tidyverse)
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
data.frame(1:10,10:1,)
What is R telling me?
Run each line above, and let’s discuss what the error tells us.
Asking questions!
There are two ways to write error-free programs; only the third one works. Alan Perlis
We need to find the balance between learning and wasting time. Learning when to ask questions is an essential tool in this process. If you ask someone to solve your problem too early, you don’t understand. However, if you spend hours spinning your wheels, it may be worse.