P1D2: Visualization and the Tidyverse

“Above all else, show the data.” Edward Tufte

Visualization facilitates every step of data science

“The simple graph has brought more information to the data analyst’s mind than any other device.” John Tukey

“The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey

The grammar of graphics

Introduction to the grammar

R for data science

A powerful programming language is more than just a means for instructing a computer to perform tasks. The language also serves as a framework within which we organize our ideas about processes. Thus, when we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Every powerful language has three mechanisms for accomplishing this:

primitive expressions, which represent the simplest entities the language is concerned with,

means of combination, by which compound elements are built from simpler ones, and

means of abstraction, by which compound elements can be named and manipulated as units. Harold Abelson, Structure and Interpretation of Computer Programs

The pipe `%>%`

You can read a piped sequence as a series of imperative statements, for example: group, then summarize, then filter. As the reading suggests, a good way to pronounce %>% when reading code is “then”.

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on.
You can use the pipe to rewrite multiple operations so that you can read left-to-right, top-to-bottom.
We’ll use piping frequently from now on because it considerably improves the readability of code.
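
The rewriting rule for the pipe can be checked directly. The following is a minimal sketch, assuming the magrittr package is installed (%>% is also available after library(tidyverse)); the vector x and the choice of round() and paste() are only illustrative.

```r
library(magrittr)  # provides %>%

x <- c(1.234, 5.678)

# x %>% f(y) is rewritten to f(x, y)
piped  <- x %>% round(1)
nested <- round(x, 1)
identical(piped, nested)

# x %>% f(y) %>% g(z) is rewritten to g(f(x, y), z)
piped2  <- x %>% round(1) %>% paste("cm")
nested2 <- paste(round(x, 1), "cm")
identical(piped2, nested2)
```

The piped versions read left to right in the order the steps happen, while the nested versions have to be read from the inside out.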

You can read more about pipes in R for Data Science.

Which of the three mechanisms is %>%?

dplyr

Which of the three mechanisms is library(dplyr)?

The standard table mungers

filter() - keep only the rows of your data that meet a condition.

arrange() - reorder the rows of your data.

select() - choose specific columns to keep or remove.

mutate() - add new columns to your data, computed from existing ones.
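
The four verbs can be chained in one pipeline. This is a small sketch, assuming dplyr is installed and using the built-in mtcars data; the column name kpl and the rounded conversion factor are illustrative choices, not part of the course material.

```r
library(dplyr)

# mtcars ships with R, so no extra data package is needed.
efficient <- mtcars %>%
  filter(cyl == 4) %>%            # keep only four-cylinder cars
  mutate(kpl = mpg * 0.425) %>%   # new column: approximate km per litre
  select(mpg, kpl, wt) %>%        # keep three columns
  arrange(desc(mpg))              # sort rows, highest mpg first

head(efficient)
```

Each verb takes a table and returns a table with the same kind of rows, which is why they combine so easily with the pipe.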

The new table creators

group_by() - divide your data into groups. Often used with summarise().

summarise() - collapse each group into a single row of summary values for the columns you specify.
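
Unlike the table mungers, these two verbs produce a table with a different shape: one row per group. A minimal sketch, again assuming dplyr and the built-in mtcars data; the names by_cyl, count, and mean_mpg are made up for the illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                       # one group per cylinder count
  summarise(
    count = n(),                          # number of rows in each group
    mean_mpg = mean(mpg, na.rm = TRUE)    # average mileage per group
  )

by_cyl
```

The 32 rows of mtcars collapse to one row for each distinct value of cyl.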

Practice reading code

With the people at your table, write out in an English paragraph what this code does.

library(tidyverse)
library(nycflights13)  # provides the flights data

delays <- flights %>%
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

Practice using dplyr

With library(tidyverse) loaded, use filter(), arrange(), select(), mutate(), group_by(), and summarise() to tackle the following challenges.

Arrange the iris data by Sepal.Length and display the first six rows.

Select the Species and Petal.Width columns and put them into a new data set called testdat.

Create a new table that has the mean for each variable by Species.

Read the help page for summarise_all() (run ?summarise_all) and build a new table with the means and standard deviations for each Species.

Look at the examples in the summarise_all() help file and see if you can find other use cases for the summarise_ or mutate_ functions.
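
In dplyr 1.0.0 and later, the scoped summarise_ and mutate_ functions are superseded by across(). The following is a sketch of the newer style, assuming a recent dplyr; it uses the built-in mtcars data rather than iris so the challenges above stay open, and the name stats is only illustrative.

```r
library(dplyr)

stats <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # apply both functions to each selected column;
    # output columns are named <column>_<function>
    across(c(mpg, hp), list(mean = mean, sd = sd))
  )

stats
```

The older summarise_all() still works, so either form is acceptable for the exercise.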

The grammar of ggplot2

Introduction to ggplot2

ggplot2 and iris data

Use the iris data to build a faceted visualization in which at least one geometry layer maps variables to color, shape, and size.
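
As a reminder of how the pieces of the grammar fit together, here is a minimal sketch on a different data set. It assumes ggplot2 is installed and uses its bundled mpg data; the particular variables mapped here are an arbitrary choice and not the answer to the iris task.

```r
library(ggplot2)

# data + aesthetic mapping + geometry layer + facets
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # color varies with a variable
  facet_wrap(~ drv)                           # one panel per drive type

p
```

Aesthetics placed inside aes() vary with the data; adding shape or size follows the same pattern as color.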

Debugging code and asking questions

What is wrong?

After the first line, look at each command and find the error.

library(tidyverse)

ggplot(dota = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
data.frame(1:10,10:1,)

What is R telling me?

Run each line above, and let’s discuss what each error message tells us.
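
When an error scrolls past too quickly, it can help to capture the message as text and read it slowly. This is a small base R sketch using tryCatch(); the failing call log("ten") is a made-up example, deliberately unrelated to the commands above.

```r
# Capture the error message instead of stopping the script.
msg <- tryCatch(
  log("ten"),                               # fails: log() needs a number
  error = function(e) conditionMessage(e)   # return the message as a string
)

msg
```

Reading the message word by word usually points to either the function name or the argument that R could not make sense of.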

Asking questions!

There are two ways to write error-free programs; only the third one works. Alan Perlis

We need to find the balance between learning and wasting time. Learning when to ask questions is an essential skill in this process. If you ask someone to solve your problem too early, you miss the chance to understand it yourself. However, spending hours spinning your wheels may be worse.