P2D2: Functions and automation
There are no routine statistical questions, only questionable statistical routines.
Sir David Cox
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
John Tukey
Let’s pick an article
Understanding joins
The Relational Data chapter of R for Data Science provides clear details on joins in R.
- Mutating joins are the primary joins; I almost always use dplyr::left_join(). A mutating join combines variables from two tables: it first matches observations by their keys, then copies variables from one table to the other.
- Understanding types of joins
- Problems with duplicate keys
- Mapping terminology
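As a rough illustration of the points above, the sketch below joins two made-up tables. The table names (`flights_toy`, `airlines_toy`, `airlines_dup`) and their values are invented for this example and do not come from the course data. It shows how a left join keeps every row of the first table, how an inner join drops unmatched keys, and how a duplicate key in the second table adds rows.

```r
library(dplyr)

# Made-up tables for illustration only.
flights_toy  <- tibble(flight = c(1, 2, 3), carrier = c("AA", "DL", "ZZ"))
airlines_toy <- tibble(carrier = c("AA", "DL"), name = c("American", "Delta"))

# left_join(): keep every row of flights_toy; unmatched keys get NA in `name`.
joined_left <- left_join(flights_toy, airlines_toy, by = "carrier")

# inner_join(): keep only rows whose key appears in both tables.
joined_inner <- inner_join(flights_toy, airlines_toy, by = "carrier")

# Duplicate keys: "AA" now appears twice in the lookup table,
# so the single "AA" flight is repeated once per match.
airlines_dup <- bind_rows(airlines_toy,
                          tibble(carrier = "AA", name = "American (duplicate)"))
joined_dup <- left_join(flights_toy, airlines_dup, by = "carrier")

nrow(joined_left)
nrow(joined_inner)
nrow(joined_dup)
```

Comparing the row counts before and after a join is a quick way to notice an unexpected duplicate key.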
Join for our project
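dat <- read_csv("https://github.com/fivethirtyeight/guns-data/raw/master/full_data.csv") %>%
  # drop the unnamed row-index column: readr calls it X1 (older versions) or ...1 (2.0 and later)
  select(-any_of(c("X1", "...1")))
dat_counts <- dat %>%
  count(race, year)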
# Used this information to build the values.
# https://www.census.gov/quickfacts/fact/table/US/POP010220
dat_pop <- tibble(
  table_var = c("Asian/Pacific Islander",
                "Black", "Hispanic",
                "Native American/Native Alaskan", "White"),
  N = 331449281 * c(.061, .134, .185, .013, .763))
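One way the two tables might then be joined is sketched below. The key columns have different names (`race` in the counts, `table_var` in the population table), so the `by` argument maps one to the other. To keep the sketch runnable offline, `dat_counts_toy` holds placeholder counts that are NOT real figures, and `rate_per_100k` is a column name assumed here for illustration.

```r
library(dplyr)

# Population table as defined in the notes.
dat_pop <- tibble(
  table_var = c("Asian/Pacific Islander",
                "Black", "Hispanic",
                "Native American/Native Alaskan", "White"),
  N = 331449281 * c(.061, .134, .185, .013, .763))

# Placeholder stand-in for dat_counts (columns race, year, n as produced by count()).
# These numbers are invented for the sketch.
dat_counts_toy <- tibble(
  race = c("Black", "White"),
  year = c(2012, 2012),
  n    = c(100, 200))

# Match race (left table) to table_var (right table), then scale by population.
dat_rates <- dat_counts_toy %>%
  left_join(dat_pop, by = c("race" = "table_var")) %>%
  mutate(rate_per_100k = n / N * 1e5)

dat_rates
```

With the real `dat_counts`, any `NA` in `N` after the join would point to a race label that does not match `table_var` exactly.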
Waffle Chart Example
library(tidyverse)
library(waffle)
httpgd::hgd()
httpgd::hgd_browse()
storms %>%
  filter(year >= 2010) %>%
  count(year, status) -> storms_df
ggplot(storms_df, aes(fill = status, values = n)) +
  geom_waffle(color = "white", size = .25, n_rows = 10, flip = TRUE) +
  facet_wrap(~year, nrow = 1, strip.position = "bottom") +
  scale_x_discrete() +
  scale_y_continuous(labels = function(x) x * 10, # make this multiplier the same as n_rows
                     expand = c(0,0)) +
  ggthemes::scale_fill_tableau(name=NULL) +
  coord_equal() +
  labs(
    title = "Faceted Waffle Bar Chart",
    subtitle = "{dplyr} storms data",
    x = "Year",
    y = "Count"
  ) +
  theme_minimal(base_family = "Roboto Condensed") +
  theme(panel.grid = element_blank(), axis.ticks.y = element_line()) +
  guides(fill = guide_legend(reverse = TRUE))
Using Heatmaps to display data
library(tidyverse)
library(ggfittext)
httpgd::hgd()
httpgd::hgd_browse()
storms %>%
  count(year, status) -> storms_df_all
storms_df_all %>%
  ggplot(aes(x = year, y = status, fill = n)) +
  geom_tile()
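Since ggfittext is loaded above, one possible next step is to print each count inside its tile. The sketch below is an assumption about where the example was heading, not the author's code. It restricts the data to 2010 onward and treats year as a factor (both choices made here so the tiles are wide enough for labels), and it uses `geom_fit_text()` with `contrast = TRUE` so the label color adapts to the tile fill.

```r
library(dplyr)
library(ggplot2)
library(ggfittext)

# Counts per year and status, limited to recent years to keep tiles readable.
storms_df_recent <- storms %>%
  filter(year >= 2010) %>%
  count(year, status)

# Discrete x and y axes let geom_fit_text() size each label to its tile.
p_heat <- ggplot(storms_df_recent,
                 aes(x = factor(year), y = status, fill = n, label = n)) +
  geom_tile() +
  geom_fit_text(contrast = TRUE) +
  labs(x = "Year", y = "Status", fill = "Count")

p_heat
```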
Getting data for 2016-2019
Let’s look at the CDC_parser.R script.
Configuring our .gitignore
The .gitignore file is part of the secret sauce of Git: it lists the files and folders that Git should leave untracked. Pluralsight provides a clean description for our conversation.
In our case, we want to ignore the data/ folder to limit large file issues with our repository.
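The entry can be typed into .gitignore by hand. As a minimal sketch of doing the same thing from R, the helper below (`add_gitignore_entry()`, a hypothetical name introduced here) appends an entry only when it is not already listed. The demonstration writes to a throwaway file in the session's temporary directory, so no real repository is touched.

```r
# Hypothetical helper: append an entry to a .gitignore file if it is missing.
add_gitignore_entry <- function(entry, path = ".gitignore") {
  existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
  if (!entry %in% existing) {
    writeLines(c(existing, entry), path)
  }
  invisible(readLines(path, warn = FALSE))
}

# Demonstrate on a temporary file rather than a real repository.
demo_ignore <- file.path(tempdir(), ".gitignore")
unlink(demo_ignore)                         # start from a clean slate
add_gitignore_entry("data/", demo_ignore)
add_gitignore_entry("data/", demo_ignore)   # second call changes nothing
readLines(demo_ignore)
```

In a project, calling the helper with the default `path` from the repository root would add the line `data/` to the real file.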
Leveraging functions
You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
R for Data Science, Functions chapter
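To make the rule of thumb concrete, here is a small sketch in the spirit of that chapter: a rescaling step that would otherwise be pasted once per column is wrapped in a function. The data frame `df_toy` and its values are made up for this example.

```r
# Rescale a numeric vector to the range [0, 1], ignoring missing values.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data for illustration.
df_toy <- data.frame(a = c(1, 5, 10), b = c(2, 4, NA), c = c(0, 50, 100))

# One function call per column instead of three pasted copies of the formula.
df_toy$a <- rescale01(df_toy$a)
df_toy$b <- rescale01(df_toy$b)
df_toy$c <- rescale01(df_toy$c)

df_toy
```

If the rescaling rule ever changes, only the body of `rescale01()` needs to be edited.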
Fixing their CDC_parser() function
- add a folder_path argument.
- document our function with roxygen2
#' Pull CDC Death data
#' @param
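The roxygen2 header above is left unfinished in these notes. As a sketch of what a fully documented function with a `folder_path` argument can look like, the example below defines a small stand-in. It is not the real `CDC_parser()`: the function name `cdc_file_path()`, the `file_name` argument, and the default folder are all assumptions made for illustration.

```r
#' Build a local path for a CDC file (hypothetical sketch, not CDC_parser())
#'
#' @param file_name Name of the file to store, for example "deaths.csv".
#' @param folder_path Folder the file should live in; created if it is missing.
#' @return The full path to `file_name` inside `folder_path`, as a string.
cdc_file_path <- function(file_name, folder_path = "data") {
  # Create the target folder on first use so later writes do not fail.
  if (!dir.exists(folder_path)) {
    dir.create(folder_path, recursive = TRUE)
  }
  file.path(folder_path, file_name)
}

# Usage in a temporary folder so nothing is written into the project.
demo_folder <- file.path(tempdir(), "cdc_demo")
demo_path <- cdc_file_path("deaths.csv", folder_path = demo_folder)
demo_path
```

Each `@param` tag names one argument and describes it, and `@return` describes the value; roxygen2 turns these comments into the help page.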
Understanding temp files and downloads
# `url` must already hold the address of the file to download
temp <- tempfile()
download.file(url, temp, quiet = TRUE)
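The pattern above can be wrapped in a function that downloads to a temporary file, reads it, and removes the file afterwards. The sketch below is one possible arrangement, not the course's parser: `read_remote_csv()` is a hypothetical name, and the demonstration uses a local `file://` URL (which assumes a Unix-like path) so that it runs without network access.

```r
# Hypothetical helper: download a CSV to a temp file, read it, then clean up.
read_remote_csv <- function(url) {
  temp <- tempfile(fileext = ".csv")
  on.exit(unlink(temp), add = TRUE)  # delete the temp file even if reading fails
  download.file(url, temp, quiet = TRUE, mode = "wb")
  read.csv(temp)
}

# Offline demonstration: write a small made-up CSV and fetch it via file://.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), src, row.names = FALSE)
demo <- read_remote_csv(paste0("file://", src))
demo
```

Temporary files live in the session's `tempdir()` and are removed when R exits, so they are a reasonable place for downloads that should not end up in the repository.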