
P2D2: Functions and automation

There are no routine statistical questions, only questionable statistical routines.

Sir David Cox

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

John Tukey

Let’s pick an article

Readings

Understanding joins

The Relational Data chapter of R for Data Science provides clear details on joins in R.

Join for our project

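library(tidyverse)

# The first CSV column is an unnamed row index. readr names it `X1` (readr < 2.0)
# or `...1` (readr >= 2.0), so drop whichever one is present.
dat <- read_csv("https://github.com/fivethirtyeight/guns-data/raw/master/full_data.csv") %>%
    select(-any_of(c("X1", "...1")))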

dat_counts <- dat %>%
    count(race, year)
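# Population shares below come from Census QuickFacts:
# https://www.census.gov/quickfacts/fact/table/US/POP010220
# The shares sum to more than 1 because Hispanic origin overlaps with the race categories.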
dat_pop <- tibble(
    table_var = c("Asian/Pacific Islander",  
        "Black",  "Hispanic",  
        "Native American/Native Alaskan",  "White"), 
    N =  331449281 *c(.061, .134, .185, .013, .763))
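
The two tables above are shaped for a join, so here is a minimal sketch of one, assuming the goal is to attach each group's population to its death count. The counts are made-up placeholders so the block runs without a download, and `counts_demo`, `pop_demo`, and `rate_per_100k` are illustrative names rather than part of the project code.

```r
library(dplyr)

# Placeholder counts standing in for `dat %>% count(race, year)`;
# the numbers are invented for illustration only.
counts_demo <- tibble(
    race = c("Black", "White", "Other"),
    year = c(2012, 2012, 2012),
    n = c(100, 200, 5))

# Two rows in the same shape as dat_pop (share * total population).
pop_demo <- tibble(
    table_var = c("Black", "White"),
    N = 331449281 * c(.134, .763))

# The key columns have different names, so map them explicitly.
joined <- counts_demo %>%
    left_join(pop_demo, by = c("race" = "table_var")) %>%
    mutate(rate_per_100k = n / N * 1e5)

# Rows with no match in pop_demo keep their count but get NA for N.
joined
```

A `left_join()` keeps every row of the counts table, which makes unmatched labels easy to spot as `NA` values before computing rates.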

Waffle Chart Example

library(tidyverse)
library(waffle)

httpgd::hgd()
httpgd::hgd_browse()

storms %>% 
  filter(year >= 2010) %>% 
  count(year, status) -> storms_df

ggplot(storms_df, aes(fill = status, values = n)) +
  geom_waffle(color = "white", size = .25, n_rows = 10, flip = TRUE) +
  facet_wrap(~year, nrow = 1, strip.position = "bottom") +
  scale_x_discrete() + 
  scale_y_continuous(labels = function(x) x * 10, # make this multiplier the same as n_rows
                     expand = c(0,0)) +
  ggthemes::scale_fill_tableau(name=NULL) +
  coord_equal() +
  labs(
    title = "Faceted Waffle Bar Chart",
    subtitle = "{dplyr} storms data",
    x = "Year",
    y = "Count"
  ) +
  theme_minimal(base_family = "Roboto Condensed") +
  theme(panel.grid = element_blank(), axis.ticks.y = element_line()) +
  guides(fill = guide_legend(reverse = TRUE))

Using Heatmaps to display data

library(tidyverse)
library(ggfittext)

httpgd::hgd()
httpgd::hgd_browse()

storms %>% 
  count(year, status) -> storms_df_all

storms_df_all %>%
    ggplot(aes(x = year, y = status, fill = n)) +
    geom_tile()  

Getting data for 2016-2019

Let’s look at the CDC_parser.R script.

Configuring our .gitignore

This file is a part of the secret sauce of Git. Pluralsight provides a clean description for our conversation.

In our case, we want to ignore the data/ folder to limit large file issues with our repository.
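
The entry itself is the single line `data/` in the `.gitignore` file at the root of the repository. As a rough sketch, the same edit can be scripted from R. `add_gitignore_entry()` is a hypothetical helper written for this example, and the demo writes to a temporary directory so no real repository is touched.

```r
# Hypothetical helper: add a pattern to a .gitignore file unless it is already listed.
add_gitignore_entry <- function(pattern, path = ".gitignore") {
    existing <- if (file.exists(path)) readLines(path, warn = FALSE) else character()
    if (!pattern %in% existing) {
        writeLines(c(existing, pattern), path)
    }
    invisible(path)
}

# Demonstrated in a temporary directory instead of a real project.
demo_ignore <- file.path(tempdir(), ".gitignore")
add_gitignore_entry("data/", demo_ignore)
readLines(demo_ignore)
```

Ignoring a folder only affects untracked files. Anything under data/ that was already committed stays tracked until it is removed from the index.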

Leveraging functions

You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).

Functions chapter, R for Data Science
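
As a small illustration of that rule, the sketch below wraps a rescaling formula in a function instead of pasting it once per column. It is modeled on the `rescale01()` example from that chapter, and the data frame `df` is made up for the demo.

```r
# Rescale a numeric vector to the range [0, 1], ignoring missing values.
rescale01 <- function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data for the demonstration.
df <- data.frame(a = c(1, 5, 10), b = c(2, 4, NA))

# One function call per column replaces pasted copies of the formula.
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df
```

If the formula ever needs to change, it now changes in one place rather than in every pasted copy.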

Fixing their CDC_parser() function

  1. Add a folder_path argument.
  2. Document our function with roxygen2.
#' Pull CDC Death data
#' @param
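
To show what a finished roxygen2 block with a `folder_path` argument might look like, here is a minimal sketch. It is not the real `CDC_parser()`: the function `cdc_file_path()`, its `year` argument, and the `cdc_<year>.csv` file name are all hypothetical and chosen only so the example runs on its own.

```r
#' Build the local path for a yearly CDC file (hypothetical sketch)
#'
#' @param year Four-digit year of the file, e.g. 2016.
#' @param folder_path Folder where the file should be stored; created if missing.
#' @return The full path to the file inside `folder_path`.
cdc_file_path <- function(year, folder_path = "data") {
    if (!dir.exists(folder_path)) {
        dir.create(folder_path, recursive = TRUE)
    }
    file.path(folder_path, paste0("cdc_", year, ".csv"))
}

# Demo in a temporary directory so the working directory is left alone.
demo_path <- cdc_file_path(2016, folder_path = file.path(tempdir(), "cdc_demo"))
demo_path
```

Each `@param` tag takes the argument name followed by a one-line description, and `@return` describes what the caller gets back.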

Understanding temp files and downloads

# `url` must already hold the address of the file to download.
temp <- tempfile()
download.file(url, temp, quiet = TRUE)
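
A fuller sketch of the same pattern, assuming the download is a CSV: fetch it into a temporary file, read it, and delete the temporary file afterwards. `read_csv_via_temp()` is a hypothetical helper, and the demo uses a local `file://` URL (in the form that works on Unix-alikes) so it runs without network access.

```r
# Hypothetical helper: download a CSV to a temporary file, read it, then clean up.
read_csv_via_temp <- function(url) {
    temp <- tempfile(fileext = ".csv")
    on.exit(unlink(temp), add = TRUE)  # delete the temp file when the function exits
    download.file(url, temp, quiet = TRUE)
    read.csv(temp)
}

# Demo without network access: write a small local CSV and fetch it by file:// URL.
src <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), src, row.names = FALSE)
demo <- read_csv_via_temp(paste0("file://", src))
demo
```

Registering the cleanup with `on.exit()` means the temporary file is removed even if the download or the read fails partway through.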