P4D3: From the Tidyverse to Tidymodels

install.packages("tidyverse")
install.packages("tidymodels")
install.packages("visdat")

First we need to get our data built for our ML model

We should:

Download the .csv data. It is small enough that we can version it with our code.
Start our data_process.r script and import our data.

Our setup

library(tidyverse)
library(tidymodels)
library(visdat)

httpgd::hgd()
httpgd::hgd_browse()

Our initial import

dat <- read_csv("SalesBook_2013.csv") %>%
    select(NBHD, PARCEL, LIVEAREA, FINBSMNT,
        BASEMENT, YRBUILT, CONDITION, QUALITY,
        TOTUNITS, STORIES, GARTYPE, NOCARS,
        NUMBDRM, NUMBATHS, ARCSTYLE, SPRICE,
        DEDUCT, NETPRICE, TASP, SMONTH,
        SYEAR, QUALIFIED, STATUS) %>%
    rename_all(str_to_lower) %>%
    filter(
        totunits <= 2,
        yrbuilt != 0,
        condition != "None")

Let’s clean our data and establish our machine learning data

We need to do the following:

Create our target variable, before1980 (R uses factors, not 0/1).
Handle gartype and arcstyle nominal variables.
Handle quality and condition ordinal variables.
Create our garage type variables - attachedGarage, detachedGarage, carportGarage, and noGarage.
Remove duplicated parcels.
Drop the nbhd, parcel, status, qualified, gartype, and yrbuilt columns.
Fix columns with missing values.
One-hot-encode arcstyle.
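
A few of the steps above can be sketched on toy data. This is only an illustration: the toy tibble, the gartype codes ("Att", "Det", "None"), and the "before"/"after" labels are assumptions, not values taken from SalesBook_2013.csv.

```r
library(dplyr)

# Toy stand-in for the imported data; the gartype codes are hypothetical.
toy <- tibble::tibble(
    parcel  = c("p1", "p1", "p2", "p3"),
    yrbuilt = c(1975, 1975, 1995, 1960),
    gartype = c("Att", "Att", "Det", "None")
)

toy_clean <- toy %>%
    # Remove duplicated parcels, keeping the first row for each.
    distinct(parcel, .keep_all = TRUE) %>%
    mutate(
        # Target as a factor rather than 0/1 (labels are an assumption).
        before1980 = factor(
            if_else(yrbuilt < 1980, "before", "after"),
            levels = c("before", "after")),
        # Indicator columns for garage type (matching rules are assumptions).
        attachedGarage = as.integer(gartype == "Att"),
        detachedGarage = as.integer(gartype == "Det"),
        noGarage = as.integer(gartype == "None")
    ) %>%
    # Drop identifier columns and the columns the target was built from.
    select(-parcel, -gartype, -yrbuilt)

toy_clean
```

The real gartype column may combine several codes in one string, in which case str_detect() would be a better fit than the equality checks used here.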

As we move through this process, let’s leverage visdat::vis_dat() to check column types and missing values.
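
As a quick illustration of what vis_dat() gives us, here is a minimal sketch on the built-in airquality data, which stands in for our dat (an assumption made only so the example runs without the .csv).

```r
library(visdat)

# airquality ships with R and has missing values in Ozone and Solar.R.
p_types <- vis_dat(airquality)    # column types, with NA cells highlighted
p_missing <- vis_miss(airquality) # share of missing values per column

p_types
```

Re-running vis_dat(dat) after each cleaning step makes it easy to see whether a column changed type or still has gaps.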

Here are some useful snippets.

# quality case_when
quality = case_when(
    quality == "E-" ~ -0.3, quality == "E" ~ 0,
    quality == "E+" ~ 0.3, quality == "D-" ~ 0.7, 
    quality == "D" ~ 1, quality == "D+" ~ 1.3,
    quality == "C-" ~ 1.7, quality == "C" ~ 2,
    quality == "C+" ~ 2.3, quality == "B-" ~ 2.7,
    quality == "B" ~ 3, quality == "B+" ~ 3.3,
    quality == "A-" ~ 3.7, quality == "A" ~ 4,
    quality == "A+" ~ 4.3, quality == "X-" ~ 4.7,
    quality == "X" ~ 5, quality == "X+" ~ 5.3)

# condition case_when
condition = case_when(
    condition == "Excel" ~ 3,
    condition == "VGood" ~ 2,
    condition == "Good" ~ 1,
    condition == "AVG" ~ 0,
    condition == "Avg" ~ 0,
    condition == "Fair" ~ -1,
    condition == "Poor" ~ -2)

colMeans(select(dat, nocars, numbdrm, numbaths), na.rm = TRUE)
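
One way these snippets might fit together is shown below, on toy data. The toy tibble and the choice to fill missing values with column means are assumptions for illustration; the case_when() here is a shortened version of the condition snippet above.

```r
library(dplyr)
library(tidyr)

# Toy data with the same kinds of gaps we expect in the real columns.
toy <- tibble::tibble(
    condition = c("Good", "AVG", "Avg", "Poor"),
    nocars    = c(2, NA, 1, 0),
    numbdrm   = c(3, 4, NA, 2)
)

# Named vector of column means, ignoring missing values.
fill_values <- colMeans(select(toy, nocars, numbdrm), na.rm = TRUE)

toy_filled <- toy %>%
    # case_when() snippets go inside mutate().
    mutate(
        condition = case_when(
            condition == "Good" ~ 1,
            condition %in% c("AVG", "Avg") ~ 0,
            condition == "Poor" ~ -2)
    ) %>%
    # replace_na() takes a named list: one fill value per column.
    replace_na(as.list(fill_values))

toy_filled
```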

Now we can leverage step_dummy().

# recipe() takes the data through its data argument, so no pipe from dat is needed.
dat_ml <- recipe(before1980 ~ ., data = dat) %>%
    step_dummy(arcstyle) %>%
    prep() %>%
    juice()

glimpse(dat_ml)
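
Note that step_dummy() drops the first factor level by default, so it creates one fewer column than there are levels. If we want a true one-hot encoding, the one_hot argument keeps every level. A minimal sketch on toy data (the arcstyle values and target labels are made up):

```r
library(dplyr)
library(recipes)

toy <- data.frame(
    before1980 = factor(c("before", "after", "before")),
    arcstyle   = factor(c("RANCH", "SPLIT", "TWO-STORY"))
)

# Default: the first level (RANCH) becomes the reference and gets no column.
dummy_default <- recipe(before1980 ~ ., data = toy) %>%
    step_dummy(arcstyle) %>%
    prep() %>%
    bake(new_data = NULL)

# one_hot = TRUE: one indicator column per level.
dummy_one_hot <- recipe(before1980 ~ ., data = toy) %>%
    step_dummy(arcstyle, one_hot = TRUE) %>%
    prep() %>%
    bake(new_data = NULL)

names(dummy_default)
names(dummy_one_hot)
```

Here bake(new_data = NULL) returns the processed training data, the same job juice() does in our script.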

Now we can save our prepped data.

write_rds(dat_ml, "dat_ml.rds")
# arrow::write_feather(dat_ml, "dat_ml.feather")

Tidymodels and our model development

We need a few more R packages as we use tidymodels.

install.packages("skimr")
install.packages("DALEX")
install.packages("discrim")
install.packages("naivebayes")
install.packages("vip")
install.packages("xgboost")
install.packages("patchwork")
install.packages("GGally")

Now we can start our machine learning script, model.r.

library(tidyverse)
library(tidymodels)
library(DALEX)
library(skimr)
library(GGally)
library(xgboost)
library(vip)
library(patchwork)

httpgd::hgd()
httpgd::hgd_browse()

dat_ml <- read_rds("dat_ml.rds")

Training and Testing Data

The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set, but this package does not include code for modeling or calculating statistics.

set.seed(76)
dat_split <- initial_split(dat_ml, prop = 2/3, strata = before1980)

dat_train <- training(dat_split)
dat_test <- testing(dat_split)
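
To see what strata is doing, we can compare the class shares in the two pieces. The sketch below uses simulated data in place of dat_ml (the toy tibble and its column values are assumptions); the same count() logic should work on dat_train and dat_test.

```r
library(dplyr)
library(rsample)

set.seed(76)
# Simulated stand-in for dat_ml with an imbalanced target.
toy <- tibble::tibble(
    before1980 = factor(rep(c("before", "after"), times = c(60, 30))),
    livearea   = rnorm(90, mean = 1500, sd = 300)
)

toy_split <- initial_split(toy, prop = 2/3, strata = before1980)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of each class within the training and testing sets.
split_check <- bind_rows(train = toy_train, test = toy_test, .id = "set") %>%
    count(set, before1980) %>%
    group_by(set) %>%
    mutate(share = n / sum(n)) %>%
    ungroup()

split_check
```

With stratification, the share of each before1980 class should be close to the same in both sets.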

Model fit

The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

bt_model <- boost_tree() %>%
    set_engine(engine = "xgboost") %>%
    set_mode("classification") %>%
    fit(before1980 ~ ., data = dat_train)

logistic_model <- logistic_reg() %>%
    set_engine(engine = "glm") %>%
    set_mode("classification") %>%
    fit(before1980 ~ ., data = dat_train)

nb_model <- discrim::naive_Bayes() %>%
    set_engine(engine = "naivebayes") %>%
    set_mode("classification") %>%
    fit(before1980 ~ ., data = dat_train)

Feature importance

vip is an R package for constructing variable importance plots (VIPs). VIPs are part of a larger framework referred to as interpretable machine learning (IML).

vip(bt_model, num_features = 20) + vip(logistic_model, num_features = 20)

Evaluating our predictions

yardstick is a package to estimate how well models are working using tidy data principles.

First, we can build our prediction datasets.

preds_logistic <- bind_cols(
    predict(logistic_model, new_data = dat_test),
    predict(logistic_model, dat_test, type = "prob"),
    truth = pull(dat_test, before1980)
  )

# takes a minute
preds_nb <- bind_cols(
    predict(nb_model, new_data = dat_test),
    predict(nb_model, dat_test, type = "prob"),
    truth = pull(dat_test, before1980)
  )

preds_bt <- bind_cols(
    predict(bt_model, new_data = dat_test),
    predict(bt_model, dat_test, type = "prob"),
    truth = pull(dat_test, before1980)
  )

Now, we can evaluate our prediction performance.

Let’s combine all three preds_ data frames into one for combined summaries.
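
One possible way to do this is bind_rows() with an .id column, followed by grouped yardstick metrics. The sketch below uses two small hand-made prediction tables instead of our real preds_ objects; the make_preds() helper, the model names, and the "before"/"after" levels (and so the .pred_before column name) are all assumptions for illustration.

```r
library(dplyr)
library(yardstick)

lv <- c("before", "after")

# Hypothetical helper that mimics the columns of our preds_ data frames.
make_preds <- function(p_before, actual) {
    tibble::tibble(
        .pred_class  = factor(if_else(p_before >= 0.5, "before", "after"), levels = lv),
        .pred_before = p_before,
        .pred_after  = 1 - p_before,
        truth        = factor(actual, levels = lv)
    )
}

actual <- c("before", "before", "after", "after")
preds_a <- make_preds(c(0.9, 0.6, 0.4, 0.1), actual) # every class correct
preds_b <- make_preds(c(0.9, 0.3, 0.7, 0.1), actual) # two of four correct

# Stack the tables; the argument names become the values of the model column.
preds_all <- bind_rows(model_a = preds_a, model_b = preds_b, .id = "model")

# yardstick metrics respect group_by(), giving one row per model.
acc <- preds_all %>%
    group_by(model) %>%
    accuracy(truth = truth, estimate = .pred_class)

auc <- preds_all %>%
    group_by(model) %>%
    roc_auc(truth = truth, .pred_before)

bind_rows(acc, auc)
```

With our real objects, the call would look like bind_rows(logistic = preds_logistic, nb = preds_nb, bt = preds_bt, .id = "model"), and the probability column passed to roc_auc() would need to match the first level of before1980.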