P4D3: From the Tidyverse to Tidymodels
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("visdat")
First we need to get our data built for our ML model
We should:
- Download the .csv data. It is small enough that we can version it with our code.
- Start our data_process.r script and import our data.
Our setup
library(tidyverse)
library(tidymodels)
library(visdat)
httpgd::hgd()
httpgd::hgd_browse()
Our initial import
dat <- read_csv("SalesBook_2013.csv") %>%
  select(NBHD, PARCEL, LIVEAREA, FINBSMNT,
         BASEMENT, YRBUILT, CONDITION, QUALITY,
         TOTUNITS, STORIES, GARTYPE, NOCARS,
         NUMBDRM, NUMBATHS, ARCSTYLE, SPRICE,
         DEDUCT, NETPRICE, TASP, SMONTH,
         SYEAR, QUALIFIED, STATUS) %>%
  rename_all(str_to_lower) %>%
  filter(
    totunits <= 2,
    yrbuilt != 0,
    condition != "None")
Let’s clean our data and establish our machine learning data
We need to do the following:
- Create our target variable (R uses factors, not 0/1).
- Handle the gartype and arcstyle nominal variables.
- Handle the quality and condition ordinal variables.
- Create our garage type variables: attachedGarage, detachedGarage, carportGarage, and noGarage.
- Remove duplicated parcels.
- Drop the nbhd, parcel, status, qualified, gartype, and yrbuilt columns.
- Fix columns with missing values.
- One-hot-encode arcstyle.
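A few of these steps can be sketched on a toy table. This is only an illustration under stated assumptions: the rows are invented, the gartype codes ("Att", "Det/CP", "None") are hypothetical and may not match the codes in the real file, and the "before"/"after" level names are one possible choice for the target.

```r
library(dplyr)
library(stringr)

# Toy rows standing in for the real data; the gartype codes are hypothetical.
toy <- tibble::tibble(
  parcel  = c("p1", "p1", "p2", "p3"),
  yrbuilt = c(1975, 1975, 1995, 1960),
  gartype = c("Att", "Att", "Det/CP", "None")
)

toy_clean <- toy %>%
  # Keep the first row seen for each parcel.
  distinct(parcel, .keep_all = TRUE) %>%
  mutate(
    # Classification target stored as a factor rather than 0/1.
    before1980 = factor(if_else(yrbuilt < 1980, "before", "after"),
                        levels = c("before", "after")),
    # 0/1 indicators derived from the (assumed) gartype codes.
    attachedGarage = as.integer(str_detect(gartype, "Att")),
    detachedGarage = as.integer(str_detect(gartype, "Det")),
    carportGarage  = as.integer(str_detect(gartype, "CP")),
    noGarage       = as.integer(gartype == "None")
  ) %>%
  # Drop the identifier and the columns the indicators and target replace.
  select(-parcel, -gartype, -yrbuilt)
```

Before adapting the patterns, check the actual codes in the data, for example with count(dat, gartype).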
As we move through this process, let’s leverage visdat::vis_dat() to check column types and missing values.
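A minimal sketch of that check, using an invented three-row data frame in place of dat. vis_dat() colors each cell by column type and marks NA cells; vis_miss() reports the share of missing values per column.

```r
library(visdat)

# Invented rows; the real call would be vis_dat(dat).
toy <- data.frame(
  nocars  = c(2, NA, 1),
  quality = c("C", "B+", NA),
  stringsAsFactors = FALSE
)

p_types <- vis_dat(toy)   # column types, with missing cells highlighted
p_miss  <- vis_miss(toy)  # percentage of missing values per column

p_types
p_miss
```

Rerunning the plot after each cleaning step makes it easy to see whether a column changed type or still has gaps.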
Here are some useful snippets.
# quality case_when
quality = case_when(
  quality == "E-" ~ -0.3, quality == "E" ~ 0,
  quality == "E+" ~ 0.3, quality == "D-" ~ 0.7,
  quality == "D" ~ 1, quality == "D+" ~ 1.3,
  quality == "C-" ~ 1.7, quality == "C" ~ 2,
  quality == "C+" ~ 2.3, quality == "B-" ~ 2.7,
  quality == "B" ~ 3, quality == "B+" ~ 3.3,
  quality == "A-" ~ 3.7, quality == "A" ~ 4,
  quality == "A+" ~ 4.3, quality == "X-" ~ 4.7,
  quality == "X" ~ 5, quality == "X+" ~ 5.3)
# condition case_when
condition = case_when(
  condition == "Excel" ~ 3,
  condition == "VGood" ~ 2,
  condition == "Good" ~ 1,
  condition == "AVG" ~ 0,
  condition == "Avg" ~ 0,
  condition == "Fair" ~ -1,
  condition == "Poor" ~ -2)
colMeans(select(dat, nocars, numbdrm, numbaths), na.rm = TRUE)
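One way to use those means is to fill the missing values column by column. The sketch below assumes plain mean imputation is acceptable for these three columns; a median or a rounded value may suit count-like columns better. The toy values are invented.

```r
library(dplyr)
library(tidyr)

# Invented values with one gap per column.
toy <- tibble::tibble(
  nocars   = c(2, NA, 1),
  numbdrm  = c(3, 4, NA),
  numbaths = c(NA, 2, 1)
)

toy_filled <- toy %>%
  mutate(across(
    c(nocars, numbdrm, numbaths),
    # Replace each NA with the mean of the observed values in that column.
    ~ replace_na(.x, mean(.x, na.rm = TRUE))
  ))
```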
Now we can leverage step_dummy().
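# The recipe takes the data through its data argument, so dat is not piped in as well.
dat_ml <- recipe(before1980 ~ ., data = dat) %>%
  step_dummy(arcstyle) %>%
  prep() %>%
  # bake(new_data = NULL) returns the prepped training data; it supersedes juice().
  bake(new_data = NULL)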
glimpse(dat_ml)
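Note that step_dummy() by default drops one level as the reference, so a factor with k levels yields k - 1 indicator columns. If a full one-hot encoding is wanted, the one_hot argument can be set to TRUE. The sketch below compares the two on an invented table; the arcstyle values are hypothetical.

```r
library(dplyr)
library(recipes)

# Invented rows; the arcstyle labels are hypothetical.
toy <- data.frame(
  before1980 = factor(c("before", "after", "before", "after")),
  arcstyle   = factor(c("RANCH", "TWO-STORY", "SPLIT", "RANCH")),
  livearea   = c(1200, 2100, 1500, 1300)
)

# Default encoding: one level is dropped as the reference.
k_minus_1 <- recipe(before1980 ~ ., data = toy) %>%
  step_dummy(arcstyle) %>%
  prep() %>%
  bake(new_data = NULL)

# Full one-hot encoding: one indicator column per level.
one_hot <- recipe(before1980 ~ ., data = toy) %>%
  step_dummy(arcstyle, one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)

n_default <- sum(startsWith(names(k_minus_1), "arcstyle_"))
n_one_hot <- sum(startsWith(names(one_hot), "arcstyle_"))
```

Tree-based models are usually indifferent to the choice, while an unpenalized logistic regression generally needs the reference-level form to avoid perfectly collinear columns.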
Now we can save our prepped data.
write_rds(dat_ml, "dat_ml.rds")
# arrow::write_feather(dat_ml, "dat_ml.feather")
Tidymodels and our model development
We need a few more R packages as we use tidymodels.
install.packages("skimr")
install.packages("DALEX")
install.packages("discrim")
install.packages("naivebayes")
install.packages("vip")
install.packages("xgboost")
install.packages("patchwork")
install.packages("GGally")
Now we can start our machine learning script, model.r.
library(tidyverse)
library(tidymodels)
library(DALEX)
library(skimr)
library(GGally)
library(xgboost)
library(vip)
library(patchwork)
httpgd::hgd()
httpgd::hgd_browse()
dat_ml <- read_rds("dat_ml.rds")
Training and Testing Data
The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set, but this package does not include code for modeling or calculating statistics.
set.seed(76)
dat_split <- initial_split(dat_ml, prop = 2/3, strata = before1980)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)
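To see what the strata argument does, it helps to compare class shares in the two pieces. The sketch below uses an invented, imbalanced table rather than dat_ml; the level names are assumptions.

```r
library(rsample)
library(dplyr)

set.seed(76)
# Invented data: 60 "before" rows and 30 "after" rows.
toy <- tibble::tibble(
  before1980 = factor(rep(c("before", "after"), times = c(60, 30))),
  livearea   = runif(90, 800, 3000)
)

toy_split <- initial_split(toy, prop = 2/3, strata = before1980)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# With stratification, both pieces keep roughly the original class balance.
prop_train <- mean(toy_train$before1980 == "before")
prop_test  <- mean(toy_test$before1980 == "before")
```

Without strata, a small or imbalanced data set can end up with noticeably different class shares in training and testing.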
Model fit
The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.
bt_model <- boost_tree() %>%
  set_engine(engine = "xgboost") %>%
  set_mode("classification") %>%
  fit(before1980 ~ ., data = dat_train)
logistic_model <- logistic_reg() %>%
  set_engine(engine = "glm") %>%
  set_mode("classification") %>%
  fit(before1980 ~ ., data = dat_train)
nb_model <- discrim::naive_Bayes() %>%
  set_engine(engine = "naivebayes") %>%
  set_mode("classification") %>%
  fit(before1980 ~ ., data = dat_train)
Feature importance
vip is an R package for constructing variable importance plots (VIPs). VIPs are part of a larger framework referred to as interpretable machine learning (IML).
vip(bt_model, num_features = 20) + vip(logistic_model, num_features = 20)
Evaluating our predictions
yardstick is a package to estimate how well models are working using tidy data principles.
First, we can build our prediction datasets.
preds_logistic <- bind_cols(
  predict(logistic_model, new_data = dat_test),
  predict(logistic_model, dat_test, type = "prob"),
  truth = pull(dat_test, before1980)
)
# takes a minute
preds_nb <- bind_cols(
  predict(nb_model, new_data = dat_test),
  predict(nb_model, dat_test, type = "prob"),
  truth = pull(dat_test, before1980)
)
preds_bt <- bind_cols(
  predict(bt_model, new_data = dat_test),
  predict(bt_model, dat_test, type = "prob"),
  truth = pull(dat_test, before1980)
)
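The three blocks above repeat the same pattern, so it could be wrapped in a small function. make_preds() below is a hypothetical helper, not part of the original script, and it is shown on invented data with a single predictor so the sketch runs on its own.

```r
library(dplyr)
library(parsnip)

# Hypothetical helper: class predictions, class probabilities, and the truth column.
make_preds <- function(model, new_data, truth_col) {
  bind_cols(
    predict(model, new_data = new_data),
    predict(model, new_data = new_data, type = "prob"),
    truth = new_data[[truth_col]]
  )
}

set.seed(76)
# Invented data: smaller homes are more likely to be labeled "before".
toy <- tibble::tibble(
  livearea   = runif(200, 800, 3000),
  before1980 = factor(
    if_else(livearea + rnorm(200, sd = 600) < 1900, "before", "after"),
    levels = c("before", "after")
  )
)

toy_model <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(before1980 ~ livearea, data = toy)

toy_preds <- make_preds(toy_model, toy, "before1980")
```

With the real objects, the call would look like make_preds(bt_model, dat_test, "before1980"). The probability column names follow the factor levels of the target, so they will differ if before1980 is coded with other labels.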
Now, we can evaluate our prediction performance.
Let’s combine all three preds_ data frames into one for combined summaries.
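A minimal sketch of that combination, assuming the prediction frames share the columns .pred_class, .pred_before, .pred_after, and truth. The level names, and therefore the probability column names, are assumptions, and the frames here are filled with random numbers purely so the code runs on its own.

```r
library(dplyr)
library(yardstick)

set.seed(76)
# Synthetic stand-in for preds_logistic, preds_nb, and preds_bt.
fake_preds <- function(n = 100) {
  p_before <- runif(n)
  tibble::tibble(
    .pred_class  = factor(if_else(p_before > 0.5, "before", "after"),
                          levels = c("before", "after")),
    .pred_before = p_before,
    .pred_after  = 1 - p_before,
    truth        = factor(sample(c("before", "after"), n, replace = TRUE),
                          levels = c("before", "after"))
  )
}

# Stack the frames; the argument names become the values of the model column.
preds_all <- bind_rows(
  logistic = fake_preds(),
  nb       = fake_preds(),
  bt       = fake_preds(),
  .id = "model"
)

# yardstick metrics respect dplyr groups, giving one row per model.
acc_by_model <- preds_all %>%
  group_by(model) %>%
  accuracy(truth = truth, estimate = .pred_class)

auc_by_model <- preds_all %>%
  group_by(model) %>%
  roc_auc(truth, .pred_before)
```

With the real frames, the bind_rows() call would take preds_logistic, preds_nb, and preds_bt in place of the fake_preds() calls.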