P4D4: Model Explainers R
We are going to leverage the DALEX R Package. There are other packages like fastshap and iml as well. We are using DALEX based on recommendations from Tidymodels developers.
The book
Przemyslaw Biecek and Tomasz Burzykowski provide a textbook that covers the DALEX package and the mathematics behind the functions. We don’t have time to discuss the mechanics and nuance provided in the book today.
The code
Setup
library(tidyverse)
library(tidymodels)
library(DALEX)
library(vip)
library(patchwork)
httpgd::hgd()
httpgd::hgd_browse()
I want to remove large square footage homes from our previous model_ml
object and fit to a subset of the variables.
dat_ml <- read_rds("dat_ml.rds") %>%
select(arcstyle_ONE.AND.HALF.STORY, arcstyle_ONE.STORY, numbaths,
tasp, livearea, basement, condition, stories, quality, before1980) %>%
filter(livearea < 5500) # 99th percentile is 5429.04
Train/Test
Now we can split our data. Notice that we have to create a new dat_exp
that has our target as a 0/1
object. Also, notice that 0
is before1980.
set.seed(76)
dat_split <- initial_split(dat_ml, prop = 2/3, strata = before1980)
dat_train <- training(dat_split)
dat_test <- testing(dat_split)
dat_exp <- mutate(dat_train, before1980 = as.integer(dat_train$before1980) - 1)
head(dat_e$before1980)
head(dat_train$before1980)
Model fit
The naive Bayes model didn’t work well. Let’s just use the logistic and xgboost model.
bt_model <- boost_tree() %>%
set_engine(engine = "xgboost") %>%
set_mode("classification") %>%
fit(before1980 ~ ., data = dat_train)
logistic_model <- logistic_reg() %>%
set_engine(engine = "glm") %>%
set_mode("classification") %>%
fit(before1980 ~ ., data = dat_train)
vip(bt_model)
vip(logistic_model)
DALEX: Model explanation
We need to build our explainer objects using the DALEX package.
- Notice that we have to remove our target and then use the
0/1
target fromdat_exp
.
explainer_bt <- DALEX::explain(
bt_model,
select(dat_exp, -before1980), dat_exp$before1980, label = "Boosted Trees")
explainer_logistic <- DALEX::explain(
logistic_model,
select(dat_exp, -before1980), dat_exp$before1980, label = "Logistic Regression")
DALEX: Model performance
We can use the two model residual plots to compare our two models.
performance_logistic <- model_performance(explainer_logistic)
performance_bt <- model_performance(explainer_bt)
plot(performance_bt, performance_logistic)
plot(performance_bt, performance_logistic, geom = "boxplot")
We can also look at the feature importance comparison as well. Notice that the feature importance and ordering are a bit different than those shown by vip()
. We have defined the loss_function
that is different from the default.
logistic_parts <- model_parts(explainer_logistic,
loss_function = loss_root_mean_square)
bt_parts <- model_parts(explainer_bt,
loss_function = loss_root_mean_square)
plot(logistic_parts, bt_parts, max_vars = 10)
DALEX: Explain score
onehouse_before <- predict_parts(explainer_bt,
new_observation = select(dat_exp, -before1980) %>%
dplyr::slice(13800), type = "break_down")
onehouse_after <- predict_parts(explainer_bt,
new_observation = select(dat_exp, -before1980) %>%
dplyr::slice(8), type = "break_down")
plot(onehouse_after) + plot(onehouse_before)
dat_train %>% dplyr::slice(c(8, 13800))
Saving my material
- Fork the course_guide
- Clone my project code. Please keep these files private.
Final Coding Challenge