Link Search Menu Expand Document

P4D4: Model Explainers R

We are going to leverage the DALEX R Package. There are other packages like fastshap and iml as well. We are using DALEX based on recommendations from Tidymodels developers.

The book

Przemyslaw Biecek and Tomasz Burzykowski provide a textbook that covers the DALEX package and the mathematics behind the functions. We don’t have time to discuss the mechanics and nuance provided in the book today.

The code

Setup

library(tidyverse)
library(tidymodels)
library(DALEX)
library(vip)
library(patchwork)


httpgd::hgd()
httpgd::hgd_browse()

I want to remove large square footage homes from our previous model_ml object and fit to a subset of the variables.

dat_ml <- read_rds("dat_ml.rds") %>%
    select(arcstyle_ONE.AND.HALF.STORY, arcstyle_ONE.STORY, numbaths,
        tasp, livearea, basement, condition, stories, quality, before1980) %>%
    filter(livearea < 5500) # 99th percentile is 5429.04

Train/Test

Now we can split our data. Notice that we have to create a new dat_exp that has our target as a 0/1 object. Also, notice that 0 is before1980.

set.seed(76)
dat_split <- initial_split(dat_ml, prop = 2/3, strata = before1980)

dat_train <- training(dat_split)
dat_test <- testing(dat_split)
dat_exp <- mutate(dat_train, before1980 = as.integer(dat_train$before1980) - 1)

head(dat_e$before1980)
head(dat_train$before1980)

Model fit

The naive Bayes model didn’t work well. Let’s just use the logistic and xgboost model.

bt_model <- boost_tree() %>%
    set_engine(engine = "xgboost") %>%
    set_mode("classification") %>%
    fit(before1980 ~ ., data = dat_train)

logistic_model <- logistic_reg() %>%
    set_engine(engine = "glm") %>%
    set_mode("classification") %>%
    fit(before1980 ~ ., data = dat_train)

vip(bt_model)
vip(logistic_model)

DALEX: Model explanation

We need to build our explainer objects using the DALEX package.

  • Notice that we have to remove our target and then use the 0/1 target from dat_exp.
explainer_bt <- DALEX::explain(
    bt_model,
    select(dat_exp, -before1980), dat_exp$before1980, label = "Boosted Trees")

explainer_logistic <- DALEX::explain(
    logistic_model,
    select(dat_exp, -before1980), dat_exp$before1980, label = "Logistic Regression")

DALEX: Model performance

We can use the two model residual plots to compare our two models.

performance_logistic <- model_performance(explainer_logistic)
performance_bt <- model_performance(explainer_bt)

plot(performance_bt, performance_logistic)
plot(performance_bt, performance_logistic, geom = "boxplot")

We can also look at the feature importance comparison as well. Notice that the feature importance and ordering are a bit different than those shown by vip(). We have defined the loss_function that is different from the default.

logistic_parts <- model_parts(explainer_logistic, 
    loss_function = loss_root_mean_square)
bt_parts <- model_parts(explainer_bt,
    loss_function = loss_root_mean_square)

plot(logistic_parts, bt_parts, max_vars = 10)

DALEX: Explain score

onehouse_before <- predict_parts(explainer_bt,
    new_observation = select(dat_exp, -before1980) %>%
        dplyr::slice(13800), type = "break_down")

onehouse_after <- predict_parts(explainer_bt,
    new_observation = select(dat_exp, -before1980) %>%
        dplyr::slice(8), type = "break_down")

plot(onehouse_after) + plot(onehouse_before)

dat_train %>% dplyr::slice(c(8, 13800))

Saving my material

See Github