Variable Importance Dummy Variables R


How can I determine variable importance (with the vip package in R) for categorical predictors after they have been one-hot encoded? It seems impossible to get the importance of the original categorical predictor when the model is built on its dummy variables.

I will demonstrate what I mean with the Ames Housing dataset, using two categorical predictors: Street (two levels) and Sale.Type (ten levels). I converted them from characters to factors.

library(AmesHousing)
df <- data.frame(ames_raw)

# convert characters to factors 
df <- df%>%mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

mod_lm <- train(SalePrice ~ Street + Sale.Type,
            data = df,
            method = "lm")

vip(mod_lm)

The variable importance plot ranks each dummy level separately rather than the original predictors. I can see that StreetPave is important, but I cannot see whether Street as a whole is important.


Solution

  • According to the caret documentation, variable importance for linear models is the absolute value of the t-statistic of each model parameter. So we can compute it manually, as I do in the code below.

    lm() automatically expands categorical variables into dummy variables. So, to get the importance of each original predictor, we have to sum the importances over its dummies. I did not find a way to automate this, so if you want to apply my solution to a different set of variables, be careful to choose the right elements of t.stats to sum.

    Finally, we can plot the results. I just used base R's barplot(), but you can customize the plot as you like (for example, with the ggplot2 package for better visualization).

    P.S. When you provide a reproducible example, remember to load all the needed packages.

    P.P.S. Summing over dummies may be sensitive to the base level of the factor (i.e., the level omitted from the regression). I do not know whether that could be an issue.

    library(AmesHousing)
    library(caret)
    library(dplyr)
    
    df = data.frame(ames_raw)
    
    # convert characters to factors
    df = df%>%mutate_if(is.character, as.factor)
    
    # train and split code from caret datacamp
    # Get the number of observations
    n_obs <- nrow(df)
    
    # Shuffle row indices: permuted_rows
    permuted_rows <- sample(n_obs)
    
    # Randomly order data: 
    df_shuffled <- df[permuted_rows, ]
    
    # Identify row to split on: split
    split <- round(n_obs * 0.7)
    
    # Create train
    train <- df_shuffled[1:split, ]
    
    # Create test
    test <- df_shuffled[(split + 1):n_obs, ]
    
    mod_lm <- train(SalePrice ~ Street + Sale.Type,
                    data = df,
                    method = "lm")
    
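    # Manually computing variable importance from t-statistics of the model.
    # Take absolute values first so dummies with opposite signs do not cancel.
    t.stats = abs(coef(summary(mod_lm))[, "t value"])
    imp.sale = sum(t.stats[-(1:2)])  # all Sale.Type dummies (rows after Street)
    imp.street = t.stats[2]          # the single StreetPave dummy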
    
    # Plotting.
    barplot(c(imp.sale, imp.street), names.arg = c("Sale", "Street"))
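
The manual indexing of t.stats above could probably be automated with the "assign" attribute of the model matrix, which records the term each dummy column came from. Below is a minimal sketch under some assumptions: it uses a plain lm() fit rather than a caret train object, and the built-in warpbreaks data (wool has two levels, tension has three) as a stand-in for the Ames data. The names t_abs, imp and f_stat are just illustrative. The last step is a different measure from caret's, a per-term F-statistic from drop1(), which I add because it does not depend on which level is the base level.

```r
# Plain lm on a built-in dataset with two factor predictors.
fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# Absolute t-statistic of every coefficient (intercept included).
t_abs <- abs(coef(summary(fit))[, "t value"])

# "assign" gives, for each model-matrix column, the index of the term
# it belongs to (0 = intercept). Name it by column so we can align by name,
# because coef(summary()) drops aliased (NA) coefficients.
mm <- model.matrix(fit)
assign_idx <- attr(mm, "assign")
names(assign_idx) <- colnames(mm)
term_names <- attr(terms(fit), "term.labels")

# Map each non-intercept coefficient back to its original predictor.
keep <- names(t_abs) != "(Intercept)"
groups <- term_names[assign_idx[names(t_abs)[keep]]]

# Sum |t| over the dummies of each predictor.
imp <- vapply(split(t_abs[keep], groups), sum, numeric(1))
imp <- sort(imp, decreasing = TRUE)
print(imp)

# Alternative measure (not what caret reports): F-statistic for dropping
# each whole term. It is invariant to the choice of base level.
f_tab <- drop1(fit, test = "F")
f_stat <- setNames(f_tab[["F value"]][-1], rownames(f_tab)[-1])
print(f_stat)
```

imp can be passed straight to barplot(), as in the answer's code. Keep in mind that a sum of |t| values tends to grow with the number of levels, so a factor with many dummies can look important simply because it has more terms in the sum. The F-statistic accounts for that through its degrees of freedom.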