r, spline, gam

How to find columns with three or fewer distinct values


I'm using the Boston Housing data set from the MASS package, and working with splines from the gam package in R. However, an error is returned with this code:

library(gam)
library(MASS)
library(tidyverse)

Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) + s(rad) + s(tax) + s(ptratio) + s(black) + s(lstat), data = Boston)

The error message is:

A smoothing variable encountered with 3 or less unique values; at least 4 needed

The variable causing the issue is chas, which has only two values, 0 and 1.

How can I test whether a column has three or fewer unique values, so that it can be excluded from the spline analysis?


Solution

  • Would this work?

    You can use dplyr::n_distinct() to count the unique values in each column.

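    # purrr (map_dbl) and dplyr (n_distinct) are attached by tidyverse;
    # Boston comes from MASS
    library(tidyverse)
    library(MASS)
    
    # Number of unique values in each column
    n_unique_vals <- map_dbl(Boston, n_distinct)
    
    # Names of columns with >= 4 unique vals
    keep <- names(n_unique_vals)[n_unique_vals >= 4]
    
    # Model data; dplyr::select is spelled out because MASS also exports select()
    gam_data <- Boston %>%
      dplyr::select(all_of(keep))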
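
    As a possible follow-up, the same check can be used to build the model formula rather than to drop columns, so that a low-cardinality variable such as chas stays in the model as an ordinary linear term. This is only a sketch in base R: smooth_formula and its min_unique argument are hypothetical names introduced for illustration, not part of gam or dplyr.

```r
library(MASS)  # provides the Boston data frame

# Hypothetical helper: wrap predictors that have at least `min_unique`
# distinct values in s(), and leave the others as plain linear terms.
smooth_formula <- function(data, response, min_unique = 4) {
  predictors <- setdiff(names(data), response)

  # Count distinct values per predictor (base R equivalent of n_distinct)
  n_unique <- vapply(data[predictors],
                     function(x) length(unique(x)),
                     integer(1))

  # Build the right-hand-side terms as text
  rhs <- unname(ifelse(n_unique >= min_unique,
                       paste0("s(", predictors, ")"),
                       predictors))

  # Assemble response ~ term1 + term2 + ...
  reformulate(rhs, response = response)
}

form <- smooth_formula(Boston, "medv")
print(form)
```

    With the gam package attached, the result could then be passed on as gam(form, data = Boston). Whether chas should be kept as a linear term or left out altogether is a modelling decision; the helper only avoids wrapping it in s().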