r data.table usage in function call


I want to perform a data.table task over and over in a function call: reduce the number of levels for large categorical variables. My problem is similar to "Data.table and get() command (R)" or "pass column name in data.table using variable in R", but I can't get it to work.

Without a function call this works just fine:

# Load data.table
require(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]

# Call all these levels "Other"
levels(dt$type)[fail.min.f] <- "Other"

But wrapped in a function like this:

reduceCategorical <- function(variableName, min.freq){
  fail.min.f <- dt[, .N, variableName][N < min.freq, variableName]
  levels(dt[, variableName][fail.min.f]) <- "Other"
}

I only get errors like:

 reduceCategorical(dt$x, 3350)
Error in levels(df[, variableName][fail.min.f]) <- "Other" : 
 trying to set attribute of NULL value

And sometimes

Error is: number of levels differs

Solution

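  • One possibility is to define your own re-leveling function using data.table::setattr, which modifies dt in place. The labels must be replaced in position, because setattr leaves the factor's integer codes untouched. Something like

    DTsetlvls <- function(x, newl) {
      # relabel in position; the integer codes still point at the same slots
      setattr(x, "levels", replace(levels(x), levels(x) %in% newl, "other"))
    }
    

    Then use it within another predefined function

    f <- function(variableName, min.freq){
      fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)]
      dt[, DTsetlvls(get(variableName), fail.min.f)]
      invisible()
    }
    
    f("type", min.freq)
    levels(dt$type)  # levels below min.freq now read "other"
    

    Unlike levels<-, setattr does not merge duplicated labels, so when several levels fall below min.freq the label "other" appears more than once in the levels attribute. The alternatives below avoid this.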
    

    Some other data.table alternatives

    f <- function(var, min.freq) {
      fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)]
      dt[get(var) %in% fail.min.f, (var) := "Other"]
      dt[, (var) := factor(get(var))]
    }
    

    Or using set/.I

    f <- function(var, min.freq) {
      fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1
      set(dt, fail.min.f, var, "other")
      set(dt, NULL, var, factor(dt[[var]]))
    }
    

    Or combining with base R (doesn't modify original data set)

    f <- function(df, variableName, min.freq){
      fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)]
      levels(df[[variableName]])[fail.min.f] <- "Other"  # use the column passed in, not a hard-coded one
      df
    } 
    

    Alternatively, we could stick with characters instead: if type is a character column, you could simply do

    f <- function(var, min.freq) dt[, (var) := if(.N < min.freq) "other", by = var]
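
For checking any of the variants above, a small deterministic example may be easier to reason about than the random data. The following is only a sketch under a few assumptions: the helper name lumpRareLevels, its other argument, and the toy table are made up for illustration and are not part of the original answer. It passes the table in as an argument instead of relying on a global dt, relabels through levels<- (which merges duplicated labels), and writes the column back by reference with set.

```r
library(data.table)

# Hypothetical helper (name and arguments are illustrative only):
# lumps every level of factor column `var` that occurs fewer than
# `min.freq` times into a single level `other`.
lumpRareLevels <- function(DT, var, min.freq, other = "Other") {
  counts <- DT[, .N, by = var]                        # one row per level
  rare   <- as.character(counts[N < min.freq][[var]]) # labels below the threshold
  x  <- DT[[var]]
  lv <- levels(x)
  lv[lv %in% rare] <- other   # relabel in position
  levels(x) <- lv             # levels<- merges the duplicated labels
  set(DT, j = var, value = x) # replace the column by reference
  invisible(DT)
}

# Deterministic toy data: A occurs 5 times, B twice, C once
toy <- data.table(type = factor(rep(c("A", "B", "C"), times = c(5, 2, 1))))
lumpRareLevels(toy, "type", min.freq = 3)

levels(toy$type)  # "A" "Other"
table(toy$type)   # A: 5, Other: 3
```

Because the table is an argument, the same helper can be reused on several tables and columns, which is what the question asks for.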