Tags: r, data.table, lapply, summarization

Unable to use lapply with data.table


I am trying to create a summary of all character variables in a data.table: the total observation count, the number of missing values, the category with the highest frequency, and so on. However, I cannot get lapply to work for this. Here is a reproducible example.

library(data.table)

#Function to analyze one variable at a time
analyze_char_var <- function(x) {
  y = names(x)
  z = x[,.N,by=y]
  w = setorder(z,-N)

  out = data.table( 
    total_obs = nrow(x),
    missing_obs = sum(is.na(x)),
    unique_cats = nrow(z),
    top_cat = z[1,1],
    top_freq = z[1,2]
  )
  return(out)
}
#Function to analyze all variables. I want to use lapply instead of loop
analyze_all_char <- function(dt) {
  dt.char = dt[,sapply(dt,class)=="character", with=FALSE]
  mylist = vector('list', length(dt.char))
  for (i in 1:length(dt.char)){
    x = dt.char[,i,with=FALSE]
    mylist[[i]] = analyze_char_var(x)
  }
  return(mylist)
}

dt = data.table(
  var1 = c('a', 'a', 'b','b', 'c'),
  var2 = 1:5,
  var3 = c('low','low','high','med',NA)
)
dt.analysis = analyze_all_char(dt)

Simply calling dt.analysis = dt.char[,lapply(.SD,analyze_char_var)] fails with: Error in x[, .N, by = y] : incorrect number of dimensions. I tried some variations but could not get it to work.

EDIT: I finally got the following to work, but it looks very clumsy: it converts each input vector back into a data.table and then calls lapply on dt.char as if it were a data.frame.

test_func <- function(x) {
  dt = as.data.table(x)
  dt.summ = dt[,.N,by='x'] #by default name is x
  # I was stuck in the above line, I was trying all 
  # sort of bad tricks to get the name of grouping variable 


  dt.summ.sorted = setorder(dt.summ,-N)
  out = data.table(
    total_obs = nrow(dt),
    missing_obs = sum(is.na(dt)),
    unique_cats = nrow(dt.summ.sorted),
    top_cat = dt.summ.sorted[1,1],
    top_freq = dt.summ.sorted[1,2]
  )
  return(out)
}

dt.char = dt[,sapply(dt,class)=="character", with=FALSE]
lapply(dt.char,test_func)

Solution

  • I am trying to create a summary of all character variables in a data.table. Basically to get total observation count, missing values, category with highest frequency etc.

    Since all the columns of interest have the same type, you can use melt to reshape them into long form and then summarize by variable:

    dt.char <- Filter(is.character, dt)  # keep only the character columns
    melt(dt.char, measure.vars = names(dt.char))[, {
    
      tabula = setDT(list(value))[, .N, by="V1"][order(-N, V1)]
    
      .(
        NOBS  = .N,
        NNA   = sum(is.na(value)),
        NVALS = nrow(tabula),
        HIVAL = tabula$V1[1L],
        NHI   = tabula$N[1L]
      )
    }, by=variable]
    
    #    variable NOBS NNA NVALS HIVAL NHI
    # 1:     var1    5   0     3     a   2
    # 2:     var3    5   1     4   low   2
    

    To exclude NA as a category (showing up in NVALS and possibly HIVAL, NHI), change [, .N, by="V1"] to [!is.na(V1), .N, by="V1"] above.
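For reference, here is a sketch of the NA-excluding variant written out in full, using the example data from the question. The names long and res_no_na are only illustrative, and data.table(V1 = value) is used in place of setDT(list(value)) purely for readability.

```r
library(data.table)

dt <- data.table(
  var1 = c("a", "a", "b", "b", "c"),
  var2 = 1:5,
  var3 = c("low", "low", "high", "med", NA)
)

# Keep only the character columns, then reshape to long form
dt.char <- dt[, sapply(dt, is.character), with = FALSE]
long <- melt(dt.char, measure.vars = names(dt.char))

res_no_na <- long[, {
  # Frequency table of the non-missing values, most frequent first
  tabula <- data.table(V1 = value)[!is.na(V1), .N, by = "V1"][order(-N, V1)]
  .(
    NOBS  = .N,
    NNA   = sum(is.na(value)),
    NVALS = nrow(tabula),
    HIVAL = tabula$V1[1L],
    NHI   = tabula$N[1L]
  )
}, by = variable]

print(res_no_na)
```

With this change NVALS for var3 drops from 4 to 3, while NNA still reports the one missing value.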

    I doubt that performance is important for this task, but this should be reasonably efficient.
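As for the original error: lapply over .SD (or over a data.table directly) hands each column to the function as a plain vector, not as a one-column data.table, so x[, .N, by = y] ends up indexing an atomic vector with two dimensions. If you prefer lapply to melt, one option is to write the helper for vectors and stack the per-column results with rbindlist. The following is a minimal sketch under that assumption; summarize_char_vec and char_cols are hypothetical names, not part of the question.

```r
library(data.table)

# Hypothetical helper: summarise a single character vector
summarize_char_vec <- function(x) {
  # Frequency table (NA counted as a category), most frequent first,
  # ties broken alphabetically
  counts <- data.table(value = x)[, .N, by = value][order(-N, value)]
  data.table(
    NOBS  = length(x),
    NNA   = sum(is.na(x)),
    NVALS = nrow(counts),
    HIVAL = counts$value[1L],
    NHI   = counts$N[1L]
  )
}

dt <- data.table(
  var1 = c("a", "a", "b", "b", "c"),
  var2 = 1:5,
  var3 = c("low", "low", "high", "med", NA)
)

# lapply passes each selected column to the helper as a vector;
# rbindlist stacks the one-row results and records the column name
char_cols <- names(dt)[sapply(dt, is.character)]
res <- rbindlist(
  lapply(dt[, char_cols, with = FALSE], summarize_char_vec),
  idcol = "variable"
)

print(res)
```

On the example data this gives the same values as the melt approach: var1 has 5 observations, 0 missing, 3 categories, top category a with frequency 2; var3 has 5 observations, 1 missing, 4 categories (counting NA), top category low with frequency 2.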