r function performance dataset categorical-data

How to improve the efficiency of a function


I wrote this function to transform a dataset of categorical variables, which contains only the observed combinations of categories, into a dataset that contains every possible combination of the variables' categories. I also want a column "y" holding the frequency of each row, that is, how many times that pattern occurs in my original dataset.

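library(dplyr)

datafreq <- function(dati) {
  # Drop incomplete rows, then make every column a factor so that
  # .drop = FALSE keeps the combinations that were never observed
  dati <- na.omit(dati)
  dati[] <- lapply(dati, factor)
  dati <- dati %>%
    group_by_all(.drop = FALSE) %>%
    summarise(y = n()) %>%
    ungroup()
  return(as.data.frame(dati))
}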

For example, take the votes dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records.

It has 435 rows, each with a combination of values of the 16 categorical variables.

I would like to obtain a dataset with 3^16 rows, one for each combination (observed or not) of the categories of the variables (each variable has 3 possible categories). It should also contain a variable y that counts how many times each combination was observed in the data (0 if it was never observed).
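To make the target output concrete, here is a minimal sketch on a toy dataset, using base R's table() rather than the function above. The data frame toy, its values, and the level vector lev are hypothetical, chosen only for illustration. With 16 variables the same call would have to build 3^16 = 43,046,721 cells, so this shows the desired shape of the result, not a scalable solution.

```r
# Toy data: two categorical variables, three observed rows (hypothetical values)
toy <- data.frame(
  v1 = c("y", "n", "y"),
  v2 = c("y", "y", "y"),
  stringsAsFactors = FALSE
)

# Declare the full set of categories for every variable, observed or not
lev <- c("n", "y", "?")
toy[] <- lapply(toy, factor, levels = lev)

# table() counts every combination of levels; as.data.frame() gives one row
# per combination, with the count stored in column y
out <- as.data.frame(table(toy), responseName = "y")

nrow(out)   # 9 rows, one per combination (3^2)
sum(out$y)  # 3, the number of rows in toy
```

Combinations that never occur, such as v1 = "?" with v2 = "n", appear in out with y equal to 0.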

The function I wrote works and does what it should on very small datasets. However, when I use it on a larger dataset such as votes, R crashes. Is there a more efficient function I can use, or a way to make mine more efficient?
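A rough size estimate helps explain why the full table is demanding. The sketch below assumes 4 bytes per stored value (factor codes and integer counts) and counts only the finished result, so it is a lower bound; intermediate copies made during grouping would need more.

```r
n_vars   <- 16                  # categorical variables in votes
n_levels <- 3                   # categories per variable
n_rows   <- n_levels^n_vars     # 43046721 combinations

# Assumption: 4 bytes per factor code and per count, 16 variables plus y
bytes <- n_rows * (n_vars + 1) * 4
bytes / 1024^3                  # size of the final table in GiB
```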

Thanks!


Solution

  • With data.table:

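    library(data.table)

    dt <- fread("C:/temp/house-votes-84.data", header = FALSE)
    # Distinct values observed in each column
    u <- lapply(dt, unique)
    # Mixed-radix weights: the last column varies fastest, matching CJ's row order
    a <- rev(cumprod(c(1, rev(lengths(u)[-1]))))
    # Every combination, stored as integer indices into u
    dt2 <- do.call(CJ, lapply(lengths(u), seq.int))
    # Zero-based index of each observed value within its column's unique values
    idx <- mapply(function(i) match(dt[[i]], u[[i]]), seq_along(dt)) - 1
    # Row of dt2 that each observed row maps to, tallied over all rows of dt2
    dt2[, counts := tabulate(colSums(t(idx) * a) + 1, .N)]
    nrow(dt) == sum(dt2$counts)
    #> [1] TRUE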
    

    The values in dt2 are indices into u, not the category labels themselves: the entry in row i, column j of dt2 stands for the label u[[j]][dt2[[j]][i]].
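The same steps can be traced on a toy table, followed by one possible way to translate the indices back into labels. The table toy and the names idx and labelled are hypothetical, introduced only for this sketch; the decoding step is not part of the answer above.

```r
library(data.table)

# Hypothetical toy data standing in for the votes file
toy <- data.table(
  V1 = c("y", "n", "y", "?"),
  V2 = c("n", "n", "n", "y")
)

u   <- lapply(toy, unique)                       # V1: y, n, ?   V2: n, y
a   <- rev(cumprod(c(1, rev(lengths(u)[-1]))))   # weights 2, 1
dt2 <- do.call(CJ, lapply(lengths(u), seq.int))  # 3 * 2 = 6 index rows

# Zero-based position of each observed value, one matrix column per variable
idx <- mapply(function(i) match(toy[[i]], u[[i]]), seq_along(toy)) - 1
dt2[, counts := tabulate(colSums(t(idx) * a) + 1, .N)]
dt2$counts
#> [1] 2 0 1 0 0 1

# Decode: replace each index by the label it points to in u
labelled <- as.data.table(
  Map(function(vals, i) vals[i], u, dt2[, names(u), with = FALSE])
)
labelled[, counts := dt2$counts]
```

Here the first row of labelled is V1 = "y", V2 = "n" with a count of 2, matching the two such rows in toy. Decoding the full 16-variable table this way would create character columns over every combination, so it may be preferable to decode only the rows of interest.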