I wrote this function to transform a dataset of categorical variables, containing only the observed combinations of categories, into a dataset that contains all possible combinations of the categories of those variables. I also want a column "y" with the frequency of each row, i.e. how many times that pattern was observed in my original dataset.
library(dplyr)

datafreq <- function(dati) {
  dati <- na.omit(dati)                # drop rows with missing values
  dati[] <- lapply(dati, factor)       # turn every column into a factor
  dati <- dati %>%
    group_by_all(.drop = FALSE) %>%    # keep unobserved level combinations
    summarise(y = n()) %>%             # y = frequency of each combination
    ungroup()
  return(as.data.frame(dati))
}
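For instance, on a toy dataset with two variables (made-up data, just to show the expected output) I get:

toy <- data.frame(a = c("x", "x", "y"), b = c("s", "t", "s"))
datafreq(toy)
#>   a b y
#> 1 x s 1
#> 2 x t 1
#> 3 y s 1
#> 4 y t 0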
Take, for example, the votes dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records.
It has 435 rows, each with a combination of values of the 16 categorical variables.
I would like to obtain a dataset with 3^16 = 43,046,721 rows, one for each combination (observed or not) of the categories of the variables (each variable has 3 possible categories). A variable y should be created that counts the number of times each combination has been observed in the data (0 if it has never been observed).
The function I wrote works well and does what it should on very small datasets. However, when used on a large dataset like votes, R crashes. So I am asking: is there a more efficient function I can use, or how can I improve the efficiency of mine?
Thanks!
With data.table:
library(data.table)

dt <- fread("C:/temp/house-votes-84.data", header = FALSE)
u <- lapply(dt, unique)                          # unique values of each column
a <- rev(cumprod(c(1, rev(lengths(u)[-1]))))     # place values of a mixed-radix encoding
dt2 <- do.call(CJ, lapply(lengths(u), seq.int))  # cross join: one row per possible combination of value indices
# encode each observed row as a single integer, then tabulate the frequencies over all combinations
dt2[, counts := tabulate(colSums(t(mapply(function(i) match(dt[[i]], u[[i]]), seq_along(dt)) - 1)*a) + 1, .N)]
nrow(dt) == sum(dt2$counts)
#> [1] TRUE
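For intuition, a holds the place values of a mixed-radix encoding of the combinations. A small check with three toy variables of three levels each (not the real votes columns):

u_toy <- rep(list(c("y", "n", "?")), 3)      # three variables, three levels each
rev(cumprod(c(1, rev(lengths(u_toy)[-1]))))
#> [1] 9 3 1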
The values in dt2 are the indices of the values in u, so dt2[i, j] corresponds to u[[j]][dt2[i, j]].
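If you want the actual category labels rather than the indices, you can map each index column back through u, e.g. (a sketch along the same lines; res is a made-up name):

# replace each index column with the corresponding value from u
res <- as.data.table(Map(function(vals, idx) vals[idx], u, dt2[, !"counts"]))
res[, counts := dt2$counts]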