r permutation random-forest tabular

Is there a way to generate a table with multiple, non-unique headers and permute columns based on those headers in R?


Overall context: I aim to permute groups of variables, then measure the resulting accuracy loss on a pre-generated Random Forest model.

The initial Random Forest model was created on a dataset containing ~100,000 features. These 100,000 features can be grouped (using domain knowledge) into ~5,000 groups. Group sizes vary (different groups contain different numbers of features), and a feature is not specific to a single group (some features appear in multiple groups). The 100,000 features themselves are 100% unique.

I would like to create a tabular format with multiple header "rows" (not data rows, but header rows). Then I would like to permute based on the header row of each group. That is, when I permute "Group_1", all features that have "Group_1" as a header in any header row are permuted (regardless of other non-Group_1 headers being present).

Here is an Excel mock-up of what this dataset format would look like: [image: Excel mock-up]

From there, I believe I can permute based on the header (e.g. permute all columns that are in Group_1 then predict on the RF model, record accuracy, and repeat with the next group). Any suggestions on this step are also appreciated.

Here is a toy dataset:

Feature_1 <- c(17,3,5,98)
Feature_2 <- c(21000,23400,26800,73)
Feature_3 <- c(77,2008,445,32)
df <- data.frame(Feature_1,Feature_2,Feature_3)
df


Here is the key that informs which groups each feature is in:

Features <- c('Feature_1','Feature_2','Feature_3', 'Feature_1')
Groups   <- c('Group_1', 'Group_1', 'Group_1', 'Group_2')
key <- data.frame(Features, Groups)
key


I am certain there is more than one way to do this, but this is what my Excel-oriented brain came up with. I am happy to learn other approaches as long as they meet the requirements of the overall goal. Please keep in mind that there are thousands of features and groups, so a scalable solution is preferred.


Solution

  • Here's how I would do it. To store the groups, I would use a logical array, where rows correspond to groups, and columns to the columns of your dataframe.

    For what you've shown, that would be:

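    # Rows are groups, columns are the features (columns) of df.
    # The indices below follow the Excel mock-up, so df needs at least
    # 7 columns here; the 3-column toy df is too small for them.
    groups <- matrix(FALSE, nrow = 91, ncol = ncol(df), 
                     dimnames = list(paste0("Group_", 1:91),
                                     colnames(df)))
    groups["Group_1", 1:3] <- TRUE
    groups["Group_2", 3:6] <- TRUE
    groups["Group_3", c(1,6,7)] <- TRUE
    groups[c("Group_7", "Group_17", "Group_91"), 2] <- TRUE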
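
    If the memberships already live in a `key` data frame like the one in the question, the matrix could probably be filled in one step rather than by hand. This is only a sketch on the toy data, assuming the `Features` and `Groups` columns of `key` match the column names of `df` and the intended group labels exactly:

    ```r
    # Toy data from the question
    df <- data.frame(Feature_1 = c(17, 3, 5, 98),
                     Feature_2 = c(21000, 23400, 26800, 73),
                     Feature_3 = c(77, 2008, 445, 32))
    key <- data.frame(Features = c("Feature_1", "Feature_2", "Feature_3", "Feature_1"),
                      Groups   = c("Group_1", "Group_1", "Group_1", "Group_2"))

    # One row per distinct group, one column per feature, all FALSE to start
    group_names <- unique(as.character(key$Groups))
    groups <- matrix(FALSE, nrow = length(group_names), ncol = ncol(df),
                     dimnames = list(group_names, colnames(df)))

    # Indexing with a two-column character matrix (row name, column name)
    # sets exactly one cell per row of the key
    groups[cbind(as.character(key$Groups), as.character(key$Features))] <- TRUE
    groups
    ```

    At the scale in the question (about 5,000 groups by 100,000 features) a dense logical matrix has 5e8 cells, which is roughly 2 GB in R. If that is too much, a plain list such as `split(key$Features, key$Groups)` holds the same information in far less memory.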
    

    Then to permute "Group_1", do this:

    newdf <- df
    group <- "Group_1"
    columns <- which(groups[group,])
    for (i in columns)
      newdf[,i] <- sample(newdf[,i])
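
    To repeat this for every group and record the accuracy loss, the permutation step can be wrapped in a function and applied over the row names of `groups`. The following is a rough sketch, not a definitive implementation: `permute_group`, `score_fun`, and `labels` are hypothetical names, and `score_fun` is only a stand-in for predicting with your fitted Random Forest and computing accuracy.

    ```r
    # Toy data and group matrix with the same layout as above
    df <- data.frame(Feature_1 = c(17, 3, 5, 98),
                     Feature_2 = c(21000, 23400, 26800, 73),
                     Feature_3 = c(77, 2008, 445, 32))
    groups <- matrix(c(TRUE, TRUE,  TRUE,
                       TRUE, FALSE, FALSE),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Group_1", "Group_2"), colnames(df)))

    # Shuffle every column that belongs to `group`, each column independently
    permute_group <- function(data, groups, group) {
      for (i in which(groups[group, ])) {
        data[[i]] <- data[[i]][sample.int(nrow(data))]
      }
      data
    }

    # Hypothetical stand-in for "predict with the RF model and compute accuracy":
    # a fixed rule on Feature_1 compared against made-up labels
    labels <- c(TRUE, FALSE, FALSE, TRUE)
    score_fun <- function(data) mean((data$Feature_1 > 10) == labels)

    set.seed(1)
    baseline <- score_fun(df)

    # Accuracy loss per group: baseline minus accuracy after permuting the group
    loss <- vapply(rownames(groups), function(g) {
      baseline - score_fun(permute_group(df, groups, g))
    }, numeric(1))
    loss
    ```

    A single permutation per group gives a noisy estimate, so in practice you may want to repeat the shuffle several times per group and average the loss.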