Overall context: I want to permute groups of variables and then measure the resulting loss in accuracy when predicting with a previously fitted Random Forest model.
The initial Random Forest model was created on a dataset containing ~100,000 features. These 100,000 features can be grouped (using domain knowledge) into ~5,000 groups. Group size is variable (different groups contain different numbers of features), and a feature is not restricted to a single group (some features appear in multiple groups). The 100,000 features are 100% unique.
I would like to create a tabular format with multiple header "rows" (not data rows, but header rows). Then I would like to permute based on the header row of each group. That is, when I permute "Group_1", all features that have "Group_1" as a header in any header row are permuted (regardless of any other, non-Group_1 headers being present).
Here is an Excel mock-up of what this dataset format would look like:
From there, I believe I can permute based on the header (e.g. permute all columns that are in Group_1 then predict on the RF model, record accuracy, and repeat with the next group). Any suggestions on this step are also appreciated.
Here is a toy dataset:
Feature_1 <- c(17,3,5,98)
Feature_2 <- c(21000,23400,26800,73)
Feature_3 <- c(77,2008,445,32)
df <- data.frame(Feature_1,Feature_2,Feature_3)
df
Here is the key that informs which groups each feature is in:
Features <- c('Feature_1','Feature_2','Feature_3', 'Feature_1')
Groups <- c('Group_1', 'Group_1','Group_1', 'Group_2')
key <- data.frame(Features, Groups)
key
I am certain there is more than one way to do this, but this is the way my excel-oriented brain can come up with. I am happy to learn other approaches as long as they meet the requirements of the overall goal. Please keep in mind that there are thousands of features and groups, so a scalable solution is preferred.
Here's how I would do it. To store the groups, I would use a logical array, where rows correspond to groups, and columns to the columns of your dataframe.
For what you've shown, that would be:
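# One row per group, one column per feature; TRUE marks group membership
groups <- matrix(FALSE, nrow = 91, ncol = ncol(df),
                 dimnames = list(paste0("Group_", 1:91),
                                 colnames(df)))
# The column indices below follow the mock-up, so df needs at least 7 feature columns
groups["Group_1", 1:3] <- TRUE
groups["Group_2", 3:6] <- TRUE
groups["Group_3", c(1,6,7)] <- TRUE
groups[c("Group_7", "Group_17", "Group_91"), 2] <- TRUE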
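With thousands of groups, filling the matrix by hand is not practical. Here is a minimal sketch that derives it from the `key` data frame in the question instead. It assumes the `'Group1'` entry in the key was meant to be `'Group_1'`; `group_names` and `idx` are names I made up for illustration.

```r
# Toy data and key from the question (assuming 'Group1' was meant as 'Group_1')
df <- data.frame(Feature_1 = c(17, 3, 5, 98),
                 Feature_2 = c(21000, 23400, 26800, 73),
                 Feature_3 = c(77, 2008, 445, 32))
key <- data.frame(Features = c("Feature_1", "Feature_2", "Feature_3", "Feature_1"),
                  Groups   = c("Group_1", "Group_1", "Group_1", "Group_2"))

# Empty membership matrix: one row per group, one column per feature
group_names <- unique(as.character(key$Groups))
groups <- matrix(FALSE, nrow = length(group_names), ncol = ncol(df),
                 dimnames = list(group_names, colnames(df)))

# A two-column character matrix indexes by (row name, column name) pairs,
# so every (group, feature) pair in the key is set in a single assignment.
idx <- cbind(as.character(key$Groups), as.character(key$Features))
groups[idx] <- TRUE
groups
```

One caveat on scale: R stores each logical in 4 bytes, so a dense 5,000 x 100,000 matrix takes roughly 2 GB. If that is too much, `split(key$Features, key$Groups)` gives a named list of feature names per group that carries the same information.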
Then to permute "Group_1", do this:
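newdf <- df
group <- "Group_1"
columns <- which(groups[group, ])  # indices of the columns that belong to this group
for (i in columns) {
  newdf[, i] <- sample(newdf[, i])  # shuffle each column independently of the others
}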
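To repeat this for every group and record the accuracy loss, something like the sketch below might work. `group_permutation_loss`, `toy_model` and `toy_predict` are hypothetical names, and the "model" is a stand-in threshold rule so the example runs without any packages; with a real Random Forest fit you would pass the fitted object and, presumably, `predict` as `predict_fn`. `group_list` is a named list of column names per group, which can be built with `split(key$Features, key$Groups)`.

```r
# Hypothetical helper: drop in accuracy after permuting each group's columns.
# predict_fn(model, newdata) must return one predicted label per row.
group_permutation_loss <- function(model, X, y, group_list, predict_fn = predict) {
  baseline <- mean(predict_fn(model, X) == y)
  sapply(group_list, function(cols) {
    Xp <- X
    # Shuffle each column of the group independently, as in the loop above
    Xp[cols] <- lapply(X[cols], sample)
    baseline - mean(predict_fn(model, Xp) == y)
  })
}

# Stand-in data and model (assumptions for illustration, not a real Random Forest)
set.seed(1)
X <- data.frame(Feature_1 = rnorm(200),
                Feature_2 = rnorm(200),
                Feature_3 = rnorm(200))
y <- ifelse(X$Feature_1 > 0, "a", "b")
toy_model <- list(threshold = 0)
toy_predict <- function(model, newdata) {
  ifelse(newdata$Feature_1 > model$threshold, "a", "b")
}

group_list <- list(Group_1 = c("Feature_1", "Feature_2"),
                   Group_2 = "Feature_3")
loss <- group_permutation_loss(toy_model, X, y, group_list, toy_predict)
loss
```

Here the stand-in model only looks at `Feature_1`, so permuting `Group_2` costs nothing while permuting `Group_1` does. A single shuffle is noisy, so averaging the loss over several permutations per group is probably worthwhile. If the features within a group are correlated, you may prefer to shuffle the rows of the whole group together, e.g. `Xp[cols] <- X[sample(nrow(X)), cols, drop = FALSE]`, which keeps the within-group relationships intact.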