Search code examples
rmatrixsubsetr-mice

Automatic subsetting of a dataframe on the basis of a prediction matrix


I have created a prediction matrix for large dataset as follows:

library(mice)
dfpredm <- quickpred(df, mincor=.3)

    A   B   C   D   E   F   G   H   I   J
A   0   1   1   1   0   1   0   1   1   0
B   1   0   0   0   1   0   1   0   0   1
C   0   0   0   1   1   0   0   0   0   0
D   1   0   1   0   0   1   0   1   0   1
E   0   1   0   1   0   1   1   0   1   0
**F 0   0   1   0   0   0   1   0   0   0**
G   0   1   0   1   0   0   0   0   0   0
H   1   0   1   0   0   1   0   0   0   1
I   0   1   0   1   1   0   1   0   0   0
J   1   0   1   0   0   1   0   1   0   0

I would like to create a subset of the original df on the basis on dfpredm.

More specifically I would like to do the following:

Let's assume that my dependent variable is F. According to the prediction matrix F is correlated with C and G. In addition, C and G are best predicted by D,E and B,D respectively.

The idea is now to create a subset of df based on the dependent variable F,for which in the F row the value is 1.

Fpredictors <- df[,(dfpredm["F",]) == 1]

But also do the same for the variables where the rows in F are 1. I am thinking of first getting the column names like this:

Fpredcol <-colnames(dfpredm[,(dfpredm["c241",]) == 1])

And then doing a for loop with these column names?

For the specific example I would like to end up with the subset.

dfsub <- df[,c("F","C","G","B","E","D")]

I would however like to automate this process. Could anyone show me how to do this?


Solution

  • Here is one strategy that seems like it would work for you:

    first_preds <- function(dat, predictor) {
      cols <- which(dat[predictor, ] == 1)
      names(dat)[cols]
    }
    
    # wrap first_preds() for getting best and second best predictors
    first_and_second_preds <- function(dat, predictor) {
      matches <- first_preds(dat, predictor)
      matches <- c(matches, unlist(lapply(matches, function(x) first_preds(dat, x))))
      c(predictor, matches) %>% unique()
    }
    
    dat[first_and_second_preds(dat, "F")] # order is not exactly the same as your output
    
      F C G D E B
    A 1 1 0 1 0 1
    B 0 0 1 0 1 0
    C 0 0 0 1 1 0
    D 1 1 0 0 0 0
    E 1 0 1 1 0 1
    F 0 1 1 0 0 0
    G 0 0 0 1 0 1
    H 1 1 0 0 0 0
    I 0 0 1 1 1 1
    J 1 1 0 0 0 0
    

    Not sure if the ordering in the result is important, but you could add the logic if it is.

    Using dat from here (a kinder way to share small R data on SO):

    dat <- read.table(
      text = "A   B   C   D   E   F   G   H   I   J
      A   0   1   1   1   0   1   0   1   1   0
      B   1   0   0   0   1   0   1   0   0   1
      C   0   0   0   1   1   0   0   0   0   0
      D   1   0   1   0   0   1   0   1   0   1
      E   0   1   0   1   0   1   1   0   1   0
      F   0   0   1   0   0   0   1   0   0   0
      G   0   1   0   1   0   0   0   0   0   0
      H   1   0   1   0   0   1   0   0   0   1
      I   0   1   0   1   1   0   1   0   0   0
      J   1   0   1   0   0   1   0   1   0   0",
      header = TRUE
    )
    

    Something a little more general that would let you use self_select predictors directly:

    all_preds <- function(dat, predictors) {
      unlist(lapply(predictors, function(x) names(dat)[which(dat[x, ] == 1 )]))
    }
    
    dat[all_preds(dat, c("A", "B"))]
    
      B C D F H I A E G J
    A 1 1 1 1 1 1 0 0 0 0
    B 0 0 0 0 0 0 1 1 1 1
    C 0 0 1 0 0 0 0 1 0 0
    D 0 1 0 1 1 0 1 0 0 1
    E 1 0 1 1 0 1 0 0 1 0
    F 0 1 0 0 0 0 0 0 1 0
    G 1 0 1 0 0 0 0 0 0 0
    H 0 1 0 1 0 0 1 0 0 1
    I 1 0 1 0 0 0 0 1 1 0