Search code examples
rsubset

How can I use a value extracted from a dataframe to specify columns to subset in R?


I have a dataframe that I want to subset inside a function so that only rows where both columns are either 1 or NA remain. For df:

df <- data.frame(a = c(1,1,0,NA,0,1), 
                 b = c(0,1,0,1,0, NA),
                 c = c(0,0,0,0,0,0))

I want:

   a  b  c
2  1  1  0
4 NA  1  0
6  1 NA  0

The problem I'm having is I have many columns with names that change. So this works well:

subset(df, (is.na(a) | a == 1) & (is.na(b) | b == 1))

but when column names 'a' and 'b' become 'd' and 'f' during the operation of the function it breaks. Specifying by column index works more robustly:

subset(df, (is.na(df[,1]) | df[,1] == 1) & (is.na(df[,2]) | df[,2] == 1))

But is cumbersome, and if a previous processing step messes up and column 'c' ends up before 'a' or 'b' I end up subsetting by the wrong columns.

I also have another dataframe that specifies what the column names to subset by will be:

cro_df <- data.frame(pop = c('c1', 'c2'),
                     p1 = c('a', 'd'),
                     p2 = c('b', 'f'))
  pop p1 p2
1  c1  a  d
2  c2  b  f

I would like to be able to extract the column names from that dataframe to use in my subset function, e.g.:

col1 <- cro_df[cro_df[,'pop']=='c1', 'p1']
subset(df, is.na(col1) | col1 == 1)

This returns an empty dataframe. I have tried turning col1 into a symbol and a factor with no success:

subset(df, as.symbol(col1) == 1)
subset(df, sym(col1) == 1)
subset(df, as.factor(col1) == 1)

And they all return:

[1] a b c
<0 rows> (or 0-length row.names)

Is there a way I can specify my columns to subset using the second dataframe cro_df?


Solution

  • You can use filter and if_all from the dplyr package.

    Select in the manner you find best suited for your case the names of the columns you want to filter. In my case I just created a variable cols that contains 'a' and 'b'.

    Then I check all_of the column names in cols and filter the rows if_all statements are TRUE:

    library(dplyr) # packageVersion("dplyr") >= 1.1.0
    
    cols <- c('a', 'b')
    filter(df, if_all(all_of(cols), \(x) is.na(x) | x == 1))
    #>    a  b c
    #> 1  1  1 0
    #> 2 NA  1 0
    #> 3  1 NA 0
    

    If you assign different column names to cols you can reuse the same code.