Search code examples
rfiltermatchsubset

Subsetting dataframe based on lots of columns in R


thanks in advance for any help! I have a dataframe with lots of columns, and I'd like to filter it to only display rows that match specific values in each of those columns. I can easily produce the conditions I'd like to filter by, but I feel like there's an easier way to actually filter the data than by filtering for each value individually.

For a reproducible example, I'll generate a dataset that has lots of columns that will produce matching entries using anagrams. I'm not just looking for all duplicates, but each of them have duplicates -- I'm looking for duplicates of a given test value (test below). Glad for any help!

# load my important libraries
library(dplyr); library(stringr)


corpus <- tibble(word  = c("tables", "stables",
                           "elbow", "below", "bowel",
                           "ascot", "coats", "coast", "tacos",
                           "aridest", "astride", "staider", "tardies", "tirades"))

corpus[letters] <- NA

corpus %<>%
  pivot_longer(cols = letters) %>%
  mutate(value = str_count(word, name)) %>%
  pivot_wider(names_from = name, values_from = value)


# I'm choosing a completely arbitrary row (8) as an example value to look for.
test <- corpus[8, letters]

I want to determine which rows in my dataframe, the corpus, have the same combination of letters as my test value. I can certainly filter the corpus one letter at a time, but it's so inelegant:

corpus %>%
  filter(a == test$a,
         b == test$b,
         c == test$c,
         d == test$d,
         e == test$e,
         # etc . . .
         x == test$x,
         y == test$y,
         z == test$z)

I feel like there has to be an easier way to do this using either the match, which, or %in% commands, so that it could match even 100 columns if necessary without me writing out each individual one. My ideal solution would look something like this:

corpus %>%
  which(test)

or

which(corpus[letters] == test[letters])

and give me the results:

# A tibble: 4 × 27
  word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p     q     r     s
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
# … with 7 more variables: t <int>, u <int>, v <int>, w <int>, x <int>, y <int>, z <int>

Solution

  • Using the (paste(x, collapse="")) function to collapse all the columns into a single string and then compare.

    corpus[apply(X=corpus[-1], MARGIN=1, FUN=function(x) {paste(test, collapse = "")==paste(x, collapse = "")}),]
    # A method without the anonymous function and the comparison outside of apply
    # Small speed improvement:
    corpus[apply(corpus[-1],1,paste,collapse="")==paste(test, collapse = ""),]
    # Do.Call method, Fastest yet:
    corpus[ do.call(paste0, corpus[-1]) == paste(test, collapse = ""),]
    
      word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p
    1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
    2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
    3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
    4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0