r random dplyr filter data-wrangling

R: filter by multiple OR conditions


I need to filter a dataframe by multiple "OR" conditions. Let me explain.

I have a dataframe (total) with 1 million observations. One of the columns (id) contains id numbers ranging from 1 to 6000. This means that many of the rows have duplicate id numbers.

I previously drew a random sample of 500 unique id numbers.

random.id <- sample(abc, 500, replace=F)

I want to keep only those rows in my original dataset where the id column matches any of the values in random.id. In other words, I want to filter with many "OR" conditions. But since there are 500 conditions, I can't type them all out.

I've tried using the %in% operator.

filtered <- total %>%
  filter(id %in% random.id)

If the command worked as intended, then the new filtered dataframe should contain 500 unique id values.

length(unique(filtered$id))

Unfortunately, this number is way under 500. I have redrawn the random sample for random.id several times, but the number of unique ids in the new dataframe is always under 500.

What should I do?


Solution

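  • Since you're using dplyr, here's a version of @Jon Spring's answer in dplyr syntax.
    It does look like your issue is related to the contents of abc: any sampled value that never occurs in total$id cannot match a row, and if abc repeats an id, the sample can contain fewer than 500 distinct ids. Either case leaves you with fewer than 500 unique ids after filtering.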

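    library(dplyr)
    
    # 500 distinct ids drawn from a pool that lies entirely inside 1:6000
    random_id <- sample(1:1000, 500, replace = FALSE)
    # 1 million rows with ids between 1 and 6000, so ids repeat many times
    total <- tibble(id = sample(1:6000, 1e6, replace = TRUE))
    
    # keep every row whose id is one of the sampled ids
    filtered <- total %>% filter(id %in% random_id)
    
    n_distinct(filtered$id) # 500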
    

    Note: dplyr::n_distinct saves having to make two calls to length and unique.
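
    To check whether abc really is the culprit, you could compare the sampled ids against the ids actually present in total. This is only a sketch: I don't know what your abc contains, so the abc below is a made-up pool that is deliberately wider than the ids in total, and total is a smaller toy version of your data.

```r
library(dplyr)

set.seed(42)  # only so this sketch is reproducible

# Toy stand-ins for the question's objects (assumptions, not the real data).
total <- tibble(id = sample(1:6000, 1e5, replace = TRUE))
abc   <- 1:20000  # hypothetical pool that includes ids absent from total

random.id <- sample(abc, 500, replace = FALSE)

# Sampled ids that never occur in total$id; these can never match a row.
missing_ids <- setdiff(random.id, total$id)
length(missing_ids)

# The filter can only return the sampled ids that do exist in total.
n_distinct(filter(total, id %in% random.id)$id)

# Drawing from the ids actually present guarantees that every id can match.
random.id.ok <- sample(unique(total$id), 500, replace = FALSE)
filtered <- total %>% filter(id %in% random.id.ok)

n_distinct(filtered$id)  # 500
```

    If length(missing_ids) is greater than zero on your real data, sampling from unique(total$id) instead of abc should give you exactly 500 unique ids after filtering.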