I need to filter a dataframe by multiple "OR" conditions. Let me explain.
I have a dataframe (total) with 1 million observations. One of the columns (id) contains id numbers ranging from 1 to 6000. This means that many of the rows have duplicate id numbers.
I previously drew a random sample of 500 unique id numbers.
random.id <- sample(abc, 500, replace=F)
I want to keep only those rows in my original dataset where the id column matches any of the values in random.id. In other words, I want to filter with many "OR" conditions. But since there are 500 conditions, I can't type them all out.
I've tried using the %in% operator.
filtered <- total %>%
  filter(id %in% random.id)
If the command worked as intended, then the new filtered dataframe should contain 500 unique id values.
length(unique(filtered$id))
Unfortunately, this number is way under 500. I redo the random sample for random.id, but the number of unique ids in the new dataframe is always under 500.
What should I do?
Since you're using dplyr, here's a version of @Jon Spring's answer in dplyr syntax.
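It does look like your issue is related to the contents of abc. If abc contains repeated values (for example, the raw id column) or values that never occur in total$id, fewer than 500 distinct ids can match. When the ids are sampled from a vector of distinct, valid values, the same filter returns all 500: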
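library(dplyr)
# Draw 500 distinct ids from a range that contains no duplicates
random_id <- sample(1:1000, 500, replace = FALSE)
# Simulate 1 million rows whose ids range from 1 to 6000
total <- tibble(id = sample(1:6000, 1e6, replace = TRUE))
# Keep only the rows whose id is one of the sampled ids
filtered <- total %>% filter(id %in% random_id)
n_distinct(filtered$id) # 500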
Note: dplyr::n_distinct saves having to make two calls to length and unique.
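To check whether abc is the culprit, here is a minimal sketch. It assumes (the question does not say this) that abc was the raw id column, so it contains repeated values; the seed and the simulated total are illustrative only. It shows two quick diagnostics on random.id and one possible fix, which is to sample from the distinct ids.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Simulated stand-in for the real data: 1 million rows, ids from 1 to 6000
total <- tibble(id = sample(1:6000, 1e6, replace = TRUE))

# Assumption: abc was the raw id column, which contains repeated values
abc <- total$id
random.id <- sample(abc, 500, replace = FALSE)

# replace = FALSE picks 500 distinct *positions*, not 500 distinct values,
# so the same id can be drawn more than once
anyDuplicated(random.id) > 0   # TRUE means some ids were drawn more than once
all(random.id %in% total$id)   # FALSE would mean some ids never occur in total

# Possible fix: sample from the distinct ids, so all 500 values differ
random_id <- sample(unique(total$id), 500, replace = FALSE)
filtered <- total %>% filter(id %in% random_id)
n_distinct(filtered$id) # 500
```

If the first check returns TRUE or the second returns FALSE on your real data, that would explain a count under 500 even though the %in% filter itself is working correctly.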