
Check if all values in a list exist for each ID in a dataframe


I have a dataframe, df, with IDs (1, 2, 3, 4), and a list, items, containing ("a", "b", "c"). I want to return only the ID(s) that contain all of "a", "b", and "c"; an ID should not be returned unless it contains every item in the list. The approach should scale to a list of n items.

    df <- data.frame(ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
                     values = c("b", "a", "c", "a", "b", "c", "a", "b", "c", "d"))
    items <- list("a", "b", "c")

df looks like:

    ID values
     1 b
     2 a
     2 c
     3 a
     3 b
     3 c
     4 a
     4 b
     4 c
     4 d

The function should return the rows for ID = (3, 4), but for ID = 4 only the rows with values = ("a", "b", "c") should be returned. It should not return ID = (1, 2). This is what I tried, but it doesn't return what I want: it currently returns an empty dataframe in which each column is NULL.

    Criteria.Match <- function(df, CriteriaList, criteria.string){
      Pat <- as.data.frame(unique(df$ID))
      colnames(Pat) <- 'ID'
      Pat.Criteria_Type <- as.data.frame(unique(df[c('ID', criteria.string)]))
      Pat$CriteriaMet <- sapply(Pat$ID, FUN = function(x){
        setequal(Pat.Criteria_Type[Pat.Criteria_Type$ID == x,],
                 as.data.frame(CriteriaList))
      })
      Pat <- Pat[which(Pat$CriteriaMet),]
      df[df$ID %in% Pat$ID,]
    }

    Criteria.Match(df, items, 'values')

Solution

  • First keep only the rows of df whose values appear in items. Then loop over each ID and check whether the number of distinct values left for that ID is at least the length of the items list. Finally, drop the IDs that fail the check and subset df to the IDs that pass.

    library(dplyr)

    # Keep only the rows whose value is one of the required items
    df <- df[df$values %in% items, ]
    matched_ids <- c()
    for (id in unique(df$ID)) {
      df_filter <- df %>% filter(ID == id)
      # An ID passes when it has as many distinct values as there are items
      if (n_distinct(df_filter$values) >= length(items)) {
        matched_ids <- c(matched_ids, id)
      }
    }
    df <- df[df$ID %in% matched_ids, ]
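
The same idea can probably be written without an explicit loop. The following is only a sketch in base R, assuming the df and items from the question; the names wanted, hits, present_by_id, keep_ids and result are made up for illustration. It checks membership directly instead of counting distinct values, so it should also cope with duplicate entries in items.

```r
# Sample data from the question
df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  values = c("b", "a", "c", "a", "b", "c", "a", "b", "c", "d")
)
items <- list("a", "b", "c")

# Flatten the list to a character vector and drop duplicates
wanted <- unique(unlist(items))

# Keep only the rows whose value is one of the wanted items
hits <- df[df$values %in% wanted, ]

# For each ID, TRUE when every wanted item appears among its values
present_by_id <- vapply(
  split(hits$values, hits$ID),
  function(v) all(wanted %in% v),
  logical(1)
)

# Names of the IDs that passed the check (split() names groups as strings)
keep_ids <- names(present_by_id)[present_by_id]

# Rows for the passing IDs, already restricted to the wanted values
result <- hits[as.character(hits$ID) %in% keep_ids, ]
print(result)
```

Because the filtering on wanted happens before the per-ID check, the extra "d" row for ID 4 is dropped and only the rows for IDs 3 and 4 with "a", "b", "c" remain.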