
Return a data frame subset based on similar (not identical) elements in a vector?


I have a dataframe (dim 2914 x 6) where one column is a vector of animal groups and species abbreviations, e.g. "bird_F.pw", and I have a separate vector of a few species abbreviations, e.g. "F.pw". I am trying to extract all rows of data where the animal group and species abbreviation from the data frame are similar to the abbreviation (i.e., I don't know the prefixes). I want to use operators like %in% and %like%, but I'm having trouble finding a way to generate non-identical matches.

Here's a sample dataframe:

df<-cbind(
c("A","B","C","D","E"),
c(1:5),
c("insect_F.vp","bird_L.ts","insect_P.qr","insect_V.cl","bird_H.dw"))
colnames(df) <- c("season","survey_id","pollinator")

And here's the vector of abbreviations I would like to search for within that dataframe:

abbrevs <- c("L.ts","P.qr","H.dw")

My anticipated outcome is:

output <- cbind(c("B","C","E"),c(2:3,5),c("bird_L.ts","insect_P.qr","bird_H.dw"))
colnames(output) <- colnames(df)

Solution

  • If you don't want to bother with regular expressions, you can use these alternatives. Here's a tidyverse one:

    find_any_fixed <- function(x, patterns) {
      purrr::map(patterns, ~stringr::str_detect(x, stringr::fixed(.x))) |> purrr::reduce(`|`)
    }
    

    And here's a base R version (the native pipe |> needs R 4.1 or later, and the _ placeholder needs R 4.2 or later):

    find_any_fixed <- function(x, patterns) {
      Map(function(.x) grepl(.x, x, fixed=TRUE), patterns) |> Reduce(`|`, x=_)
    }
    

    In both of these solutions I make sure to use the "fixed" option because "." has a special meaning in regular expressions: it matches any single character. Since you seem to want to match the period exactly, you need to tell the search functions that the pattern is a literal string, not a regular expression.
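
    To make the difference concrete, here is a small sketch using base R's grepl(). The second string, "bird_Lxts", is made up for illustration and does not appear in the question's data:

    ```r
    x <- c("bird_L.ts", "bird_Lxts")

    # As a regular expression, "." matches any single character,
    # so "L.ts" also matches the made-up string "bird_Lxts"
    regex_match <- grepl("L.ts", x)
    regex_match
    # [1] TRUE TRUE

    # With fixed = TRUE the pattern is taken literally,
    # so only the string containing an actual period matches
    fixed_match <- grepl("L.ts", x, fixed = TRUE)
    fixed_match
    # [1]  TRUE FALSE
    ```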

    You can use this to find matching patterns. Note that cbind() in the question builds a character matrix, not a data.frame, so these examples assume you have converted it first (df <- as.data.frame(df)). For example:

    find_any_fixed(df$pollinator, abbrevs)
    # [1] FALSE  TRUE  TRUE FALSE  TRUE
    

    And you can subset with it:

    # Tidyverse (needs library(dplyr) for %>% and filter())
    df %>% filter(find_any_fixed(pollinator, abbrevs))
    # Base R
    subset(df, find_any_fixed(pollinator, abbrevs))
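
    Putting the pieces together, the following is a minimal end-to-end sketch in base R. It assumes the sample data is built directly with data.frame() instead of cbind(), and it restates the helper with lapply() rather than the pipe so that it does not depend on R 4.2; both choices are mine for illustration, not part of the original answer:

    ```r
    # Sample data from the question, built as a data.frame rather than a cbind() matrix
    df <- data.frame(
      season = c("A", "B", "C", "D", "E"),
      survey_id = 1:5,
      pollinator = c("insect_F.vp", "bird_L.ts", "insect_P.qr", "insect_V.cl", "bird_H.dw")
    )
    abbrevs <- c("L.ts", "P.qr", "H.dw")

    # Same idea as the base R helper above, written without the pipe:
    # one literal (fixed) search per pattern, combined with a logical OR
    find_any_fixed <- function(x, patterns) {
      Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
    }

    # Keep only the rows whose pollinator contains any of the abbreviations
    result <- df[find_any_fixed(df$pollinator, abbrevs), ]
    result$pollinator
    # [1] "bird_L.ts"   "insect_P.qr" "bird_H.dw"
    ```

    The kept rows are B, C and E, which matches the anticipated outcome in the question.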