r dplyr fuzzy-search

How to find if text strings in one column are in another column?


Below is the sample data:

 df1 <- c("Board of Accountancy", "Board of Economists", "Board of Medicine")
 df2 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law")

The task at hand is twofold. First, search df2 for the text strings found in df1. The entries of df2 are left alone, and any name from df1 that is not found anywhere in df2 is added, giving an end result such as df3 below. This is related to a question I asked yesterday, but on closer examination my first job is to find whether the names in df1 are found in df2.

df3: "State Board of Accountancy", "The State Board of Economists", "State Board of Law", "Board of Medicine"

Solution

  • c(df2, df1[colSums(sapply(df1, grepl, df2)) < 1])
    # [1] "State Board of Accountancy"    "The State Board of Economists" "State Board of Law"            "Board of Medicine"
    df3
    # [1] "State Board of Accountancy"    "The State Board of Economists" "State Board of Law"            "Board of Medicine"


    Walk-through:

    • grepl by itself accepts only a single pattern, so we need to iterate over each pattern in df1; we do this with sapply
    • sapply returns a logical matrix with one column per pattern in df1 and one row per string in df2, so we need to find whether anything in a column (each df1 pattern) is a match; colSums(.) < 1 (aka == 0) means that nothing matched, and subsetting df1[..] on this gives the elements of df1 for which no match was found in df2
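
To make the intermediate steps easier to inspect, here is a minimal sketch that builds the match matrix and the per-pattern counts separately. The names m, hits, and unmatched are introduced here for illustration only. In the matrix that sapply returns, each column belongs to one df1 pattern and each row to one df2 string, which is why the per-pattern count below is taken over columns.

```r
df1 <- c("Board of Accountancy", "Board of Economists", "Board of Medicine")
df2 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law")

# Logical matrix: one column per df1 pattern, one row per df2 string
m <- sapply(df1, grepl, df2)
dim(m)       # rows follow df2, columns follow df1
colnames(m)  # the df1 patterns

# For each df1 pattern, the number of df2 strings that contain it
hits <- colSums(m)

# Patterns with zero hits do not appear anywhere in df2
unmatched <- df1[hits < 1]

# Final result: df2 followed by the unmatched df1 entries
result <- c(df2, unmatched)
result
```

Printing m is a quick way to check which pattern matched which string before trusting the final subset.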

    Corrected data:

    df1 <- c("Board of Accountancy", "Board of Economists", "Board of Medicine")
    df2 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law")
    df3 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law", "Board of Medicine")
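
If this needs to be reused, the same idea could be wrapped in a small function. The sketch below is one possible variant, not part of the original answer: the name append_unmatched is hypothetical, and it assumes the names should be compared as literal text, so it passes fixed = TRUE to grepl. Without that argument grepl treats each name as a regular expression, and characters such as "." or "(" would not be matched literally.

```r
# Hypothetical helper: append to `targets` every element of `patterns`
# that does not occur as a literal substring of any element of `targets`.
append_unmatched <- function(patterns, targets) {
  found <- vapply(
    patterns,
    function(p) any(grepl(p, targets, fixed = TRUE)),  # literal, not regex
    logical(1)
  )
  c(targets, patterns[!found])
}

df1 <- c("Board of Accountancy", "Board of Economists", "Board of Medicine")
df2 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law")
df3 <- c("State Board of Accountancy", "The State Board of Economists", "State Board of Law", "Board of Medicine")

result <- append_unmatched(df1, df2)
identical(result, df3)
# [1] TRUE
```

vapply is used instead of sapply here so that the result is always a logical vector, even when patterns is empty.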