
Find matching strings between two vectors in R


I have two vectors in R. I want to find partial matches between them.

My Data

The first one is from a dataset named muc, which contains 6400 street names. muc$name looks like:

muc$name = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße",...)

The other vector is d_vector. It contains around 1400 names.

d_vector = c("Abel", "Abendroth", "von Abercron", "Abetz", "Abicht", "Abromeit", ...)

I want to find all the street names that contain a name from d_vector anywhere in the street name.

First, I made some general adjustments after importing the CSV data (as variable d):

d_vector <- unlist(d$name)
d_vector <- as.vector(as.matrix(d_vector))

What I tried so far

  • Then I tried grep, collapsing d_vector into one long string separated by | for a regex search:

result <- unique(grep(paste(d_vector, collapse="|"), muc$name, value=TRUE, ignore.case = TRUE))
result

But result contains all of the street names.

  • I also tried agrep, which returned an out-of-memory error.

  • When I tried d_vector %in% muc$name, it returned just one TRUE and hundreds of FALSE, which doesn't seem right.

Do you have any suggestions as to where my mistake could lie, or which library I could use? I am looking for something like Python's "fuzzywuzzy" for R.


Solution

  • Simple solution:

    streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße")
    streets <- tolower(streets)  # lowercase everything so matching ignores case
    names <- c("Berber", "Weg")
    names <- tolower(names)
    
    # For each name (columns), test whether it occurs in each street (rows)
    sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))
    
    #                   berber   weg
    #berberichweg         TRUE  TRUE
    #otto-klemperer-weg  FALSE  TRUE
    #feldmeierbogen      FALSE FALSE
    #altostraße          FALSE FALSE
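
  • A possible extension (a sketch on toy data, not tested on the original 6400 streets): grepl is already vectorized over the strings being searched, so the inner sapply can be dropped. The sketch also removes NA and empty entries from the name vector first. This is only a guess at why the combined grep pattern matched every street: an empty string in d_vector would put an empty alternative into the "a|b|..." pattern, and that can match anything. The variable names persons, hits and matched_streets, and the toy values, are assumptions for illustration; fixed = TRUE is used so that names are treated as literal text rather than as regular expressions.

```r
# Toy data standing in for muc$name and d_vector (hypothetical values).
streets <- c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße")
persons <- c("Klemperer", "Feldmeier", "", NA)

# Drop NA and empty (or whitespace-only) entries before matching.
persons <- persons[!is.na(persons) & nzchar(trimws(persons))]

# One logical column per name, one row per street.
# fixed = TRUE matches each name literally, with no regex interpretation.
hits <- sapply(tolower(persons), grepl, x = tolower(streets), fixed = TRUE)

# Force a matrix shape and readable dimnames, even if sapply returned a vector.
hits <- matrix(hits, nrow = length(streets), dimnames = list(streets, persons))

# Streets that contain at least one of the names.
matched_streets <- streets[rowSums(hits) > 0]
print(matched_streets)
```

    With the real data, streets would be muc$name and persons would be d_vector. Very short names will still match many streets, so it may be worth inspecting colSums(hits) to see which names produce the most hits.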