Search code examples
rstringstring-matching

Rowwise extract common substrings from to columns in a data frame


I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities / regions that appear in two columns as in the following example.

A <- c("Berlin",
            "Hamburg",
            "Munich",
            "Stuttgart",
            "Rhein Main Frankfurt",
            "Hannover")

B <- c("Berlin Brandenburg",
             "Hamburg",
             "Munich Bayern",
             "Region Stuttgart",
             "Main Rhein Darmstadt",
             "Wiesbaden")

The resulting column / data frame should look like this:

result <- c("Berlin",
            "Hamburg",
            "Munich",
            "Stuttgart",
            "Rhein Main",
            NA
            )

df <- data.frame(A, B, result)

...while it doesn't matter whether it's "Rhein Main" or "Main Rhein".

Thank you for your help!


Solution

  • Maybe I am missing a smart regex trick but one option would be to split strings into words and find the common words using intersect.

    df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "), 
                        strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
    df
    
    #                     A                    B     Result
    #1               Berlin   Berlin Brandenburg     Berlin
    #2              Hamburg              Hamburg    Hamburg
    #3               Munich        Munich Bayern     Munich
    #4            Stuttgart     Region Stuttgart  Stuttgart
    #5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
    #6             Hannover            Wiesbaden           
    

    This returns empty string when there is no match. You can turn the empty string to NA if needed.