I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities / regions that appear in two columns as in the following example.
A <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main Frankfurt",
"Hannover")
B <- c("Berlin Brandenburg",
"Hamburg",
"Munich Bayern",
"Region Stuttgart",
"Main Rhein Darmstadt",
"Wiesbaden")
The resulting column / data frame should look like this:
result <- c("Berlin",
"Hamburg",
"Munich",
"Stuttgart",
"Rhein Main",
NA
)
df <- data.frame(A, B, result)
...while it doesn't matter whether it's "Rhein Main" or "Main Rhein".
Thank you for your help!
Maybe I am missing a smart regex trick but one option would be to split strings into words and find the common words using intersect
.
df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "),
strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
df
# A B Result
#1 Berlin Berlin Brandenburg Berlin
#2 Hamburg Hamburg Hamburg
#3 Munich Munich Bayern Munich
#4 Stuttgart Region Stuttgart Stuttgart
#5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
#6 Hannover Wiesbaden
This returns empty string when there is no match. You can turn the empty string to NA
if needed.