I have two datasets that have two address columns. I want to merge the two datasets by the common address. But some of the addresses are intersections and the order of the street names is different in each dataset. Is there a way of asking Rstudio to compare the strings word by word and if there are more than two words matching, let me know? An example is:
"CABOT ST AT RUGGLES ST" and "RUGGLES ST AT CABOT ST"
This code reorders the words in the addresses in alphabetical order, so then you can test if two addresses are the same
library(stringr)
df = data.frame(address = c("CABOT ST AT RUGGLES ST", "RUGGLES ST AT CABOT ST"))
# split the address into words
list_split <- str_split(df$address,' ')
#[[1]]
#[1] "CABOT" "ST" "AT" "RUGGLES" "ST"
#[[2]]
#[1] "RUGGLES" "ST" "AT" "CABOT" "ST"
# sort the words
list_sort <- map(list_split, sort)
#[[1]]
#[1] "AT" "CABOT" "RUGGLES" "ST" "ST"
#[[2]]
#[1] "AT" "CABOT" "RUGGLES" "ST" "ST"
# paste all the words reordered together
list_pasted <- map(list_sort,function(x) paste(x,collapse= " "))
#[[1]]
#[1] "AT CABOT RUGGLES ST ST"
#[[2]]
# [1] "AT CABOT RUGGLES ST ST"
# unlist to convert to vector and assign to a new column
df$address_sorted <- unlist(list_pasted)
# address address_sorted
#1 CABOT ST AT RUGGLES ST AT CABOT RUGGLES ST ST
#2 RUGGLES ST AT CABOT ST AT CABOT RUGGLES ST ST
If you have two address columns you can do the same for the other column and compare them together