I have multiple columns of addresses, where they may contain duplicated information (but generally will not have exactly duplicated information).
The following code will provide an example of my issue,
id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2)
df$combined = paste(add1, add2)
df
This gives the following result,
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
The conclusion I need is the following,
id add1 add2 combined
1 1 21ST AVE 21ST AVE BLAH ST 21ST AVE BLAH ST
2 2 5TH ST EAST BLAH BLVD 5TH ST EAST BLAH BLVD
I wish to identify if what's in add1
is contained in add2
. If I find that add2
contains the same information that add1
provides, then I either want to avoid combining those particular column values or delete the repeated information in the combined column (which I believe would require solving a different issue of repeated phrases in a string). I have not been able to find an example of finding column values that are 'contained in' rather than 'exact' - and I'm working with over 500K cases in a dataset where this issue is a common occurrence. Any help is appreciated.
We split
the second and third column by one or more space (\\s+
), then paste
the union
of the corresponding rows with mapply
to create the 'combined'
lst <- lapply(df[2:3], function(x) strsplit(as.character(x), "\\s+"))
df$combined <- mapply(function(x,y) paste(union(x, y), collapse=" "), lst$add1, lst$add2)
df$combined
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"
Or another option is gsub
gsub("((\\w+\\s*){2,})\\1", "\\1", do.call(paste, df[2:3]))
#[1] "21ST AVE BLAH ST" "5TH ST EAST BLAH BLVD"