Search code examples
rfuzzyjoin

Merge two data frames in R by variable that is regular expression in one and string in other


I have two data frames I would like to merge

a<- data.frame(x=c(1,4,6,8,1,6,7,2),ID=c("132","14.","732","2..","132","14.","732","2.."),year=c(1,1,1,1,2,2,2,2))

b<- data.frame(y=c(2,7,5,5,1,1,2,3),ID=c("132","144","732","290","132","144","732","290"),year=c(1,1,1,1,2,2,2,2))

The ID Variable by which I would like to merge the two data frames is not completely known in data set a. I also want to merge by year. They are known up to a fully identifying regular expression. Note, there is a one-to-one match. In this example you would not find an ID "1.." in the data set so there are no ambiguous matches.

I would like to get something like this:

output<-data.frame(y=c(2,7,5,5,1,1,2,3),x=c(1,4,6,8,1,6,7,2),ID=c("132","144","732","290","132","144","732","290"), year=c(1,1,1,1,2,2,2,2))

I tried to remove the regular expression part with substr and then use starts_with in the merge but it does not work.

I get the following error message

Coercing pattern to a plain character vector

when I do this:

df_complete <- regex_inner_join(b,a, by=c("ID","year"))

Thank you stack overflow...


Solution

  • Answer from Comment by @jblood94

    With a and b as data.tables: a[, regex_inner_join(b[year == .BY], .SD, by = "ID"), year] – jblood94