I have two datasets that are formatted like this:
df1
#> Artist Album Year
#> 1 Beatles Sgt. Pepper's 1967
#> 2 Rolling Stones Sticky Fingers 1971
and
df2
#> Album Year Producer
#> 1 Sgt. Pepper's 1966 George Martin
#> 2 Sticky Fingers 1971 Jimmy Miller
I'd like to do an inner_join
by Album and Year, but sometimes the 'Year' field is off by one year: for example, Sgt. Peppers is listed as 1967 in df1 by 1966 in df2.
So if I run:
df3 <- inner_join(df1, df2, by = c("Album", "Year"))
I get:
df3
#> Artist Album Year Producer
#> 1 Rolling Stones Sticky Fingers 1971 Jimmy Miller
Whereas, I would like both albums to join, so long as something like (df1$Year == df2$Year + 1)|(df1$Year == df2$Year - 1)
.
I cannot just simple join by 'Album' because in my real data set there are some identically titled 'Albums' that are distinguished by 'Year'.
Code for the datasets below:
df1 <- data.frame(stringsAsFactors=FALSE,
Artist = c("Beatles", "Rolling Stones"),
Album = c("Sgt. Pepper's", "Sticky Fingers"),
Year = c(1967, 1971)
)
df1
df2 <- data.frame(stringsAsFactors=FALSE,
Album = c("Sgt. Pepper's", "Sticky Fingers"),
Year = c(1966, 1971),
Producer = c("George Martin", "Jimmy Miller")
)
df2
We could try using the sqldf
package here, since your requirement can easily be phrased using a SQL join:
library(sqldf)
sql <- "SELECT t1.Artist, t1.Album, t1.Year, t2.Album, t2.Year, t2.Producer
FROM df1 t1
INNER JOIN df2 t2
ON ABS(t1.Year - t2.Year) <= 1"
df3 <- sqldf(sql)
If you want to select all fields from both tables, then use:
SELECT t1.*, t2.* FROM ...
But note that in general SELECT *
is frowned upon, and it is desirable to always list out the columns to select.