Search code examples
rdplyrinner-jointidyversefuzzyjoin

inner_join() with range of values for one of the keys (year)


I have two datasets that are formatted like this:

df1
#>           Artist          Album Year
#> 1        Beatles  Sgt. Pepper's 1967
#> 2 Rolling Stones Sticky Fingers 1971

and

df2
#>            Album Year      Producer
#> 1  Sgt. Pepper's 1966 George Martin
#> 2 Sticky Fingers 1971  Jimmy Miller

I'd like to do an inner_join by Album and Year, but sometimes the 'Year' field is off by one year: for example, Sgt. Peppers is listed as 1967 in df1 by 1966 in df2.

So if I run:

df3 <- inner_join(df1, df2, by = c("Album", "Year"))

I get:

df3
#>           Artist          Album Year     Producer
#> 1 Rolling Stones Sticky Fingers 1971 Jimmy Miller

Whereas, I would like both albums to join, so long as something like (df1$Year == df2$Year + 1)|(df1$Year == df2$Year - 1).

I cannot just simple join by 'Album' because in my real data set there are some identically titled 'Albums' that are distinguished by 'Year'.

Code for the datasets below:

df1 <- data.frame(stringsAsFactors=FALSE,
      Artist = c("Beatles", "Rolling Stones"),
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1967, 1971)
)
df1

df2 <- data.frame(stringsAsFactors=FALSE,
       Album = c("Sgt. Pepper's", "Sticky Fingers"),
        Year = c(1966, 1971),
    Producer = c("George Martin", "Jimmy Miller")
)
df2

Solution

  • We could try using the sqldf package here, since your requirement can easily be phrased using a SQL join:

    library(sqldf)
    
    sql <- "SELECT t1.Artist, t1.Album, t1.Year, t2.Album, t2.Year, t2.Producer
            FROM df1 t1
            INNER JOIN df2 t2
                ON ABS(t1.Year - t2.Year) <= 1"
    df3 <- sqldf(sql)
    

    If you want to select all fields from both tables, then use:

    SELECT t1.*, t2.* FROM ...
    

    But note that in general SELECT * is frowned upon, and it is desirable to always list out the columns to select.