Search code examples
rjoingroup-bydplyrsummarize

Merge two tables based on 2 conditions and output the average as result column


I have the following two tables:

Table_1
    ID  Interval
    1   10
    1   11
    2   11

and

Table_2
   ID   Interval    Rating
    1   10          0.5
    1   10          0.3
    1   11          0.1
    2   11          0.1
    2   11          0.2

The output table should look like this:

ID  Interval    Mean Ratings
1   10          0.4
1   11          0.1
2   11          0.15

My goal is to join both tables based on the two conditions/columns ID and interval. Given that I have several ratings for the same ID and interval, I want to compute the mean of the ratings. Whereas the IDs are unique (~9500), the interval repeats for different IDs (as seen in the table above). My current approach is the join function with 2 by arguments. How can I create a final table in which Table_1 and Table_2 are joined based on the condition ID and interval, and receive the average rating in the result column?

left_join(Table_1, Table_2, by = c("ID" = "ID", "Interval" = "Interval"))

Solution

  • First of all you would need to summarize second table DT2 and then perform a right join with first table DT1.

    library(data.table)
    
    DT1[DT2[, .(Mean_Rating = mean(Rating)), .(ID, Interval)], on = c(ID = "ID", Interval = "Interval")]
    

    which gives

       ID Interval Mean_Rating
    1:  1       10        0.40
    2:  1       11        0.10
    3:  2       11        0.15
    


    Sample data:

    DT1 <- structure(list(ID = c(1L, 1L, 2L), Interval = c(10L, 11L, 11L
    )), .Names = c("ID", "Interval"), class = c("data.table", "data.frame"
    ), row.names = c(NA, -3L))
    
    DT2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Interval = c(10L, 
    10L, 11L, 11L, 11L), Rating = c(0.5, 0.3, 0.1, 0.1, 0.2)), .Names = c("ID", 
    "Interval", "Rating"), class = c("data.table", "data.frame"), row.names = c(NA, 
    -5L))