Search code examples
rdplyrinner-joinsummarize

Summarize while joining in R


I have two datasets and I want to join the two datasets and apply a summarize command at the same time.

Example data:

Data 1: We observe three products (id) at three points in time (obs_id) and the number of reviews on this product (n_join).

product_data = data.frame(id = c(rep("product1", 3), rep("product2", 3),rep("product3", 3)), obs_id = rep(c(1,2,3), 3), n_join = c(1,3,5,1,1,2,1,2,4))

product_data
        id obs_id n_join
1 product1      1      1
2 product1      2      3
3 product1      3      5
4 product2      1      1
5 product2      2      1
6 product2      3      2
7 product3      1      1
8 product3      2      2
9 product3      3      4

Data 2: We observe the same three products (id) and each row reflects one review (review_id) and for each review a binary indicator (1 = yes) if it includes a purchase intention (purchase_intention).

review_data = data.frame(id = c(rep("product1", 5), rep("product2", 2),rep("product3", 4)), 
                          review_id = c(1,2,3,4,5,1,2,1,2,3,4),
                          purchase_intention = c(1,1,1,0,1,0,1,0,0,1,1))
 
review_data
         id review_id purchase_intention
1  product1         1                  1
2  product1         2                  1
3  product1         3                  1
4  product1         4                  0
5  product1         5                  1
6  product2         1                  0
7  product2         2                  1
8  product3         1                  0
9  product3         2                  0
10 product3         3                  1
11 product3         4                  1

Now I want to join the review data to the product data in the following way: I want to create a new column in product_data that indicates the number of reviews including a purchase intention.

Examples:

  • To create the new "sum_purchase_intention" column for the first row of product_data, I need to take the value of the first row (indicated by n_join in product_data) of the purchase_intention column in review_data: 1.
  • To create the new "sum_purchase_intention" column for the second row of product_data, I need to take the value of the first and the second row (indicated by n_join in product_data) of the purchase_intention column in review_data: 1 + 1 + 1 = 3.
  • To create the new "sum_purchase_intention" column for the third row of product_data, I need to take the value of rows 1-5 (indicated by n_join in product_data) of the purchase_intention column in review_data: 1 + 1 + 1 + 0 + 1 = 4.

Thus, the expected result looks like this (note that the data needs to be grouped by id):

final_data = data.frame(id = c(rep("product1", 3), rep("product2", 3),rep("product3", 3)), obs_id = rep(c(1,2,3), 3), n_join = c(1,3,5,1,1,2,1,2,4),
                         sum_purchase_intentions = c(1,3,4,0,0,1,0,0,2))
final_data
        id obs_id n_join sum_purchase_intentions
1 product1      1      1                       1
2 product1      2      3                       3
3 product1      3      5                       4
4 product2      1      1                       0
5 product2      2      1                       0
6 product2      3      2                       1
7 product3      1      1                       0
8 product3      2      2                       0
9 product3      3      4                       2

I thought this should be doable with a combination of the inner_join and summarize commands of the dplyr package, but I was not succesful in combining them. Can anyone help out?


Solution

  • library(dplyr); library(tidyr)
    product_data %>%
      left_join(review_data, by = "id") %>%
      filter(n_join >= review_id) %>%
      count(id, obs_id, n_join, wt = purchase_intention)
    
    #        id obs_id n_join n
    #1 product1      1      1 1
    #2 product1      2      3 3
    #3 product1      3      5 4
    #4 product2      1      1 0
    #5 product2      2      1 0
    #6 product2      3      2 1
    #7 product3      1      1 0
    #8 product3      2      2 0
    #9 product3      3      4 2
    

    This is an example of a "non-equi join", where you want to join each row in the original table with a varying window of data from the second table. dplyr does not currently offer that function, but for small data it is often workable to do a join on all the potentially relevant data and then filter out the items outside the window.

    To do this more directly, fuzzyjoin, sqldf, and data.table offer non-equi joins. For example, this code with fuzzyjoin will be more efficient for large data sets, since it will only join review_data which is relevant to the given id and n_join number. This will reduce the tendency for the intermediate data set to explode in size when there are many reviews for each id.

    library(fuzzyjoin)
    product_data %>%
      fuzzy_left_join(review_data, 
                      by = c("id" = "id", "n_join" = "review_id"),
                      match_fun = list(`==`, `>=`)) %>%
      count(id = id.x, obs_id, n_join, wt = purchase_intention)