I have two datasets and I want to join the two datasets and apply a summarize command at the same time.
Example data:
Data 1: We observe three products (id) at three points in time (obs_id), together with the number of reviews of each product at that point (n_join).
product_data = data.frame(id = c(rep("product1", 3), rep("product2", 3), rep("product3", 3)),
                          obs_id = rep(c(1, 2, 3), 3),
                          n_join = c(1, 3, 5, 1, 1, 2, 1, 2, 4))
product_data
id obs_id n_join
1 product1 1 1
2 product1 2 3
3 product1 3 5
4 product2 1 1
5 product2 2 1
6 product2 3 2
7 product3 1 1
8 product3 2 2
9 product3 3 4
Data 2: We observe the same three products (id); each row reflects one review (review_id), together with a binary indicator (purchase_intention, 1 = yes) of whether the review includes a purchase intention.
review_data = data.frame(id = c(rep("product1", 5), rep("product2", 2), rep("product3", 4)),
                         review_id = c(1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4),
                         purchase_intention = c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1))
review_data
id review_id purchase_intention
1 product1 1 1
2 product1 2 1
3 product1 3 1
4 product1 4 0
5 product1 5 1
6 product2 1 0
7 product2 2 1
8 product3 1 0
9 product3 2 0
10 product3 3 1
11 product3 4 1
Now I want to join the review data to the product data in the following way: I want to create a new column in product_data that, for each observation, counts how many of the first n_join reviews of that product include a purchase intention.
The expected result therefore looks like this (note that the data needs to be grouped by id):
final_data = data.frame(id = c(rep("product1", 3), rep("product2", 3), rep("product3", 3)),
                        obs_id = rep(c(1, 2, 3), 3),
                        n_join = c(1, 3, 5, 1, 1, 2, 1, 2, 4),
                        sum_purchase_intentions = c(1, 3, 4, 0, 0, 1, 0, 0, 2))
final_data
id obs_id n_join sum_purchase_intentions
1 product1 1 1 1
2 product1 2 3 3
3 product1 3 5 4
4 product2 1 1 0
5 product2 2 1 0
6 product2 3 2 1
7 product3 1 1 0
8 product3 2 2 0
9 product3 3 4 2
I thought this should be doable with a combination of the inner_join and summarize commands of the dplyr package, but I was not successful in combining them. Can anyone help out?
library(dplyr)

product_data %>%
  left_join(review_data, by = "id") %>%
  # keep only the reviews that already exist at each observation (review_id <= n_join)
  filter(n_join >= review_id) %>%
  # weighted count = sum of purchase_intention within each id/obs_id/n_join group
  count(id, obs_id, n_join, wt = purchase_intention)
# id obs_id n_join n
#1 product1 1 1 1
#2 product1 2 3 3
#3 product1 3 5 4
#4 product2 1 1 0
#5 product2 2 1 0
#6 product2 3 2 1
#7 product3 1 1 0
#8 product3 2 2 0
#9 product3 3 4 2
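Since the question asks about combining a join with summarize: count(..., wt = ) above is just shorthand for a grouped sum, so the same pipeline can be spelled out with the explicit group_by() + summarise() pattern. This is a sketch that also renames the result column to sum_purchase_intentions to match the expected output (the .groups argument needs dplyr >= 1.0.0; on older versions, drop it and ungroup() instead).

library(dplyr)

product_data %>%
  left_join(review_data, by = "id") %>%
  filter(n_join >= review_id) %>%   # same windowing as above
  group_by(id, obs_id, n_join) %>%
  summarise(sum_purchase_intentions = sum(purchase_intention),
            .groups = "drop")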
This is an example of a "non-equi join", where you want to join each row in the original table with a varying window of rows from the second table. dplyr does not currently offer that kind of join, but for small data it is often workable to join on all the potentially relevant rows and then filter out the ones outside the window.
To do this more directly, fuzzyjoin, sqldf, and data.table offer non-equi joins. For example, this code with fuzzyjoin will be more efficient for large data sets, since it only joins the rows of review_data that are relevant to the given id and n_join number. This reduces the tendency for the intermediate data set to explode in size when there are many reviews for each id.
library(fuzzyjoin)

product_data %>%
  fuzzy_left_join(review_data,
                  by = c("id" = "id", "n_join" = "review_id"),
                  # exact match on id; keep reviews with review_id <= n_join
                  match_fun = list(`==`, `>=`)) %>%
  count(id = id.x, obs_id, n_join, wt = purchase_intention)
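For completeness, here is a minimal data.table sketch of the same non-equi join (assuming data.table >= 1.9.8, which added non-equi conditions in on=); the intermediate names pd, rd and counts are just illustrative. sqldf would express the same window as a LEFT JOIN on r.id = p.id AND r.review_id <= p.n_join followed by a GROUP BY.

library(data.table)

pd <- as.data.table(product_data)
rd <- as.data.table(review_data)

# Non-equi join: for each row of pd, take the reviews of the same id whose
# review_id is <= that row's n_join, and sum purchase_intention per pd row (.EACHI).
counts <- rd[pd,
             on = .(id, review_id <= n_join),
             sum(purchase_intention, na.rm = TRUE),
             by = .EACHI]

# counts has one row per row of pd, in the same order; V1 holds the sums
# (na.rm = TRUE turns "no matching review" into 0 rather than NA).
pd[, sum_purchase_intentions := counts$V1]
pd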