Joining/grouping in R

I have 2 datasets like this: Fruits

ID	Apples	Oranges	Pears
1	0	1	1
2	1	0	0
3	1	1	0
4	0	0	1
5	1	0	0

This dataset represents if a person with that ID has that fruit(1) or not(0). Here ID is the primary key.

Another dataset is Juice. This table represents juice made by that ID on the given date. There are no duplicates in this dataset.

ID	Dates
1	8/12/2021
1	6/9/2020
2	7/14/2020
2	3/6/2021
2	5/2/2020
3	8/31/2021
5	9/21/2020

My desired output would be to know which fruit was used how many times. If an Id has more than 1 fruit, consider he used both the fruits to make the juice.

Let's follow column-wise- Apples- ID 2, ID 3 and ID 5 has apples. ID 2 made juice 3 times, ID 3 made juice 1 time and ID 3 made juice 1 time, so apple was used 5 times(3+1+1). Similarly, ID 1 and ID 3 has oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so orange was used 3 times(2+1). ID 1 made juice 2 times, and ID 4 made juice 0 times, so pear was used 2 times.

Fruit	Count
Apples	5
Oranges	3
Pears	2

I want this in R, Python or SQL, though I think R has the best functions to approach this problem. I am not really sure how to approach this as there are two tables involved. Any help would be really appreciated.

Solution

R: base

tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R: dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R: data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

Alternative, more aligned with the dplyr solution above:

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL (via R's `sqldf`)

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

This instance is using the SQLite engine, which does not support PIVOT. There are other sqldf engines that may support it, and doing "raw SQL" to other DBMSes should allow one to pivot more naturally within its dialect.

R data

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))

Joining/grouping in R

R: base

R: dplyr

R: data.table

SQL (via R's sqldf)

SQL (via R's `sqldf`)