I have 2 datasets like this: Fruits
ID | Apples | Oranges | Pears |
---|---|---|---|
1 | 0 | 1 | 1 |
2 | 1 | 0 | 0 |
3 | 1 | 1 | 0 |
4 | 0 | 0 | 1 |
5 | 1 | 0 | 0 |
This dataset represents if a person with that ID has that fruit(1) or not(0). Here ID is the primary key.
Another dataset is Juice. This table represents juice made by that ID on the given date. There are no duplicates in this dataset.
ID | Dates |
---|---|
1 | 8/12/2021 |
1 | 6/9/2020 |
2 | 7/14/2020 |
2 | 3/6/2021 |
2 | 5/2/2020 |
3 | 8/31/2021 |
5 | 9/21/2020 |
My desired output would be to know which fruit was used how many times. If an Id has more than 1 fruit, consider he used both the fruits to make the juice.
Let's follow column-wise- Apples- ID 2, ID 3 and ID 5 has apples. ID 2 made juice 3 times, ID 3 made juice 1 time and ID 3 made juice 1 time, so apple was used 5 times(3+1+1). Similarly, ID 1 and ID 3 has oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so orange was used 3 times(2+1). ID 1 made juice 2 times, and ID 4 made juice 0 times, so pear was used 2 times.
Fruit | Count |
---|---|
Apples | 5 |
Oranges | 3 |
Pears | 2 |
I want this in R, Python or SQL, though I think R has the best functions to approach this problem. I am not really sure how to approach this as there are two tables involved. Any help would be really appreciated.
tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
# Fruit Count
# 1 Apples 5
# 2 Oranges 3
# 3 Pears 2
library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
pivot_longer(-ID, names_to = "Fruit") %>%
right_join(Juice, by = "ID") %>%
filter(value > 0) %>%
count(Fruit)
# # A tibble: 3 x 2
# Fruit n
# <chr> <int>
# 1 Apples 5
# 2 Oranges 3
# 3 Pears 2
library(data.table)
JuiceDT <- as.data.table(Juice) # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
measure.vars = patterns("."),
variable.name = "Fruit", value.name = "Count")
# Fruit Count
# <fctr> <int>
# 1: Apples 5
# 2: Oranges 3
# 3: Pears 2
Alternative, more aligned with the dplyr solution above:
melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
)[JuiceDT, on = .(ID)
][, .(Count = sum(value)), by = Fruit]
# Fruit Count
# <fctr> <int>
# 1: Apples 5
# 2: Oranges 3
# 3: Pears 2
sqldf
)sqldf::sqldf(
"with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
select 'Apples' as Fruit, sum(Apples) as Count from cte
union all
select 'Oranges' as Fruit, sum(Oranges) as Count from cte
union all
select 'Pears' as Fruit, sum(Pears) as Count from cte
")
# Fruit Count
# 1 Apples 5
# 2 Oranges 3
# 3 Pears 2
This instance is using the SQLite engine, which does not support PIVOT
. There are other sqldf
engines that may support it, and doing "raw SQL" to other DBMSes should allow one to pivot more naturally within its dialect.
R data
Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))