Search code examples
pythonsqlrpivot-tablegrouping

Joining/grouping in R


I have 2 datasets like this: Fruits

ID Apples Oranges Pears
1 0 1 1
2 1 0 0
3 1 1 0
4 0 0 1
5 1 0 0

This dataset represents if a person with that ID has that fruit(1) or not(0). Here ID is the primary key.

Another dataset is Juice. This table represents juice made by that ID on the given date. There are no duplicates in this dataset.

ID Dates
1 8/12/2021
1 6/9/2020
2 7/14/2020
2 3/6/2021
2 5/2/2020
3 8/31/2021
5 9/21/2020

My desired output would be to know which fruit was used how many times. If an Id has more than 1 fruit, consider he used both the fruits to make the juice.

Let's follow column-wise- Apples- ID 2, ID 3 and ID 5 has apples. ID 2 made juice 3 times, ID 3 made juice 1 time and ID 3 made juice 1 time, so apple was used 5 times(3+1+1). Similarly, ID 1 and ID 3 has oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so orange was used 3 times(2+1). ID 1 made juice 2 times, and ID 4 made juice 0 times, so pear was used 2 times.

Fruit Count
Apples 5
Oranges 3
Pears 2

I want this in R, Python or SQL, though I think R has the best functions to approach this problem. I am not really sure how to approach this as there are two tables involved. Any help would be really appreciated.


Solution

  • R: base

    tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
    data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
    #     Fruit Count
    # 1  Apples     5
    # 2 Oranges     3
    # 3   Pears     2
    

    R: dplyr

    library(dplyr)
    library(tidyr) # pivot_longer
    Fruits %>%
      pivot_longer(-ID, names_to = "Fruit") %>%
      right_join(Juice, by = "ID") %>%
      filter(value > 0) %>%
      count(Fruit)
    # # A tibble: 3 x 2
    #   Fruit       n
    #   <chr>   <int>
    # 1 Apples      5
    # 2 Oranges     3
    # 3 Pears       2
    

    R: data.table

    library(data.table)
    JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
    FruitsDT <- as.data.table(Fruits)
    melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
                 ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
         measure.vars = patterns("."),
         variable.name = "Fruit", value.name = "Count")
    #      Fruit Count
    #     <fctr> <int>
    # 1:  Apples     5
    # 2: Oranges     3
    # 3:   Pears     2
    

    Alternative, more aligned with the dplyr solution above:

    melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
      )[JuiceDT, on = .(ID)
      ][, .(Count = sum(value)), by = Fruit]
    #      Fruit Count
    #     <fctr> <int>
    # 1:  Apples     5
    # 2: Oranges     3
    # 3:   Pears     2
    

    SQL (via R's sqldf)

    sqldf::sqldf(
      "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
       select 'Apples' as Fruit, sum(Apples) as Count from cte
       union all
       select 'Oranges' as Fruit, sum(Oranges) as Count from cte
       union all
       select 'Pears' as Fruit, sum(Pears) as Count from cte
    ")
    #     Fruit Count
    # 1  Apples     5
    # 2 Oranges     3
    # 3   Pears     2
    

    This instance is using the SQLite engine, which does not support PIVOT. There are other sqldf engines that may support it, and doing "raw SQL" to other DBMSes should allow one to pivot more naturally within its dialect.


    R data

    Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
    Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))