Search code examples
rcategorical-datadata-wranglingdata-preprocessing

How to create a new data frame with grouped transactions in R?


I am trying to create a new data frame in R using an existing data frame of items bought in transactions as shown below:

dput output for the data:

structure(list(Transaction = c(1L, 2L, 2L, 3L, 3L, 3L), Item = c("Bread", 
"Scandinavian", "Scandinavian", "Hot chocolate", "Jam", "Cookies"
), date_time = c("30/10/2016 09:58", "30/10/2016 10:05", "30/10/2016 10:05", 
"30/10/2016 10:07", "30/10/2016 10:07", "30/10/2016 10:07"), 
    period_day = c("morning", "morning", "morning", "morning", 
    "morning", "morning"), weekday_weekend = c("weekend", "weekend", 
    "weekend", "weekend", "weekend", "weekend"), Year = c("2016", 
    "2016", "2016", "2016", "2016", "2016"), Month = c("October", 
    "October", "October", "October", "October", "October")), row.names = c(NA, 
6L), class = "data.frame")

As you can see in the example, the rows are due to each individual product bought, not the transactions themselves (hence why Transaction 2 is both rows 2 and 3).

I would like to make a new table where the rows are the different transactions (1, 2, 3, etc.) and the different columns are categorical (Bread = 0, 1) so I can perform apriori analysis.

Any idea how I can group the different transactions together and then create these new columns?


Solution

  • Assuming your dataframe is called df you can use tidyr's pivot_wider :

    df1 <- tidyr::pivot_wider(df, names_from = Item, values_from = Item, 
                              values_fn = n_distinct, values_fill = 0)
    
    df1
    
    #  Transaction date_time      period_day weekday_weekend Year  Month  Bread Scandinavian `Hot chocolate`   Jam Cookies
    #        <int> <chr>          <chr>      <chr>           <chr> <chr>  <int>        <int>           <int> <int>   <int>
    #1           1 30/10/2016 09… morning    weekend         2016  Octob…     1            0               0     0       0
    #2           2 30/10/2016 10… morning    weekend         2016  Octob…     0            1               0     0       0
    #3           3 30/10/2016 10… morning    weekend         2016  Octob…     0            0               1     1       1
    

    Or with data.table's dcast :

    library(data.table)
    dcast(setDT(df), Transaction+date_time+period_day + weekday_weekend + 
          Year + Month ~ Item, value.var = 'Item', fun.aggregate = uniqueN)