Search code examples
rdata-manipulationdata-cleaningdata-processing

How to Transform Data to Find Index with Same Value


I intend to find customers who have bought exactly the same products, This is the output that I prefer. I have also provided the dput code below.

The data I have is customers' behaviors--what they have bought. This will be my input (only two columns)

The example that I provided is a simplified version of my data. Customers will usually buy 10 to 20 products. There are around 50 products that consumers could choose to buy.

I am really confused what is an easy way to transform my data into the output that I prefer. Could you please give me any advice? Thanks

Input:

structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
"collector")), Products = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Output:

structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought", 
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character", 
"collector")), Customer_ID = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Products Bought", "Customer_ID")), 
    default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

Solution

  • I am suspicious that you may want to look at structuring your data in a way that is more usable. In any case, the tidyverse can be a helpful way of thinking through your task.

    As mentioned, posting code for others to start with can save them time and get you an answer faster.

    library(dplyr)
    library(stringr)
    library(tidyr)
    
    d <- data_frame(id=c(1,2,3,4,5,6)
         , bought=c('Apple, Beer, Diaper','Apple, Beer', 'Apple, Beer, Diaper, Diaper'
                   , 'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer'))
    
    d %>% 
    ## Unnest the values & take care of white space
    ## - This is the better data structure to have, anyways
    mutate(buy=str_split(bought,',')) %>% 
    unnest(buy) %>% mutate(buy=str_trim(buy)) %>% select(-bought) %>%
    
    ## Get distinct (and sort?)
    distinct(id, buy) %>% arrange(id, buy) %>%
    
    ## Aggregate by id
    group_by(id) %>% summarize(bought=paste(buy,collapse=', ')) %>% ungroup %>%
    
    ## Count
    group_by(bought) %>% summarize(ids=paste(id,collapse=',')) %>% ungroup
    

    EDIT: referencing this SO post for getting distinct combinations faster / cleaner in dplyr