r data-manipulation data-cleaning data-processing

How to Transform Data to Find Index with Same Value

I intend to find customers who have bought exactly the same products, This is the output that I prefer. I have also provided the dput code below.

The data I have is customers' behaviors--what they have bought. This will be my input (only two columns)

The example that I provided is a simplified version of my data. Customers will usually buy 10 to 20 products. There are around 50 products that consumers could choose to buy.

I am really confused what is an easy way to transform my data into the output that I prefer. Could you please give me any advice? Thanks

Input:

structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper", 
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper", 
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID", 
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer", 
"collector")), Products = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Output:

structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought", 
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character", 
"collector")), Customer_ID = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("Products Bought", "Customer_ID")), 
    default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec"))

Solution

I am suspicious that you may want to look at structuring your data in a way that is more usable. In any case, the tidyverse can be a helpful way of thinking through your task.

As mentioned, posting code for others to start with can save them time and get you an answer faster.

library(dplyr)
library(stringr)
library(tidyr)

d <- data_frame(id=c(1,2,3,4,5,6)
     , bought=c('Apple, Beer, Diaper','Apple, Beer', 'Apple, Beer, Diaper, Diaper'
               , 'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer'))

d %>% 
## Unnest the values & take care of white space
## - This is the better data structure to have, anyways
mutate(buy=str_split(bought,',')) %>% 
unnest(buy) %>% mutate(buy=str_trim(buy)) %>% select(-bought) %>%

## Get distinct (and sort?)
distinct(id, buy) %>% arrange(id, buy) %>%

## Aggregate by id
group_by(id) %>% summarize(bought=paste(buy,collapse=', ')) %>% ungroup %>%

## Count
group_by(bought) %>% summarize(ids=paste(id,collapse=',')) %>% ungroup

EDIT: referencing this SO post for getting distinct combinations faster / cleaner in dplyr