I intend to find customers who have bought exactly the same products,
The data I have is customers' behaviors--what they have bought.
The example that I provided is a simplified version of my data. Customers will usually buy 10 to 20 products. There are around 50 products that consumers could choose to buy.
I am really confused what is an easy way to transform my data into the output that I prefer. Could you please give me any advice? Thanks
Input:
structure(list(Customer_ID = 1:6, Products = c("Apple, Beer, Diaper",
"Beer, Apple", "Beer, Apple, Diaper, Diaper", "Apple, Diaper",
"Diaper, Apple", "Apple, Diaper, Beer, Beer")), .Names = c("Customer_ID",
"Products"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), spec = structure(list(cols = structure(list(Customer_ID = structure(list(), class = c("collector_integer",
"collector")), Products = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Customer_ID", "Products")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Output:
structure(list(`Products Bought` = c("Apple, Beer, Diaper", "Apple, Diaper"
), Customer_ID = c("1, 3, 6", "4, 5")), .Names = c("Products Bought",
"Customer_ID"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L), spec = structure(list(cols = structure(list(`Products Bought` = structure(list(), class = c("collector_character",
"collector")), Customer_ID = structure(list(), class = c("collector_character",
"collector"))), .Names = c("Products Bought", "Customer_ID")),
default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
I am suspicious that you may want to look at structuring your data in a way that is more usable. In any case, the tidyverse
can be a helpful way of thinking through your task.
As mentioned, posting code for others to start with can save them time and get you an answer faster.
library(dplyr)
library(stringr)
library(tidyr)
d <- data_frame(id=c(1,2,3,4,5,6)
, bought=c('Apple, Beer, Diaper','Apple, Beer', 'Apple, Beer, Diaper, Diaper'
, 'Apple, Diaper', 'Diaper, Apple', 'Apple, Diaper, Beer, Beer'))
d %>%
## Unnest the values & take care of white space
## - This is the better data structure to have, anyways
mutate(buy=str_split(bought,',')) %>%
unnest(buy) %>% mutate(buy=str_trim(buy)) %>% select(-bought) %>%
## Get distinct (and sort?)
distinct(id, buy) %>% arrange(id, buy) %>%
## Aggregate by id
group_by(id) %>% summarize(bought=paste(buy,collapse=', ')) %>% ungroup %>%
## Count
group_by(bought) %>% summarize(ids=paste(id,collapse=',')) %>% ungroup
EDIT: referencing this SO post for getting distinct combinations faster / cleaner in dplyr