I have a data frame with three columns that captures transactional data, including CustomerName, OrderDate and name of the Product(s) that have been purchased. I have to transform the dataframe into another dataframe with a format such that all the items purchased by a customer on a single date are in one row.
As I'm dealing with a large data set, is there an efficient way to this transformation, hopefully without using for loops.
Additionally, the number of columns for the products in the data frame must be equal to the maximum number of products bought by any customer on any single day. Please find an example of data frame before and after the transformation
Original Data:
data <- data.frame(Customer = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
Product = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
stringsAsFactors = FALSE)
# Customer OrderDate Product
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs
# 3 John 2-Oct Bread
# 4 Tom 2-Oct Chicken
# 5 Tom 2-Oct Pizza
# 6 Tom 2-Oct Beer
# 7 Sally 3-Oct Salad
# 8 Sally 3-Oct Apples
# 9 Sally 3-Oct Eggs
# 10 Sally 3-Oct Wine
Post-tranformation:
datatransform <- as.data.frame(matrix(NA, nrow = 4, ncol = 6))
colnames(datatransform) <- c("Customer", "OrderDate", "Product1", "Product2", "Product3", "Product4")
datatransform$Customer <- c("John", "John", "Tom", "Sally")
datatransform$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "3-Oct")
datatransform[1, 3:6] <- c("Milk", "", "", "")
datatransform[2, 3:6 ] <- c("Eggs", "Bread", "", "")
datatransform[3, 3:6 ] <- c("Chicken", "Pizza", "Beer", "")
datatransform[4, 3:6 ] <- c("Salad", "Apples", "Eggs", "Wine")
# Customer OrderDate Product1 Product2 Product3 Product4
# 1 John 1-Oct Milk
# 2 John 2-Oct Eggs Bread
# 3 Tom 2-Oct Chicken Pizza Beer
# 4 Sally 3-Oct Salad Apples Eggs Wine
Additionally, the number of columns for the products in the data frame must be equal to the maximum number of products bought by any customer on any single day.
Since you talked about large dataset (then efficiency is a very important issue to toke into consideration), here is a dplyr and reshape2 solution:
library(reshape2)
library(dplyr)
data %>% group_by(Customer, OrderDate) %>%
mutate(ProductValue = paste0("Product", 1:n()) ) %>%
dcast(Customer + OrderDate ~ ProductValue, value.var = "Product" ) %>%
arrange(OrderDate)
Customer OrderDate Product1 Product2 Product3 Product4
1 John 1-Oct Milk <NA> <NA> <NA>
2 John 2-Oct Eggs Bread <NA> <NA>
3 Tom 2-Oct Chicken Pizza Beer <NA>
4 Sally 3-Oct Salad Apples Eggs Wine