Search code examples
rtransformationtransactional

R Data Transformation


I have a data frame with three columns that captures transactional data, including CustomerName, OrderDate and name of the Product(s) that have been purchased. I have to transform the dataframe into another dataframe with a format such that all the items purchased by a customer on a single date are in one row.

As I'm dealing with a large data set, is there an efficient way to this transformation, hopefully without using for loops.

Additionally, the number of columns for the products in the data frame must be equal to the maximum number of products bought by any customer on any single day. Please find an example of data frame before and after the transformation

Original Data:

data <- data.frame(Customer  = c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally"),
                   OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
                   Product   = c("Milk", "Eggs", "Bread", "Chicken", "Pizza", "Beer", "Salad", "Apples", "Eggs", "Wine"),
                   stringsAsFactors = FALSE)

#    Customer OrderDate Product
# 1      John     1-Oct    Milk
# 2      John     2-Oct    Eggs
# 3      John     2-Oct   Bread
# 4       Tom     2-Oct Chicken
# 5       Tom     2-Oct   Pizza
# 6       Tom     2-Oct    Beer
# 7     Sally     3-Oct   Salad
# 8     Sally     3-Oct  Apples
# 9     Sally     3-Oct    Eggs
# 10    Sally     3-Oct    Wine

Post-tranformation:

datatransform <- as.data.frame(matrix(NA, nrow = 4, ncol = 6))
colnames(datatransform) <- c("Customer", "OrderDate", "Product1", "Product2", "Product3", "Product4")
datatransform$Customer <- c("John", "John", "Tom", "Sally")
datatransform$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "3-Oct")
datatransform[1, 3:6] <- c("Milk", "", "", "") 
datatransform[2, 3:6 ] <- c("Eggs", "Bread", "", "")
datatransform[3, 3:6 ] <- c("Chicken", "Pizza", "Beer", "")
datatransform[4, 3:6 ] <- c("Salad", "Apples", "Eggs", "Wine")

#   Customer OrderDate Product1 Product2 Product3 Product4
# 1     John     1-Oct     Milk                           
# 2     John     2-Oct     Eggs    Bread                  
# 3      Tom     2-Oct  Chicken    Pizza     Beer         
# 4    Sally     3-Oct    Salad   Apples     Eggs     Wine

Additionally, the number of columns for the products in the data frame must be equal to the maximum number of products bought by any customer on any single day.


Solution

  • Since you talked about large dataset (then efficiency is a very important issue to toke into consideration), here is a dplyr and reshape2 solution:

    library(reshape2)
    library(dplyr)
    
    data  %>% group_by(Customer, OrderDate) %>%
              mutate(ProductValue = paste0("Product", 1:n()) ) %>%
              dcast(Customer + OrderDate ~ ProductValue, value.var = "Product"  ) %>%
              arrange(OrderDate)
    
      Customer OrderDate Product1 Product2 Product3 Product4
    1     John     1-Oct     Milk     <NA>     <NA>     <NA>
    2     John     2-Oct     Eggs    Bread     <NA>     <NA>
    3      Tom     2-Oct  Chicken    Pizza     Beer     <NA>
    4    Sally     3-Oct    Salad   Apples     Eggs     Wine