R arules: preparing a dataset for transactions


I prepared a data set to read as transactions with the arules package in R. However, one of my pre-processing steps causes a problem when I call itemFrequencyPlot: the highest-frequency item is " ". Does anyone have suggestions for resolving this?

Original data:

data <- as.data.frame(matrix(NA, nrow = 10, ncol = 3))
colnames(data) <- c("Customer", "OrderDate", "Product")
data$Customer <- c("John", "John", "John", "Tom", "Tom", "Tom", "Sally", "Sally", "Sally", "Sally")
data$OrderDate <- c("1-Oct", "2-Oct", "2-Oct", "2-Oct","2-Oct", "2-Oct", "3-Oct", "3-Oct", "3-Oct", "3-Oct")
data$Product <- c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk", "Bread", "Butter", "Eggs", "Wine")

I then apply the following transformation:

library(reshape2)
library(dplyr)

newdata <- data  %>% 
  group_by(Customer, OrderDate) %>%
  mutate(ProductValue = paste0("Product", 1:n()) ) %>%
  dcast(Customer + OrderDate ~ ProductValue, value.var = "Product") %>%
  arrange(OrderDate)

newdata[is.na(newdata)] <- " "
newdata <- newdata[ , 3:6]
newdata[sapply(newdata, is.character)] <- lapply(newdata[sapply(newdata, is.character)], as.factor) # convert character columns to factors

I then used write.table to create a CSV file without column names, so that arules can read it:

write.table(newdata, "transactions.csv", row.names = FALSE, col.names = FALSE, sep = ",") 

Next, I use the arules package to read the CSV file as transactions:

library(arules)

transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket") 

This does not work: it throws an error. After reading earlier questions on Stack Overflow, I was able to resolve it as follows:

transactiondata <- read.transactions("transactions.csv", sep = ",", format = "basket", rm.duplicates = TRUE)

itemFrequencyPlot(transactiondata, topN = 5)

The resulting plot shows " " as the most frequent item, which does not reflect the real data and is an artifact of my pre-processing. Suggestions for resolving this would be greatly appreciated!
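
As a rough diagnostic, the blank item can be traced back to the padding step. The sketch below is base R only and does not use dplyr, reshape2, or arules; it rebuilds the same baskets with `split()` and pads them with " " the way the wide table does. The names `baskets`, `padded`, and `rows_containing` are made up for illustration.

```r
# Rebuild the question's data in long form (one row per purchased product).
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# One basket per customer/date pair.
baskets <- split(data$Product, paste(data$OrderDate, data$Customer))

# Mimic the wide table: pad every basket with " " up to the longest basket.
width  <- max(lengths(baskets))
padded <- lapply(baskets, function(b) c(b, rep(" ", width - length(b))))

# Count how many baskets each value appears in
# (duplicates within one basket are counted once).
rows_containing <- table(unlist(lapply(padded, unique)))
print(rows_containing)
```

Every basket shorter than the longest one picks up at least one " ", so the padding value ends up competing with the real products in the frequency counts.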


Solution

  • I would skip the wide table and the CSV file and build the transactions directly from the long-format data (following the examples in the manual page for transactions):

    data_list <- split(data$Product, paste(data$OrderDate, data$Customer))
    trans <- as(data_list, "transactions")
    inspect(trans)
    
        items                    transactionID
    [1] {Milk}                   1-Oct John   
    [2] {Bread,Eggs}             2-Oct John   
    [3] {Butter,Eggs,Milk}       2-Oct Tom    
    [4] {Bread,Butter,Eggs,Wine} 3-Oct Sally
    
    itemFrequencyPlot(trans, topN = 5)
    

    Hope this helps!
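
If the CSV file is still needed (for example, to hand it to someone else), one possible variation is to write one basket per line without any padding and read it back with `read.transactions`. This is only a sketch under the assumption that the data looks like the example in the question; `baskets`, `csv_path`, and `trans_csv` are illustrative names, and the file goes to a temporary path rather than "transactions.csv". Note that the customer/date labels are not stored in the file this way.

```r
library(arules)

# The question's data in long form (one row per purchased product).
data <- data.frame(
  Customer  = c("John", "John", "John", "Tom", "Tom", "Tom",
                "Sally", "Sally", "Sally", "Sally"),
  OrderDate = c("1-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct", "2-Oct",
                "3-Oct", "3-Oct", "3-Oct", "3-Oct"),
  Product   = c("Milk", "Eggs", "Bread", "Butter", "Eggs", "Milk",
                "Bread", "Butter", "Eggs", "Wine"),
  stringsAsFactors = FALSE
)

# One basket per customer/date pair, as in the answer above.
baskets <- split(data$Product, paste(data$OrderDate, data$Customer))

# Write one basket per line: comma-separated items, no padding, no quotes.
csv_path <- tempfile(fileext = ".csv")
writeLines(vapply(baskets, paste, character(1), collapse = ","), csv_path)

# Read the file back in basket format.
trans_csv <- read.transactions(csv_path, format = "basket", sep = ",")

# Check that only real products appear as items.
print(itemLabels(trans_csv))
print(itemFrequency(trans_csv, type = "absolute"))
```

Because no line contains a repeated item, `rm.duplicates = TRUE` should not be needed here.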