r, data-mining, apriori, arules

Transform a dataframe to a transaction object for the apriori function without exporting and reloading the dataframe


I'm having trouble converting a data frame into a transactions object. I build a data frame grouped by invoice number, with the list of products joined by ',' (so the data frame has two columns). That part works fine:

library(plyr)       # load plyr before the tidyverse so it does not mask dplyr verbs
library(tidyverse)

df = read.csv('Orders.csv', sep = ';', stringsAsFactors = TRUE)
df$Document.Date = as.Date(df$Document.Date, format = '%d/%m/%Y')

grouping_for_AA =
    data.frame(
        df %>%
        group_by(Sales.Document, Material) %>%
        dplyr::select(Sales.Document, Material, Document.Date)
    )


#Create transaction data building a list of material for each sales doc
#separated by a ,
transactionData = ddply(grouping_for_AA, c('Sales.Document'),
                        function(df) paste(df$Material,
                        collapse = ',')
                        )

But when I call as(data, 'transactions'), R tells me to discretize the input, so I applied as.factor to the product-list column. Doing this turns each whole transaction into a single factor level, so (clearly) no rules can be mined.

#Drop the Sales.Document column from transactionData
transactionData$Sales.Document <- NULL
#Rename the remaining column, which holds the joined list of materials
colnames(transactionData) = 'Material'

#transform to factor
transactionData = data.frame(lapply(transactionData, factor))


#Create a transaction object: errors can be due to the package containing 'as'
trObj <- as(transactionData, "transactions")
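
To illustrate the symptom, here is a minimal base-R sketch. The vector `joined` is a hypothetical stand-in for the comma-joined column produced by ddply(); its values are invented for illustration. It shows that the factor conversion yields one level per transaction, and that splitting on the separator recovers one item vector per transaction, which is the list shape that arules can coerce with as(items, "transactions").

```r
# Hypothetical stand-in for the joined Material column: one comma-separated
# string per sales document (values invented for illustration).
joined <- c("A,B,C", "A,C", "A")

# As a factor, every whole string is its own level, so each "item" occurs
# in exactly one transaction and nothing can co-occur.
f <- factor(joined)
nlevels(f)

# Splitting on the separator recovers one character vector of items per
# transaction.
items <- strsplit(joined, ",", fixed = TRUE)
lengths(items)
```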

I have already tried data frames in both single and basket format, but could not get it to work.

Any idea how to convert a data frame into transaction format without exporting and reloading the data?


Solution

  • You can convert your data.frame into a transactions object as follows. I've added a placeholder date column, although it is not needed, since you don't use it in your analysis:

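    # Sample data matching the printout below
    data <- data.frame(Sales.Document = c(1, 1, 1, 2, 2, 3),
                       Material = c("A", "B", "C", "A", "C", "A"))
    data$Document.Date <- Sys.Date()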
    data
      Sales.Document Material Document.Date
    1              1        A    2018-11-21
    2              1        B    2018-11-21
    3              1        C    2018-11-21
    4              2        A    2018-11-21
    5              2        C    2018-11-21
    6              3        A    2018-11-21
    

    Next, build the same grouped dataset as yours; data.frame() can go at the end of the dplyr chain:

    library(plyr)
    library(tidyverse)
    grouping_for_AA <- data %>%
                       group_by(Sales.Document,  Material) %>%
                       dplyr::select(Sales.Document, Material, Document.Date) %>%
                       data.frame()
    

    Now you can convert it into a transactions object:

    library(arules)
    library(reshape2)
    trans <- as(split(grouping_for_AA[,"Material"], grouping_for_AA[,"Sales.Document"]), "transactions")
    
    inspect(trans)
        items   transactionID
    [1] {A,B,C} 1            
    [2] {A,C}   2            
    [3] {A}     3    
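
    The key step is split(), which turns the long data frame into a named list with one vector of materials per sales document. A minimal base-R sketch of that intermediate result follows, using the sample rows printed earlier. The unique() step is my own addition, not part of the original code: as far as I know, the list coercion in arules rejects a transaction that contains the same item twice, so deduplicating may be needed on real order data.

    ```r
    # Sample rows matching the data frame printed earlier in this answer.
    grouping_for_AA <- data.frame(
      Sales.Document = c(1, 1, 1, 2, 2, 3),
      Material = c("A", "B", "C", "A", "C", "A"),
      stringsAsFactors = FALSE
    )

    # split() groups the Material values by Sales.Document and returns a
    # named list: one character vector per document.
    baskets <- split(grouping_for_AA[, "Material"],
                     grouping_for_AA[, "Sales.Document"])

    # Assumption: real data may repeat a material within one document;
    # dropping repeats keeps each basket a set of distinct items.
    baskets <- lapply(baskets, unique)

    str(baskets)
    ```

    The list names ("1", "2", "3") are what end up as transactionID after the coercion.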
    

    Lastly, you can apply the apriori() function:

    rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.3, target="rules", minlen=2)) 
    inspect(rules)
        lhs      rhs support   confidence lift count
    [1] {B}   => {C} 0.3333333 1.0000000  1.5  1    
    [2] {C}   => {B} 0.3333333 0.5000000  1.5  1    
    [3] {B}   => {A} 0.3333333 1.0000000  1.0  1    
    [4] {A}   => {B} 0.3333333 0.3333333  1.0  1    
    [5] {C}   => {A} 0.6666667 1.0000000  1.0  2    
    [6] {A}   => {C} 0.6666667 0.6666667  1.0  2    
    [7] {B,C} => {A} 0.3333333 1.0000000  1.0  1    
    [8] {A,B} => {C} 0.3333333 1.0000000  1.5  1    
    [9] {A,C} => {B} 0.3333333 0.5000000  1.5  1
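
    As a sanity check on the table, the metrics for rule [1], {B} => {C}, can be recomputed by hand in base R. This is only a sketch: `supp_of` is a hypothetical helper written for this example, not an arules function, and the three baskets are copied from the inspect(trans) output.

    ```r
    # The three transactions shown by inspect(trans).
    baskets <- list(c("A", "B", "C"), c("A", "C"), "A")

    # Hypothetical helper: share of transactions containing every item given.
    supp_of <- function(items) {
      mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
    }

    supp_rule <- supp_of(c("B", "C"))      # support of {B} => {C}
    conf_rule <- supp_rule / supp_of("B")  # confidence = supp(lhs and rhs) / supp(lhs)
    lift_rule <- conf_rule / supp_of("C")  # lift = confidence / supp(rhs)

    c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
    ```

    The three values agree with the first row of the table: support 1/3, confidence 1 and lift 1.5.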