
Why does as(..., "transactions") for arules in R seem to lose transactions?


I have a large dataset in CSV:

see attached image

  • There are 50,000 rows, each row is one transaction.
  • There are a maximum of 5 items and a minimum of 1 item in each transaction.
  • There are 5000 different possible item values.
  • There are no duplicate items in a transaction.

After loading the CSV into RStudio and applying unclass(), I apply as(...,"transactions").

The result is something like this:

# transactions in sparse format with
#  5 transactions (rows) and
#  1455 items (columns)

Instead of 50,000 transactions, there are only 5 now.

Where have all the transactions gone? Was the matrix somehow transposed (as the row count in the result equals the column count of my CSV)?

This may be a data pre-processing problem, but according to this post my input data should have the right format.

[I'm posting for the first time here and am fairly new to R/RStudio.]


Solution

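  • Have a look at the coercion methods in the man page ?transactions. You will see that you need either a binary incidence matrix, a list of transactions, or a data.frame containing only categorical variables. Your data is not in one of these forms, so as(..., "transactions") will not work as intended. In particular, unclass() on a data.frame returns a list with one element per column, and each list element is treated as one transaction, which is why you end up with 5 transactions instead of 50,000.

    I think read.transactions can read your data.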

    library(arules)
    
    # create and write some data
    data <- paste(
       "item1,item2,,,", 
       "item1,,,,", 
       "item2,item3,,,", 
       sep="\n")
    write(data, file = "demo_basket")
    
    # read the data
    tr <- read.transactions("demo_basket", format = "basket", sep=",")
    inspect(tr)
    
        items        
    [1] {item1,item2}
    [2] {item1}      
    [3] {item2,item3}
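If the CSV is already loaded as a data.frame, a list with one item vector per row can also be coerced directly. The following is only a minimal sketch under assumptions: `df` is a hypothetical stand-in for the loaded CSV (one row per transaction, empty cells read as NA or ""), and `rows` is an illustrative helper name. It first checks that unclass() yields one list element per column, then builds the per-row list and coerces that.

```r
library(arules)

# Hypothetical stand-in for the loaded CSV: one row per transaction,
# unused item slots are NA (they could also be empty strings).
df <- data.frame(
  V1 = c("item1", "item1", "item2"),
  V2 = c("item2", NA, "item3"),
  stringsAsFactors = FALSE
)

# unclass() on a data.frame gives a list with one element per COLUMN,
# so coercing it would treat each column as a transaction.
length(unclass(df))  # 2 here: the number of columns, not the number of rows

# Build a list with one character vector per ROW, dropping empty cells.
m <- as.matrix(df)
rows <- lapply(seq_len(nrow(m)), function(i) {
  x <- unname(m[i, ])
  x[!is.na(x) & x != ""]
})

# A list of item vectors is one of the supported inputs.
tr <- as(rows, "transactions")
inspect(tr)
```

For a file on disk, read.transactions as shown above is likely the simpler route; the row-wise list is mainly useful when the data has to be cleaned in R first.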