How to convert dataframe into usable format for sequence mining in R?


I'd like to do sequence analysis in R, and I'm trying to convert my data into a usable form for the arulesSequences package.

library(tidyverse)
library(arules)
library(arulesSequences)

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

If I leave my columns as their original classes, as above, I get an error:

Error in asMethod(object) : 
  column(s) 1, 2, 3, 4 not logical or a factor. Discretize the columns first.

However, if I convert the columns to factors, I get another error:

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))

df <- as.data.frame(lapply(df, as.factor))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

Error in asMethod(object) :
In makebin(data, file) : 'eventID' is a factor

Any advice on getting around this or advice on sequence mining in R in general is much appreciated. Thanks!


Solution

  • Only the actual items (in your case, "site") go into the transactions; the sequence and event identifiers are attached afterwards through transactionInfo. Always inspect your intermediate results to make sure they look right. The type of transactions needed for sequence mining is described in ?cspade.

    library("arulesSequences")
    df <- data.frame(personID = c(1, 1, 2, 2, 2),
                 eventID = c(100, 101, 102, 103, 104),
                 site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
                 sequence = c(1, 2, 1, 2, 3))
    
    # convert site into itemsets and add sequence and event ids
    df.trans <- as(df[,"site", drop = FALSE], "transactions")
    transactionInfo(df.trans)$sequenceID <- df$sequence
    transactionInfo(df.trans)$eventID <- df$eventID
    inspect(df.trans)
    
    # sort by sequenceID, then by eventID within each sequence
    info <- transactionInfo(df.trans)
    df.trans <- df.trans[order(info$sequenceID, info$eventID),]
    inspect(df.trans)
    
    # mine sequences
    seq <- cspade(df.trans, parameter = list(support = 0.2), 
                  control = list(verbose = TRUE))
    inspect(seq)
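
    Judging by the message in the question, the second attempt failed because the id columns had been turned into factors. As far as I recall from ?cspade, the ids should be positive integers and the transactions ordered by sequenceID and eventID. Below is a minimal base-R sketch of tidying the data frame before building the transactions. The helper prepare_ids is a hypothetical name of my own, not part of arules or arulesSequences.

    ```r
    # Hypothetical helper (not from any package): coerce the id columns to
    # integers and order the rows by sequence id, then by event id.
    prepare_ids <- function(d, seq_col, event_col) {
      to_int <- function(x) {
        # as.integer() on a factor returns the level codes, so convert
        # through character first to recover the original values
        if (is.factor(x)) x <- as.character(x)
        as.integer(x)
      }
      d[[seq_col]] <- to_int(d[[seq_col]])
      d[[event_col]] <- to_int(d[[event_col]])
      d <- d[order(d[[seq_col]], d[[event_col]]), , drop = FALSE]
      rownames(d) <- NULL
      d
    }

    # Same data as above, but with every column stored as a factor,
    # as in the second attempt from the question
    df <- data.frame(personID = c(1, 1, 2, 2, 2),
                     eventID = c(100, 101, 102, 103, 104),
                     site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
                     sequence = c(1, 2, 1, 2, 3))
    df_factor <- as.data.frame(lapply(df, as.factor))

    df_ready <- prepare_ids(df_factor, seq_col = "sequence", event_col = "eventID")
    str(df_ready)
    ```

    After this step, the "site" column of df_ready can be converted to transactions and the two id columns attached through transactionInfo, in the same way as in the code above.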
    

    Hope this helps!