How to convert a data frame into a usable format for sequence mining in R?

I'd like to do sequence analysis in R, and I'm trying to convert my data into a usable form for the arulesSequences package.

library(tidyverse)
library(arules)
library(arulesSequences)

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

If I leave my columns as their original classes, as above, I get an error:

Error in asMethod(object) : 
  column(s) 1, 2, 3, 4 not logical or a factor. Discretize the columns first.

However, if I convert the columns to factors, I get another error:

df <- data_frame(personID = c(1, 1, 2, 2, 2),
             eventID = c(100, 101, 102, 103, 104),
             site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
             sequence = c(1, 2, 1, 2, 3))

df <- as.data.frame(lapply(df, as.factor))
df.trans <- as(df, "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
seq <- cspade(df.trans, parameter = list(support = 0.4), control = list(verbose = TRUE))

Error in asMethod(object) :
In makebin(data, file) : 'eventID' is a factor

Any advice on getting around this or advice on sequence mining in R in general is much appreciated. Thanks!

Solution

Only the actual items (in your case "site") go into the transactions; the sequence and event identifiers belong in transactionInfo, not in the item matrix. Always inspect your intermediate results to make sure they look right. The type of transactions needed for sequence mining is described in ?cspade.

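library("arulesSequences")
# stringsAsFactors = TRUE keeps "site" a factor on R >= 4.0, where
# data.frame() no longer converts strings to factors by default
df <- data.frame(personID = c(1, 1, 2, 2, 2),
                 eventID = c(100, 101, 102, 103, 104),
                 site = c("google", "facebook", "facebook", "askjeeves", "stackoverflow"),
                 sequence = c(1, 2, 1, 2, 3),
                 stringsAsFactors = TRUE)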

# convert site into itemsets and add sequence and event ids
df.trans <- as(df[,"site", drop = FALSE], "transactions")
transactionInfo(df.trans)$sequenceID <- df$sequence
transactionInfo(df.trans)$eventID <- df$eventID
inspect(df.trans)

# sort by sequenceID, then eventID (cspade expects the data in this order)
df.trans <- df.trans[order(transactionInfo(df.trans)$sequenceID,
                           transactionInfo(df.trans)$eventID), ]
inspect(df.trans)

# mine sequences
seq <- cspade(df.trans, parameter = list(support = 0.2), 
              control = list(verbose = TRUE))
inspect(seq)
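
In the question's data, personID looks like the natural sequence identifier, and the sequence column looks like the position of each event within a person. If that reading is right (it is an assumption, not something stated in the question), a minimal sketch of that mapping could look like the following. The names trans, ord and freq are made up for illustration, and the support value is taken from the question.

```r
library(arulesSequences)  # also attaches arules

df <- data.frame(
  personID = c(1, 1, 2, 2, 2),
  site     = factor(c("google", "facebook", "facebook",
                      "askjeeves", "stackoverflow")),
  sequence = c(1, 2, 1, 2, 3)
)

# Only the item column goes into the transactions.
trans <- as(df[, "site", drop = FALSE], "transactions")

# Assumption: one sequence per person, events ordered by `sequence`.
transactionInfo(trans)$sequenceID <- as.integer(df$personID)
transactionInfo(trans)$eventID    <- as.integer(df$sequence)

# Order by sequenceID, then eventID, before mining.
ord <- order(transactionInfo(trans)$sequenceID,
             transactionInfo(trans)$eventID)
trans <- trans[ord, ]

freq <- cspade(trans, parameter = list(support = 0.4))

# Flatten the mined sequences into a data frame for further processing.
as(freq, "data.frame")
```

Keeping the identifiers as integers avoids the "is a factor" error from the question, and coercing the result to a data frame makes it easy to filter or join with other tables.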

Hope this helps!