Earlier question
In this post I asked how to extract the so-called tidList, which indicates whether each frequent sequence found is present in each of the transactions used to mine those sequences. More specifically: how can one extract the boolean matrix (representing the presence or absence of each sequence) in such a way that the row order is the same as in the original transactions dataset?
That eventually turned out to be quite easy, using the transactionInfo attribute of the tidList that is stored in the object of class sequences.
New question
This question is a little different from the earlier one: how can I 'score' new transactions given a set of frequent sequences? That is, how can I obtain a tidList-like object for a new object of type transactions, given an object of type sequences?
To illustrate this, I designed an example using some toy data sets:
library(arules)
library(arulesSequences)
library(stringr)
#Function used to convert character string into an object of type transactions.
#Source: https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
as.transactions <- function(clickstreamList) {
  transactionID <- unlist(lapply(seq_along(clickstreamList), FUN =
    function(x) rep(names(clickstreamList)[x], length(clickstreamList[[x]]))), use.names = FALSE)
  sequenceID <- unlist(lapply(seq_along(clickstreamList), FUN =
    function(x) rep(x, length(clickstreamList[[x]]))))
  eventID <- unlist(lapply(clickstreamList, FUN = function(x)
    seq_along(x)), use.names = FALSE)
  transactionInfo <- data.frame(transactionID, sequenceID, eventID)
  tr <- as(as.data.frame(unlist(clickstreamList, use.names = FALSE)), "transactions")
  transactionInfo(tr) <- transactionInfo
  itemInfo(tr)$labels <- itemInfo(tr)$levels
  return(tr)
}
#Dataset to mine frequent sequences from
data_mine_freq_seq <- data.frame(id = 1:10,
                                 transaction = c("A B B A",
                                                 "A B C B D C B B B F A",
                                                 "A A B",
                                                 "B A B A",
                                                 "A B B B B",
                                                 "A A A B",
                                                 "A B B A B B",
                                                 "E F F A C B D A B C D E",
                                                 "A B B A B",
                                                 "A B"))
#Convert data to list containing character vectors
data_for_fseq_mining <- str_split(string = data_mine_freq_seq$transaction, pattern = " ")
#Include identifiers as names
names(data_for_fseq_mining) <- data_mine_freq_seq$id
#Convert to object of type transactions
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
#Mine frequent sequences with cspade, given some parameters.
sequences <- cspade(data = data_for_fseq_mining_trans,
                    parameter = list(support = 0.10, maxlen = 4, maxgap = 2),
                    control = list(tidList = TRUE, verbose = TRUE))
#Create a data frame that contains all sequences and their support (167 sequences in total).
sequences_df <- cbind(sequence = labels(sequences),
                      support = sequences@quality)
Now I create a new dataset that contains just one transaction:
data_score <- data.frame(id = 11, transaction = "A B B C D A")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
How can I find out which of the frequent sequences in object 'sequences' are present in 'data_score_trans'?
EDIT
I tried the following line of code:
supportingTransactions(x = sequences, transactions = data_score_trans)
This yields the expected and desired result:
tidLists in sparse format with
167 items/itemsets (rows) and
1 transactions (columns)
But when the new transaction contains an element that is not in the original dataset, an error occurs:
#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
How can I solve this?
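Since the first call succeeded with a transaction built only from items that were seen during mining, one possible way around the error might be to drop unseen items before building the transactions object. The following is a minimal base R sketch under that assumption; `drop_unknown_items` is a hypothetical helper, and I have not verified that the filtered transaction avoids the error in every case. Note that removing events shifts the positions of the remaining ones, which matters as soon as 'maxgap' is used.

```r
# Hypothetical helper: keep only the events whose item was seen during mining.
drop_unknown_items <- function(event_list, known_items) {
  lapply(event_list, function(events) events[events %in% known_items])
}

# Items that occur in the toy mining data above
known_items <- c("A", "B", "C", "D", "E", "F")

# New transaction with the unseen item 'G', in the same list format
# as 'data_score_list'
data_score_list <- list("11" = c("A", "B", "B", "C", "D", "A", "G"))

data_score_filtered <- drop_unknown_items(data_score_list, known_items)
data_score_filtered[["11"]]
```

The filtered list can then be passed to as.transactions() in the same way as before.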
I came up with a workaround that makes use of the power of regular expressions. I defined the following function:
score_pattern <- function(pattern, events) {
  #Extract the elements between curly braces, e.g. "<{B},{A}>" gives "B" and "A"
  regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
  regex_elements <- str_replace_all(string = unlist(regex_elements),
                                    pattern = "\\{|\\}", replacement = "")
  #Build a regular expression that matches the elements in order,
  #allowing any number of events in between
  expr <- ""
  for (i in seq_along(regex_elements)) {
    if (i == 1) {
      expr <- paste0(expr, "(^| )", regex_elements[i])
    } else {
      expr <- paste0(expr, "( | .*? )", regex_elements[i])
    }
  }
  expr <- paste0(expr, "( |$)")
  print(expr)
  #1 if the sequence is present in the transaction, 0 otherwise
  score <- as.numeric(grepl(pattern = expr, x = events))
  return(score)
}
To illustrate its use, here is an example that takes a sequence from column 'sequence' of 'sequences_df' and the transaction data from column 'transaction' of 'data_score':
score_pattern(pattern = "<{B},{A}>", events = data_score$transaction)
[1] "(^| )B( | .*? )A( |$)"
[1] 1
The function returns a numeric vector that contains zeros and ones, indicating whether the sequence is present in the transactions provided (1 = yes, 0 = no).
Although this works, it only covers cases in which no restriction is placed on the maximum gap between successive elements of a sequence: the regular expression has no counterpart to the 'maxgap' parameter. Conclusion: this will only work when the parameter 'maxgap' of the cspade algorithm is not set.
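A gap restriction is easier to express by working on event positions instead of a regular expression. Below is a minimal base R sketch, not a replacement for cspade's own matching. It assumes that every element of a sequence is a single item (as in the toy data), that event IDs are the positions 1, 2, 3, ... within a transaction, and that 'maxgap' means the maximum difference in position between two successive matched elements. The helpers `parse_sequence_label` and `contains_sequence` are hypothetical names introduced here.

```r
# Split a label such as "<{B},{A}>" into its elements: "B", "A".
parse_sequence_label <- function(label) {
  elements <- regmatches(label, gregexpr("\\{[^}]*\\}", label, perl = TRUE))[[1]]
  substr(elements, 2, nchar(elements) - 1)  # strip the surrounding braces
}

# TRUE if 'elements' occur in 'events' in order, with at most 'maxgap'
# positions between successive matched elements (assumed meaning of maxgap).
contains_sequence <- function(events, elements, maxgap = Inf) {
  if (length(elements) == 0) return(TRUE)
  # Positions at which a match of the first element can end
  ends <- which(events == elements[1])
  for (k in seq_along(elements)[-1]) {
    candidates <- which(events == elements[k])
    # Keep candidates lying 1..maxgap positions after some earlier match
    reachable <- vapply(candidates, function(j) {
      any(j - ends >= 1 & j - ends <= maxgap)
    }, logical(1))
    ends <- candidates[reachable]
    if (length(ends) == 0) return(FALSE)
  }
  length(ends) > 0
}

# Score the new transaction against a few sequence labels with maxgap = 2
events <- strsplit("A B B C D A", " ")[[1]]
labels_to_score <- c("<{B},{A}>", "<{A},{B},{C}>", "<{C},{B}>")
scores <- vapply(labels_to_score, function(label) {
  as.numeric(contains_sequence(events, parse_sequence_label(label), maxgap = 2))
}, numeric(1))
scores
```

Tracking every position at which a partial match can end, instead of only the first one, avoids missing matches that a greedy left-to-right scan would skip once a gap limit applies. With `maxgap = Inf` the check reduces to the unrestricted case that the regular expression handles.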