
R arulesSequences: find which patterns are supported by a sequence


I'm having trouble with the arulesSequences library in R.

I have a transactional dataset with temporal information (here, let's use the default zaki dataset). I use SPADE (cspade function) to find the frequent subsequences in the dataset.

library(arulesSequences)
data(zaki)
frequent_sequences <- cspade(zaki, parameter=list(support=0.5))

Now, what I want is to find, for each sequence (i.e. for each customer), which frequent subsequences it supports. I tried various combinations of %in% and subset without much success.

For example, for the second customer, the initial transactions, as shown by inspect(zaki[zaki@itemsetInfo$sequenceID==2]), are:

  items   sequenceID eventID SIZE
5 {A,B,F}          2      15    3
6 {E}              2      20    1

The frequent sequences in the whole dataset, as shown by inspect(frequent_sequences), are:

   items                support
 1 <{A}>                   1.00
 2 <{B}>                   1.00
 3 <{D}>                   0.50
 4 <{F}>                   1.00
 5 <{A, F}>                0.75
 6 <{B, F}>                1.00
 7 <{D}, {F}>              0.50
 8 <{D}, {B, F}>           0.50
 9 <{A, B, F}>             0.75
10 <{A, B}>                0.75
11 <{D}, {B}>              0.50
12 <{B}, {A}>              0.50
13 <{D}, {A}>              0.50
14 <{F}, {A}>              0.50
15 <{D}, {F}, {A}>         0.50
16 <{B, F}, {A}>           0.50
17 <{D}, {B, F}, {A}>      0.50
18 <{D}, {B}, {A}>         0.50

What I'd like to see is that customer 2 supports the frequent sequences 1, 2, 4, 5, 6, 9 and 10, but does not support the others.
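
To make "supports" concrete, here is a minimal base-R sketch of the containment test, independent of arulesSequences. The names supports_pattern, customer_2 and patterns are hypothetical and chosen only for illustration. The sketch assumes a customer's history is a list of itemsets ordered by event time and ignores any gap or window constraints; it is not how cspade computes support internally.

```r
# Customer 2 from the zaki data: events ordered by eventID (15, then 20).
customer_2 <- list(c("A", "B", "F"), c("E"))

# TRUE when each element of `pattern` is contained in some event of `history`,
# with the matched events appearing in strictly increasing order.
# Greedy matching: each element takes the earliest event that can hold it.
supports_pattern <- function(history, pattern) {
  pos <- 1
  for (element in pattern) {
    while (pos <= length(history) && !all(element %in% history[[pos]])) {
      pos <- pos + 1
    }
    if (pos > length(history)) return(FALSE)  # ran out of events
    pos <- pos + 1  # the next element must match a later event
  }
  TRUE
}

# Four of the frequent sequences listed above, written as lists of itemsets.
patterns <- list(
  "<{A}>"       = list("A"),
  "<{D}>"       = list("D"),
  "<{A, B, F}>" = list(c("A", "B", "F")),
  "<{B}, {A}>"  = list("B", "A")
)

print(sapply(patterns, function(p) supports_pattern(customer_2, p)))
```

For customer 2 this reports TRUE for <{A}> and <{A, B, F}>, and FALSE for <{D}> and <{B}, {A}>: B and A both occur, but only within the same event, so A never comes after B.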

I could also settle for the reverse information: which are the base sequences that support a given frequent subsequence? R somehow knows this information, since it uses it to compute the support of the frequent sequences.

It seems to me that this should be easy (and it probably is!) but I can't seem to figure it out...

Any ideas?


Solution

  • After some cool-headed digging, I found a way to do it, and indeed, it was easy... since the support function does the job!

    ids <- unique(zaki@itemsetInfo$sequenceID)
    encoding <- data.frame()
    
    # Prepare the data.frame: one (empty) logical column per frequent sequence
    for (seq_id in seq_len(length(frequent_sequences))){
        encoding[,labels(frequent_sequences[seq_id])] <- logical(0)
    }
    
    # Fill the rows, one per customer. The sequence ID doubles as the row
    # index, which works for zaki because its IDs are 1, 2, 3, 4.
    for (id in ids){
        transaction_subset <- zaki[zaki@itemsetInfo$sequenceID==id]
        # Against a single customer's transactions, absolute support is 0 or 1
        encoding[id, ] <- as.logical(
            support(frequent_sequences, transaction_subset, type="absolute")
            )
    }
    

    There might be more elegant ways to get there, but this yields the expected result:

    > encoding
      <{A}> <{B}> <{D}> <{F}> <{A,F}> <{B,F}> <{D},{F}> <{D},{B,F}> <{A,B,F}>
    1  TRUE  TRUE  TRUE  TRUE    TRUE    TRUE      TRUE        TRUE      TRUE
    2  TRUE  TRUE FALSE  TRUE    TRUE    TRUE     FALSE       FALSE      TRUE
    3  TRUE  TRUE FALSE  TRUE    TRUE    TRUE     FALSE       FALSE      TRUE
    4  TRUE  TRUE  TRUE  TRUE   FALSE    TRUE      TRUE        TRUE     FALSE
      <{A,B}> <{D},{B}> <{B},{A}> <{D},{A}> <{F},{A}> <{D},{F},{A}> <{B,F},{A}>
    1    TRUE      TRUE      TRUE      TRUE      TRUE          TRUE        TRUE
    2    TRUE     FALSE     FALSE     FALSE     FALSE         FALSE       FALSE
    3    TRUE     FALSE     FALSE     FALSE     FALSE         FALSE       FALSE
    4   FALSE      TRUE      TRUE      TRUE      TRUE          TRUE        TRUE
      <{D},{B,F},{A}> <{D},{B},{A}>
    1            TRUE          TRUE
    2           FALSE         FALSE
    3           FALSE         FALSE
    4            TRUE          TRUE
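
The reverse question (which customers support a given frequent subsequence) can be read off the same table. The sketch below is base R only and retypes four columns of the output above by hand so that it runs on its own; the names supporters and patterns_by_customer are hypothetical. With the real encoding data.frame built above, the same two lapply calls should apply unchanged, on the assumption that row numbers match sequence IDs as they do for zaki.

```r
# Four columns of the table above, retyped by hand for illustration.
encoding <- data.frame(
  "<{A}>"     = c(TRUE, TRUE,  TRUE,  TRUE),
  "<{D}>"     = c(TRUE, FALSE, FALSE, TRUE),
  "<{A,F}>"   = c(TRUE, TRUE,  TRUE,  FALSE),
  "<{D},{F}>" = c(TRUE, FALSE, FALSE, TRUE),
  check.names = FALSE  # keep the sequence labels as column names
)

# For each frequent sequence: the customers (row numbers) that support it.
supporters <- lapply(encoding, which)

# For each customer: the labels of the frequent sequences it supports.
patterns_by_customer <- lapply(
  seq_len(nrow(encoding)),
  function(i) names(encoding)[unlist(encoding[i, ])]
)

print(supporters[["<{D},{F}>"]])  # customers 1 and 4
print(patterns_by_customer[[2]])  # "<{A}>" "<{A,F}>"
```

Keeping the result as a logical table means both directions of the lookup are a single indexing step, at the cost of one support() call per customer when the table is built.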