
Odd results from cSPADE in R (arulesSequences) w/ large data. Can I force numpart to 1? Are there risks?


I've been trying to use cSPADE on a dataset with ~7 million records in my transactions file (7 million unique sequenceID x eventID pairs). The support results I get when I run cSPADE on this dataset seem completely wrong. However, when I use ~86,000 records (the head of the same file, more or less), the results look right. I've noticed that up to this size the verbose log reports that only 1 partition is used, whereas when I try ~850,000 records, 3 partitions are used.

Verbose output when using 100,000 records (with reasonable-looking results):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 1 partition(s), 1.98 MB [0.7s]
mining transactions ... 0 MB [0.21s]
reading sequences ... [0.03s]

total elapsed time: 0.94s

> summary(s1)
set of 14 sequences with

most frequent items:
      A       B       C       D       E (Other) 
      2       2       1       1       1       8 

.
.
.
summary of quality measures:
    support      
 Min.   :0.1306  
 1st Qu.:0.3701  
 Median :0.7021  
 Mean   :0.5773  
 3rd Qu.:0.7184  
 Max.   :0.9903  

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans         83686      10059     0.1

Verbose output when using 1,000,000 records (with wrong-looking results):

> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))

parameter specification:
support : 0.1
maxsize :  10
maxlen  :   1

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 3 partition(s), 19.55 MB [4.6s]
mining transactions ... 0 MB [0.6s]
reading sequences ... [0.01s]

total elapsed time: 5.19s

> summary(s1)

set of 0 sequences with

most frequent items:
integer(0)

most frequent elements:
integer(0)

element (sequence) size distribution:
< table of extent 0 >

sequence length distribution:
< table of extent 0 >

summary of quality measures:
< table of extent 0 >

includes transaction ID lists: FALSE 

mining info:
  data ntransactions nsequences support
 trans        826830      96238     0.1

I found that I can set the number of partitions to 1 when calling cSPADE, and that fixed the problem. However, cSPADE then outputs a warning:

s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE,numpart=1))

Warning message:
In cspade(trans, parameter = list(support = 0.1, maxlen = 1), control = list(verbose = TRUE,  :
  'numpart' less than recommended

Do I need to heed this warning? What are the downsides of setting numpart=1 (forcing the number of partitions to be 1)? If there are any, is there a way for me to get the right answers without controlling this parameter?


Solution

  • For the benefit of others who may run into the same problem: I ended up emailing the author of the package. He said this was not a known issue and suggested that I stick to numpart=1.
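
One rough way to check which run to trust is to compute single-item support directly from the raw data and compare it with what cSPADE reports. The sketch below uses base R only. The `events` data frame, its column names, and the `item_support` helper are hypothetical stand-ins for the real transactions file, and the check only covers one-item patterns, not the multi-item elements that cSPADE can also return with maxlen=1.

```r
# Hypothetical long-format table: one row per (sequenceID, eventID, item).
# Replace this toy data with the real transactions file.
events <- data.frame(
  sequenceID = c(1, 1, 1, 2, 2, 3, 3, 4),
  eventID    = c(1, 2, 3, 1, 2, 1, 2, 1),
  item       = c("A", "B", "A", "A", "C", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Support of an item = fraction of distinct sequences that contain it
# at least once (repeats within a sequence are counted once).
item_support <- function(events) {
  n_seq  <- length(unique(events$sequenceID))
  pairs  <- unique(events[, c("sequenceID", "item")])
  counts <- table(pairs$item)
  sort(counts / n_seq, decreasing = TRUE)
}

sup <- item_support(events)

# Items that clear the same threshold used in the cspade() calls above.
print(sup[sup >= 0.1])
```

If the items that clear the 0.1 threshold here also show up, with matching support, in the numpart=1 output but not in the multi-partition output, that suggests the single-partition run is the correct one.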