I've been trying to use cSPADE on a dataset with ~7 million records in my transactions file (7 million unique sequenceID x eventID pairs). The support values I get when I run cSPADE on this dataset seem completely wrong. However, when I use ~86,000 records (the head of the same file, more or less), the results look right. I've noticed that up to this size the verbose log reports that only 1 partition is used, whereas when I try ~850,000 records, 3 partitions are used.
Verbose output when using 100,000 records (with reasonable-looking results):
> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))
parameter specification:
support : 0.1
maxsize : 10
maxlen : 1
algorithmic control:
bfstype : FALSE
verbose : TRUE
summary : FALSE
tidLists : FALSE
preprocessing ... 1 partition(s), 1.98 MB [0.7s]
mining transactions ... 0 MB [0.21s]
reading sequences ... [0.03s]
total elapsed time: 0.94s
> summary(s1)
set of 14 sequences with
most frequent items:
A B C D E (Other)
2 2 1 1 1 8
.
.
.
summary of quality measures:
support
Min. :0.1306
1st Qu.:0.3701
Median :0.7021
Mean :0.5773
3rd Qu.:0.7184
Max. :0.9903
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
trans 83686 10059 0.1
Verbose output when using 1,000,000 records (with wrong-looking results):
> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE))
parameter specification:
support : 0.1
maxsize : 10
maxlen : 1
algorithmic control:
bfstype : FALSE
verbose : TRUE
summary : FALSE
tidLists : FALSE
preprocessing ... 3 partition(s), 19.55 MB [4.6s]
mining transactions ... 0 MB [0.6s]
reading sequences ... [0.01s]
total elapsed time: 5.19s
> summary(s1)
set of 0 sequences with
most frequent items:
integer(0)
most frequent elements:
integer(0)
element (sequence) size distribution:
< table of extent 0 >
sequence length distribution:
< table of extent 0 >
summary of quality measures:
< table of extent 0 >
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
trans 826830 96238 0.1
I found that I can set the number of partitions to 1 when calling cSPADE, and that fixed the problem. However, cSPADE then outputs a warning:
> s1 <- cspade(trans, parameter = list(support = 0.1,maxlen=1), control = list(verbose = TRUE,numpart=1))
Warning message:
In cspade(trans, parameter = list(support = 0.1, maxlen = 1), control = list(verbose = TRUE, : 'numpart' less than recommended
Do I need to heed this warning? What are the downsides of setting numpart=1 (forcing the number of partitions to be 1)? If there are any, is there a way for me to get correct results without setting this parameter?
For the benefit of others who may run into the same problem: I ended up emailing the author of the package. He said this was not a known issue and suggested that I stick to numpart=1.
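For anyone who wants to check which run to trust: the support of a single-item sequence such as <{A}> is just the fraction of sequences that contain that item at least once, so it can be recomputed without cSPADE. Below is a minimal base-R sketch, assuming the transactions are also available as a plain data frame with one row per (sequenceID, eventID, item). The column names, the toy data and the helper `item_support` are hypothetical and not part of arulesSequences, and this only cross-checks single-item sequences, not multi-item elements.

```r
# Hypothetical toy data: one row per (sequenceID, eventID, item).
trans_df <- data.frame(
  sequenceID = c(1, 1, 1, 2, 2, 3, 3, 4),
  eventID    = c(1, 2, 3, 1, 2, 1, 2, 1),
  item       = c("A", "B", "A", "A", "C", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Support of each single item = (# sequences containing the item) / (# sequences).
# item_support is an illustrative helper, not an arulesSequences function.
item_support <- function(df) {
  n_seq <- length(unique(df$sequenceID))
  # Count each item at most once per sequence.
  pairs <- unique(df[, c("sequenceID", "item")])
  counts <- table(pairs$item)
  support <- setNames(as.numeric(counts) / n_seq, names(counts))
  sort(support, decreasing = TRUE)
}

sup <- item_support(trans_df)
# Keep only items at or above the same threshold passed to cspade().
print(sup[sup >= 0.1])
```

If the single-item supports computed this way match the numpart=1 run but not the multi-partition run, that points to the partitioned run as the one producing the wrong results.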