R: convert a transaction-format dataset to basket format for sequence mining


ORIGINAL TABLE

CELL NUMBER | ACTIVITY | TIME
001         | call a   | 12.23
002         | call b   | 01.00
002         | call d   | 01.09
001         | call b   | 12.25
003         | call a   | 12.23
002         | call a   | 02.07
003         | call b   | 12.25

REQUIRED

To mine the most frequently occurring sequences of ACTIVITY from a dataset of size 400,000.

THE ABOVE EXAMPLE SHOULD SHOW

[call a-12.23,call b-12.25] frequency 2
[call b-01.00,call d-01.09,call a-02.07] frequency 1

I'm aware that this can be achieved using arulesSequences. Which transformations do I need to apply to the dataset, and how, so that I can use the arulesSequences package?

Current format: transactions with 3 columns, like the sample above.
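As a rough sketch of what such a conversion could look like in base R: the snippet below sorts the events within each cell and builds one text line per event in a "sequenceID eventID SIZE item" layout. The data frame `calls`, its column names, and the `basket_lines` vector are illustrative assumptions, not part of the original data, and the layout is my understanding of what arulesSequences expects, so check it against the package documentation.

```r
# Hypothetical sample mirroring the table above; column names are assumptions.
calls <- data.frame(
  cell     = c("001", "002", "002", "001", "003", "002", "003"),
  activity = c("call a", "call b", "call d", "call b", "call a", "call a", "call b"),
  time     = c("12.23", "01.00", "01.09", "12.25", "12.23", "02.07", "12.25"),
  stringsAsFactors = FALSE
)

# Sort by cell, then by time within each cell.
calls <- calls[order(calls$cell, as.numeric(calls$time)), ]

# eventID: position (1, 2, 3, ...) of each event inside its cell's sequence.
calls$event_id <- ave(seq_along(calls$cell), calls$cell, FUN = seq_along)

# One line per event: sequenceID eventID SIZE item.
# Items are space-separated in this layout, so "call a" becomes "call_a".
basket_lines <- paste(
  as.integer(calls$cell),          # sequenceID
  calls$event_id,                  # eventID
  1,                               # SIZE: one item per event here
  gsub(" ", "_", calls$activity)   # item
)

print(basket_lines)
```

The lines could then be written out with `writeLines()` and, as far as I know, read with `read_baskets()` from arulesSequences using `info = c("sequenceID", "eventID", "SIZE")`.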


Solution

    # Note: read.table() renames "CELL NUMBER" to CELL.NUMBER and parses TIME as numeric.
    df <- read.table(header = TRUE, sep = "|", text = "CELL NUMBER|ACTIVITY|TIME
    001|call a|12.23
    002|call b|01.00
    002|call d|01.09
    001|call b|12.25
    003|call a|12.23
    002|call a|02.07
    003|call b|12.25")

    library(plyr)  # for count()
    freqs <- count(df[, -1])  # [, -1] drops the CELL NUMBER column from the grouping
    freqs[order(-freqs$freq), ]
      ACTIVITY  TIME freq
    2   call a 12.23    2
    4   call b 12.25    2
    1   call a  2.07    1
    3   call b  1.00    1
    5   call d  1.09    1
    

    EDIT: to get the output closer to the requested format:

    unique(ddply(
      freqs, .(-freq), summarise,
      calls = paste0(
        "[", paste0(paste0(ACTIVITY, "-", TIME), collapse = ","), "]",
        "frequency", freq
      )
    ))
    #  -freq                                        calls
    #1    -2        [call a-12.23,call b-12.25]frequency2
    #3    -1 [call a-2.07,call b-1,call d-1.09]frequency1
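Note that the grouping above keys on frequency, not on CELL NUMBER, so on a larger dataset unrelated events that happen to share a frequency would end up in the same bracket. A minimal base-R sketch that builds one sequence per cell and then counts identical sequences might look like this. The data frame `calls`, its column names, and the helper variables are assumptions for illustration. TIME is kept as text so that "01.00" is not printed as 1.

```r
# Hypothetical sample mirroring the question's table; column names are assumptions.
calls <- data.frame(
  cell     = c("001", "002", "002", "001", "003", "002", "003"),
  activity = c("call a", "call b", "call d", "call b", "call a", "call a", "call b"),
  time     = c("12.23", "01.00", "01.09", "12.25", "12.23", "02.07", "12.25"),
  stringsAsFactors = FALSE
)

# Order events by cell, then chronologically within each cell.
calls <- calls[order(calls$cell, as.numeric(calls$time)), ]

# Label each event as "activity-time".
events <- paste(calls$activity, calls$time, sep = "-")

# Collapse each cell's events into a single sequence string.
seqs <- tapply(events, calls$cell, paste, collapse = ",")

# Count how many cells share each sequence, most frequent first.
seq_freq <- sort(table(seqs), decreasing = TRUE)

for (s in names(seq_freq)) {
  cat("[", s, "] frequency ", seq_freq[[s]], "\n", sep = "")
}
```

This only counts exact, whole-cell sequences; mining frequent subsequences is what arulesSequences is for.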