Preparing discretized data for arules


I have a data set that has already been discretized, and I want to coerce it to transactions for use with the arules package.

CLUST_K <- structure(list(LONGITUDE = c(118.5, 118.5, 118.5, 118.5, 118.5, 
                    118.5), LATITUDE = c(-11.5, -11.5, -11.5, -11.5, -11.5, -11.5
                    ), DATE_START = structure(c(1419897600, 1419984000, 1420070400, 
                    1420156800, 1420243200, 1420329600), class = c("POSIXct", "POSIXt"
                    )), DATE_END = structure(c(1420502400, 1420588800, 1420675200, 
                    1420761600, 1420848000, 1420934400), class = c("POSIXct", "POSIXt"
                    )), FLAG = c(2, 1, 2, 2, 2, 2), SURFSKINTEMP = c(13L, 1L, 16L, 
                    16L, 7L, 13L), SURFAIRTEMP = c(6L, 6L, 6L, 6L, 6L, 6L), TOTH2OVAP = c(5L, 
                    17L, 17L, 17L, 17L, 17L), TOTO3 = c(16L, 16L, 16L, 10L, 7L, 7L
                    ), TOTCO = c(12L, 12L, 8L, 4L, 12L, 12L), TOTCH4 = c(13L, 14L, 
                    6L, 6L, 11L, 7L), OLR_ARIS = c(10L, 4L, 4L, 7L, 5L, 10L), CLROLR_ARIS = c(10L, 
                    4L, 4L, 7L, 5L, 10L), OLR_NOAA = c(10L, 10L, 10L, 10L, 7L, 9L
                    ), MODIS_LST = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("LONGITUDE", 
                    "LATITUDE", "DATE_START", "DATE_END", "FLAG", "SURFSKINTEMP", 
                    "SURFAIRTEMP", "TOTH2OVAP", "TOTO3", "TOTCO", "TOTCH4", "OLR_ARIS", 
                    "CLROLR_ARIS", "OLR_NOAA", "MODIS_LST"), row.names = c(NA, 6L
                    ), class = "data.frame")    

Printed, the data set CLUST_K looks like this:

    LONGITUDE LATITUDE DATE_START   DATE_END FLAG SURFSKINTEMP SURFAIRTEMP TOTH2OVAP TOTO3 TOTCO TOTCH4 OLR_ARIS    CLROLR_ARIS OLR_NOAA MODIS_LST
1     118.5    -11.5 2014-12-30 2015-01-06    2           13           6         5    16    12     13       10           10       10         1
2     118.5    -11.5 2014-12-31 2015-01-07    1            1           6        17    16    12     14        4            4       10         1
3     118.5    -11.5 2015-01-01 2015-01-08    2           16           6        17    16     8      6        4            4       10         1
4     118.5    -11.5 2015-01-02 2015-01-09    2           16           6        17    10     4      6        7            7       10         1
5     118.5    -11.5 2015-01-03 2015-01-10    2            7           6        17     7    12     11        5            5        7         1
6     118.5    -11.5 2015-01-04 2015-01-11    2           13           6        17     7    12      7       10           10        9         1

Columns 1 to 5 hold the transaction information, and columns 6 to 15 are the transaction items, which have already been discretized.

When I try to coerce the data set to transactions, I get an error:

CLUST_K_R <- CLUST_K[,6:15]
CLUST_K_R_T <- as(CLUST_K_R,"transactions")
Error in asMethod(object) : 
  column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 not logical or a factor. Discretize the columns first.

But the data set has already been discretized.

Using split does not seem right either:

> s1 <- split(CLUST_K$SURFSKINTEMP, CLUST_K$SURFAIRTEMP,CLUST_K$TOTH2OVAP, CLUST_K$TOTO3)
> Tr <- as(s1,"transactions")
Warning message:
In asMethod(object) : removing duplicated items in transactions
> Tr
transactions in sparse format with
 1 transactions (rows) and
 4 items (columns)

Only 1 transaction is left, but there should be 6 in my case.


Solution

  • Since you already discretized the data (via clustering), you only need to make sure that the values are encoded as nominal values (factor), not as numbers (integer).

    for(i in 1:ncol(CLUST_K_R)) CLUST_K_R[[i]] <- as.factor(CLUST_K_R[[i]])
    CLUST_K_R_T <- as(CLUST_K_R,"transactions")
    
    summary(CLUST_K_R_T)
    
    transactions as itemMatrix in sparse format with
     6 rows (elements/itemsets/transactions) and
     30 columns (items) and a density of 0.3333333 
    
    most frequent items:
    SURFAIRTEMP=6   MODIS_LST=1  TOTH2OVAP=17      TOTCO=12   OLR_NOAA=10       (Other) 
                6             6             5             4             4            35 
    
    element (itemset/transaction) length distribution:
    sizes
    10 
     6 
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
         10      10      10      10      10      10 
    
    includes extended item information - examples:
               labels    variables levels
    1  SURFSKINTEMP=1 SURFSKINTEMP      1
    2  SURFSKINTEMP=7 SURFSKINTEMP      7
    3 SURFSKINTEMP=13 SURFSKINTEMP     13
    
    includes extended transaction information - examples:
      transactionID
    1             1
    2             2
    3             3
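
Both the cause of the error and the split() result can be checked in base R, without arules. The following is a minimal sketch, assuming a three-column subset of CLUST_K_R rebuilt from the values in the question; the names col_classes and s1_check are only illustrative. It shows that the cluster labels are stored as integers, converts them with lapply (which does the same job as the for loop above), and shows that splitting by SURFAIRTEMP yields a single group because that column only takes the value 6.

```r
# Three columns of CLUST_K_R, values copied from the question
# (a subset is an assumption made to keep the sketch short)
CLUST_K_R <- data.frame(
  SURFSKINTEMP = c(13L, 1L, 16L, 16L, 7L, 13L),
  SURFAIRTEMP  = c(6L, 6L, 6L, 6L, 6L, 6L),
  TOTH2OVAP    = c(5L, 17L, 17L, 17L, 17L, 17L)
)

# The cluster labels are stored as integers, which is what the
# "not logical or a factor" error complains about
col_classes <- sapply(CLUST_K_R, class)
print(col_classes)

# Convert every column to a factor in one step;
# assigning into CLUST_K_R[] keeps the data.frame structure
CLUST_K_R[] <- lapply(CLUST_K_R, as.factor)
print(sapply(CLUST_K_R, class))

# split(x, f) groups x by the values of f. SURFAIRTEMP is constant,
# so all six values of SURFSKINTEMP land in one list element
s1_check <- split(c(13L, 1L, 16L, 16L, 7L, 13L),
                  c(6L, 6L, 6L, 6L, 6L, 6L))
print(length(s1_check))
```

After the conversion, as(CLUST_K_R, "transactions") should give one transaction per row, as in the summary above. Note also that the third positional argument of split() is drop, so the extra columns passed in the question were not used as additional grouping variables.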