
Analysis of products purchased after certain days

I am trying to do a sequential analysis of products purchased after a certain period of time: which product combinations customers purchase after 7 days, and what proportion of customers purchase each combination. I have tried the arulesSequences package, but my data is not structured the way the package expects. My columns are user id, date of purchase, product id and product name. I have searched a lot but haven't found a clear way to do this.

Dayy        UID     leaf_category_name  leaf_category_id
5/1/2018    47      Cubes               38860
5/1/2018    272     Pastas & Noodles    34616
5/1/2018    1827    Flavours & Spices   34619
5/1/2018    3505    Feature Phones      1506

This is the kind of data I have. UID is the user id, and the leaf category is, in simple terms, the product purchased. The dataset is large: 2,049,278 rows.

Code I have tried:



#splitting data into transactions
transactions <- as(split(data$leaf_category_id, data$UID), "transactions")

frequent_sequences <- cspade(transactions, parameter=list(support=0.5))


# Convert tabular data to sequences. Item is in
# column 1, sequence ID is column 2, and event ID is column 3.
seqs = make_sequences(data, item_col = 1, sid_col = 2, eid_col = 3)             

# generate frequent sequential patterns with minimum
# support of 0.1 and maximum of 6 elements
fseq = spade(seqs, 0.1, 6)

I want to look at the sequence of products purchased after a certain number of days. Can someone help me with this?

Thank You


  • The apriori path is quite nice. Since I don't have your data, we can use a well-known dataset such as Groceries as an example (in your case, you can subset your data to the rows after the date you want):

    library(arules)
    data(Groceries)

    # here you can see the products with the biggest support
    frequentproducts <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15))
    inspect(frequentproducts)
         items                         support    count
    [1]  {other vegetables,whole milk} 0.07483477  736 
    [2]  {whole milk}                  0.25551601 2513 
    [3]  {other vegetables}            0.19349263 1903 
    [4]  {rolls/buns}                  0.18393493 1809 
    [5]  {yogurt}                      0.13950178 1372 
    [6]  {soda}                        0.17437722 1715 
    [7]  {root vegetables}             0.10899847 1072 
    [8]  {tropical fruit}              0.10493137 1032 
    [9]  {bottled water}               0.11052364 1087 
    [10] {sausage}                     0.09395018  924 
    [11] {shopping bags}               0.09852567  969 
    [12] {citrus fruit}                0.08276563  814 
    [13] {pastry}                      0.08896797  875 
    [14] {pip fruit}                   0.07564820  744 
    [15] {whipped/sour cream}          0.07168277  705 
    [16] {fruit/vegetable juice}       0.07229283  711 
    [17] {newspapers}                  0.07981698  785 
    [18] {bottled beer}                0.08052872  792 
    [19] {canned beer}                 0.07768175  764 

    If you prefer, you can plot it:

    itemFrequencyPlot(Groceries, topN=5, type="absolute")

    Then you can see the association rules:

    association <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
    # show the rules with the highest confidence
    inspect(head(sort(association, by = "confidence")))
      lhs                                           rhs                support     confidence lift     count
    [1] {rice,sugar}                               => {whole milk}       0.001220132 1          3.913649 12   
    [2] {canned fish,hygiene articles}             => {whole milk}       0.001118454 1          3.913649 11   
    [3] {root vegetables,butter,rice}              => {whole milk}       0.001016777 1          3.913649 10   
    [4] {root vegetables,whipped/sour cream,flour} => {whole milk}       0.001728521 1          3.913649 17   
    [5] {butter,soft cheese,domestic eggs}         => {whole milk}       0.001016777 1          3.913649 10   
    [6] {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1          5.168156 10   

    The last column, count, shows how many times each rule appears. You can read it as "how many rows" and, if each row is a customer, as the number of customers. However, you have to decide what "how many customers" means to you: for a sequence such as a,b,a,c, do you want count = 4 (every purchase) or count = 3 (distinct products)? This is pseudocode, and the answer depends on your data.
    Lastly, you can have a look at this; as you've stated, there is also the cspade algorithm, which can help.
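
If it helps to see the "after 7 days" idea and the "what proportion of customers" idea together, here is a minimal sketch in plain Python (standard library only, so this is not the arules or cspade approach). It assumes several things that are not in the question: the sample rows are made up, `Dayy` is month/day/year, "after 7 days" means "at least 7 days later", and each customer counts once per ordered pair, which is only one of the possible readings of "count" discussed above. The function name `lagged_pairs` and its `min_gap_days` parameter are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Made-up sample rows in the question's layout: (Dayy, UID, leaf_category_name).
rows = [
    ("5/1/2018", 47, "Cubes"),
    ("5/9/2018", 47, "Pastas & Noodles"),
    ("5/1/2018", 272, "Cubes"),
    ("5/8/2018", 272, "Pastas & Noodles"),
    ("5/1/2018", 1827, "Cubes"),
    ("5/3/2018", 1827, "Pastas & Noodles"),
    ("5/1/2018", 3505, "Feature Phones"),
]


def lagged_pairs(rows, min_gap_days=7):
    """Map (first, later) product pairs to the share of customers who bought
    `later` at least `min_gap_days` days after `first`."""
    # Group (date, product) purchases by customer.
    # Assumption: dates are month/day/year.
    by_user = defaultdict(list)
    for day, uid, product in rows:
        by_user[uid].append((datetime.strptime(day, "%m/%d/%Y").date(), product))

    # For each ordered pair, collect the set of customers showing it, so a
    # customer counts once however often they repeat the pattern.
    customers = defaultdict(set)
    for uid, purchases in by_user.items():
        for first_day, first in purchases:
            for later_day, later in purchases:
                if (later_day - first_day).days >= min_gap_days:
                    customers[(first, later)].add(uid)

    n_users = len(by_user)
    return {pair: len(uids) / n_users for pair, uids in customers.items()}


shares = lagged_pairs(rows)
for (first, later), share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{first} -> {later}: {share:.0%} of customers")
```

The inner loop compares every pair of purchases per customer, so it is quadratic in the number of purchases of a single customer; on two million rows that is probably fine only if individual purchase histories are short. The denominator here is all customers in the data, so change it if you want the share among customers who bought the first product.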