I have been trying to do a sequential analysis of products purchased after a certain period of time: for example, which product combinations customers purchase after 7 days, and what proportion of customers purchase each combination. I have tried the arulesSequences package, but my data is not structured the way the package expects. My columns are user ID, date of purchase, product ID and product name. I have searched a lot but have not found a clear way to do this.
Dayy UID leaf_category_name leaf_category_id
5/1/2018 47 Cubes 38860
5/1/2018 272 Pastas & Noodles 34616
5/1/2018 1827 Flavours & Spices 34619
5/1/2018 3505 Feature Phones 1506
This is the kind of data I have. UID stands for user ID, and the leaf category is, in simple terms, the product purchased. The dataset is large: 2,049,278 rows.
Code I have tried:
library(Matrix)
library(arules)
library(arulesSequences)
library(arulesViz)
#splitting data into transactions
transactions <- as(split(data$leaf_category_id, data$UID), "transactions")
frequent_sequences <- cspade(transactions, parameter=list(support=0.5))
and
# Convert tabular data to sequences. Item is in
# column 1, sequence ID is column 2, and event ID is column 3.
seqs = make_sequences(data, item_col = 1, sid_col = 2, eid_col = 3)
# generate frequent sequential patterns with minimum
# support of 0.1 and maximum of 6 elements
fseq = spade(seqs, 0.1, 6)
I want to look at the sequence of products purchased after a certain number of days. Can someone help me with this?
Thank You
The apriori path is quite nice. However, since I don't have your data, I'll use a well-known dataset, Groceries, as an example (in your case, you can first subset your data to the purchases after the date you want):
library(arules)
data(Groceries)
# frequent itemsets: products (and combinations) with support of at least 0.07
frequentItems <- eclat(Groceries, parameter = list(supp = 0.07, maxlen = 15))
inspect(frequentItems)
items support count
[1] {other vegetables,whole milk} 0.07483477 736
[2] {whole milk} 0.25551601 2513
[3] {other vegetables} 0.19349263 1903
[4] {rolls/buns} 0.18393493 1809
[5] {yogurt} 0.13950178 1372
[6] {soda} 0.17437722 1715
[7] {root vegetables} 0.10899847 1072
[8] {tropical fruit} 0.10493137 1032
[9] {bottled water} 0.11052364 1087
[10] {sausage} 0.09395018 924
[11] {shopping bags} 0.09852567 969
[12] {citrus fruit} 0.08276563 814
[13] {pastry} 0.08896797 875
[14] {pip fruit} 0.07564820 744
[15] {whipped/sour cream} 0.07168277 705
[16] {fruit/vegetable juice} 0.07229283 711
[17] {newspapers} 0.07981698 785
[18] {bottled beer} 0.08052872 792
[19] {canned beer} 0.07768175 764
If you prefer, you can plot the most frequent items:
itemFrequencyPlot(Groceries, topN=5, type="absolute")
Then you can see the association rules:
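association <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
# sort the rules by confidence so the strongest ones come first
association_conf <- sort(association, by = "confidence", decreasing = TRUE)
inspect(head(association_conf))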
lhs rhs support confidence lift count
[1] {rice,sugar} => {whole milk} 0.001220132 1 3.913649 12
[2] {canned fish,hygiene articles} => {whole milk} 0.001118454 1 3.913649 11
[3] {root vegetables,butter,rice} => {whole milk} 0.001016777 1 3.913649 10
[4] {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001728521 1 3.913649 17
[5] {butter,soft cheese,domestic eggs} => {whole milk} 0.001016777 1 3.913649 10
[6] {citrus fruit,root vegetables,soft cheese} => {other vegetables} 0.001016777 1 5.168156 10
The last column, count, shows how many times each rule appears. You can read this as "how many rows" and, if each row is a customer, as the number of customers. However, you have to decide what "how many customers" means for you. For example, for the purchases a,b,a,c you might want
a,b,a,c >>> count = 4
(every purchase counts) or
a,b,a,c >>> count = 3
(a repeated product counts once). This is pseudocode; which one applies depends on your data.
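To make the counting question concrete, here is a minimal base R sketch on toy data shaped like the question's table. The rows are made up for illustration, the dates are assumed to be month/day/year, and "after 7 days" is assumed to mean at least 7 days after each customer's first purchase; adjust all three to your case.

```r
# Toy data in the same shape as the question (values are invented).
purchases <- data.frame(
  Dayy = c("5/1/2018", "5/1/2018", "5/9/2018", "5/10/2018",
           "5/3/2018", "5/12/2018", "5/12/2018"),
  UID  = c(47, 272, 47, 272, 47, 47, 47),
  leaf_category_name = c("Cubes", "Pastas & Noodles", "Feature Phones",
                         "Feature Phones", "Cubes", "Cubes", "Cubes"),
  stringsAsFactors = FALSE
)

# Assumption: dates are month/day/year.
purchases$date <- as.Date(purchases$Dayy, format = "%m/%d/%Y")

# Days elapsed since each customer's first purchase.
first_date <- ave(as.numeric(purchases$date), purchases$UID, FUN = min)
purchases$days_since_first <- as.numeric(purchases$date) - first_date

# Keep only purchases made at least 7 days after the first one.
later <- purchases[purchases$days_since_first >= 7, ]

# Every purchase row counts (the "count = 4" reading).
purchase_rows_per_category <- table(later$leaf_category_name)

# Each customer counts once per category (the "count = 3" reading).
later_unique <- unique(later[, c("UID", "leaf_category_name")])
customers_per_category <- table(later_unique$leaf_category_name)

# Proportion of all customers who bought each category after 7 days.
share <- customers_per_category / length(unique(purchases$UID))
print(share)
```

On your real data, the `later` subset is what you would then turn into transactions and pass to `eclat` or `apriori`.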
edit
Finally, you can have a look at this. As you've stated, there is also the cspade algorithm, which can help.
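As a rough sketch of how the question's table could be reshaped for cspade: the algorithm expects one basket per customer and time point, with a sequenceID and an integer eventID attached to each basket. The toy rows, the month/day/year date format, and the choice of "days since the earliest date" as eventID are assumptions for illustration.

```r
# Toy data in the same shape as the question (values are invented).
purchases <- data.frame(
  Dayy = c("5/1/2018", "5/1/2018", "5/9/2018", "5/1/2018", "5/10/2018"),
  UID  = c(47, 47, 47, 272, 272),
  leaf_category_id = c(38860, 34619, 1506, 34616, 1506),
  stringsAsFactors = FALSE
)

# Assumption: dates are month/day/year.
purchases$date <- as.Date(purchases$Dayy, format = "%m/%d/%Y")

# eventID: whole days since the earliest date in the data, starting at 1.
purchases$eventID <- as.integer(purchases$date - min(purchases$date)) + 1L

# cspade needs the baskets ordered by sequenceID, then eventID.
purchases <- purchases[order(purchases$UID, purchases$eventID), ]

# One basket per customer and day, with duplicate items removed.
key <- paste(purchases$UID, purchases$eventID, sep = "_")
key <- factor(key, levels = unique(key))  # keep the sorted order
items <- lapply(split(as.character(purchases$leaf_category_id), key), unique)

# Matching sequenceID / eventID for each basket, in the same order.
info <- unique(purchases[, c("UID", "eventID")])
names(info)[1] <- "sequenceID"

# With arules and arulesSequences installed, the baskets could then be
# passed on roughly like this (not run here; parameter values are examples):
# library(arulesSequences)
# trans <- as(items, "transactions")
# transactionInfo(trans)$sequenceID <- info$sequenceID
# transactionInfo(trans)$eventID    <- info$eventID
# seqs <- cspade(trans, parameter = list(support = 0.1, mingap = 7))
# inspect(seqs)
```

Because eventID is measured in days here, the `mingap` parameter of cspade is one way to ask for consecutive purchases that are at least 7 days apart; check the package documentation for its exact semantics before relying on it.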