I have a set of data by store by week with 2 features - sales and pp. I want to cluster them based on these features ideally to group the ones with the most similar trading patterns throughout the year.
Is it possible to do this with the data I have. My understanding is the features are the columns and the rows are what will be assigned labels based on the clusters, but I have weeks to take into account so I don't know if these should be in the rows or the columns?
library(data.table)
dt <- data.table(weeks = rep(seq(1:5),5),
store = c(rep("a", 5), rep("b", 5), rep("c", 5),
rep("e", 5), rep("d", 5)),
sales = rep(rnorm(5), 5),
pp = rep(rnorm(5), 5))
dt <- dcast.data.table(dt, store ~ weeks, value.var = c("sales", "pp"))
thanks
Since you have more than 1000 stores, the below demo might not be applicable directly but hopefully point you in the right direction.
You can analyze clustering of stores for a weekly basis or any another variant of weeks(N=4,8,...)
Here we look at store clusters at weekly frequency:
Test Data
library(dplyr) #data manipulation
library(ggdendro) #extracting clusters
library(ggplot2) #plotting
library(gridExtra) #for arranging ggplot graphs in grids
set.seed(42)
DF <- data.frame(weeks = rep(seq(1:5),5),
store = c(rep("A", 5), rep("B", 5), rep("C", 5),
rep("E", 5), rep("D", 5)),
sales = rnorm(25),
pp = rnorm(25))
weekInput = unique(DF$weeks)
Choosing frequency of weeks:
In the simple case, I have included discrete weeks i.e. 1,2,3,4 and 5.
Suppose you wish to choose N = 4 as frequency of weeks i.e. weeks 1-to-4,5-to-8 etc, you can use following to create partitions of weekInput over the year and modify as per your requirements.
#weekFreq = 4
#StartWeek = 1
#EndWeek = 52
#startPoints=seq(StartWeek,EndWeek,weekFreq)
#endPoints= c(tail(startPoints,-1)-1,EndWeek)
#
#freqDF = data.frame(cbind(startPoints,endPoints))
#weekInput = lapply(1:nrow(freqDF),function(x) { z= freqDF[x,]; z=as.vector(as.matrix(z)) } )
#head(weekInput)
#[[1]]
#[1] 1 4
#
#[[2]]
#[1] 5 8
#
#[[3]]
#[1] 9 12
#
#[[4]]
#[1] 13 16
#
#[[5]]
#[1] 17 20
#
#[[6]]
#[1] 21 24
Plotting of clusters
Good resources for plotting various dendrograms are here and here
For each week, we calculate distance matrix over numeric data and create hierarchical clusters and use ggdendro package for plotting and output a list of plot objects
plotList = lapply(weekInput,function(x) {
subsetWeek=DF %>%
group_by(weeks) %>%
filter(weeks==x) %>% #you could change this to `weeks %in% c(x[1],x[2])`
as.data.frame() %>% # x[1] and x[2] are start and end points of weekInput
select(-weeks) %>%
as.data.frame()
#For numeric features of data, compute the distance matrix and form hierarchical cluster
numericDF= subsetWeek[,sapply(subsetWeek,is.numeric)]
clustDF = hclust(dist(numericDF))
#You can choose to limit the clusters to N = n, as per your discretion
#clustDF = cutree(clustDF, 4)
clustDF$labels = subsetWeek$store
#Use functions from ggdendro package for extracting clusters for ease in plotting
clustDendro = as.dendrogram(clustDF)
dendroData = dendro_data(clustDendro,type="rectangle")
Labels = label(dendroData)
Labels$group <- c(rep("Area1", 2), rep("Area2", 2), rep("Area3", 1))
gPlot = ggplot(segment(dendroData)) +
geom_segment(aes(x=x,y=y,xend=xend,yend=yend)) +
geom_label(data=Labels,aes(label=label,x=x,y=0,label.size=5,colour=Labels$group,fontface="bold")) +
ggtitle(paste0("Store Clusters for Week:",x)) +
labs(color="Area Names\n")
gPlot = gPlot + theme(legend.title = element_text(face = "bold"))
return(gPlot)
})
Arranging Plots
The list of plot objects from above can be arranged as per requirement. Useful link on more details are here
grid::grid.newpage()
grid::grid.draw(do.call(rbind,lapply(plotList,function(x) ggplotGrob(x))))
All Weeks
Due to single column arrangement, readability is slightly affected