r cluster-analysis k-means hierarchical-clustering

Clustering with multiple features by week

I have a set of data by store by week with 2 features - sales and pp. I want to cluster them based on these features ideally to group the ones with the most similar trading patterns throughout the year.

Is it possible to do this with the data I have. My understanding is the features are the columns and the rows are what will be assigned labels based on the clusters, but I have weeks to take into account so I don't know if these should be in the rows or the columns?

library(data.table)
dt <- data.table(weeks = rep(seq(1:5),5),
             store = c(rep("a", 5), rep("b", 5), rep("c", 5), 
                       rep("e", 5), rep("d", 5)),
             sales = rep(rnorm(5), 5),
             pp = rep(rnorm(5), 5))
dt <- dcast.data.table(dt, store ~ weeks, value.var = c("sales", "pp"))

thanks

Solution

Since you have more than 1000 stores, the below demo might not be applicable directly but hopefully point you in the right direction.

You can analyze clustering of stores for a weekly basis or any another variant of weeks(N=4,8,...)

Here we look at store clusters at weekly frequency:

Test Data

library(dplyr)     #data manipulation
library(ggdendro)  #extracting clusters
library(ggplot2)   #plotting
library(gridExtra) #for arranging ggplot graphs in grids 

set.seed(42)

DF <- data.frame(weeks = rep(seq(1:5),5),
    store = c(rep("A", 5), rep("B", 5), rep("C", 5), 
            rep("E", 5), rep("D", 5)),
    sales = rnorm(25),
    pp = rnorm(25))


weekInput = unique(DF$weeks)

Choosing frequency of weeks:

In the simple case, I have included discrete weeks i.e. 1,2,3,4 and 5.

Suppose you wish to choose N = 4 as frequency of weeks i.e. weeks 1-to-4,5-to-8 etc, you can use following to create partitions of weekInput over the year and modify as per your requirements.

#weekFreq = 4
#StartWeek = 1
#EndWeek = 52
#startPoints=seq(StartWeek,EndWeek,weekFreq)    
#endPoints= c(tail(startPoints,-1)-1,EndWeek)
#
#freqDF = data.frame(cbind(startPoints,endPoints))
#weekInput = lapply(1:nrow(freqDF),function(x) { z= freqDF[x,]; z=as.vector(as.matrix(z)) } )
#head(weekInput)
#[[1]]
#[1] 1 4
#
#[[2]]
#[1] 5 8
#
#[[3]]
#[1]  9 12
#
#[[4]]
#[1] 13 16
#
#[[5]]
#[1] 17 20
#
#[[6]]
#[1] 21 24

Plotting of clusters

Good resources for plotting various dendrograms are here and here

For each week, we calculate distance matrix over numeric data and create hierarchical clusters and use ggdendro package for plotting and output a list of plot objects

plotList = lapply(weekInput,function(x) {

subsetWeek=DF %>% 
  group_by(weeks) %>% 
  filter(weeks==x) %>%  #you could change this to `weeks %in% c(x[1],x[2])`
  as.data.frame()  %>%  # x[1] and x[2] are start and end points of weekInput
  select(-weeks) %>% 
  as.data.frame()

#For numeric features of data, compute the distance matrix and form hierarchical cluster

numericDF= subsetWeek[,sapply(subsetWeek,is.numeric)]

clustDF = hclust(dist(numericDF))

#You can choose to limit the clusters to N = n, as per your discretion
#clustDF =  cutree(clustDF, 4)


clustDF$labels = subsetWeek$store

#Use functions from ggdendro package for extracting clusters for ease in plotting

clustDendro = as.dendrogram(clustDF)

dendroData = dendro_data(clustDendro,type="rectangle")

Labels = label(dendroData)
Labels$group <- c(rep("Area1", 2), rep("Area2", 2), rep("Area3", 1))


gPlot = ggplot(segment(dendroData)) +
    geom_segment(aes(x=x,y=y,xend=xend,yend=yend)) + 
    geom_label(data=Labels,aes(label=label,x=x,y=0,label.size=5,colour=Labels$group,fontface="bold")) +
    ggtitle(paste0("Store Clusters for Week:",x)) +
    labs(color="Area Names\n")


gPlot = gPlot + theme(legend.title = element_text(face = "bold"))

return(gPlot)


})

Arranging Plots

The list of plot objects from above can be arranged as per requirement. Useful link on more details are here

grid::grid.newpage()
grid::grid.draw(do.call(rbind,lapply(plotList,function(x) ggplotGrob(x))))

Single Week:

All Weeks

Due to single column arrangement, readability is slightly affected