Search code examples
rcluster-analysisk-meanshierarchical-clustering

Clustering with multiple features by week


I have a set of data by store by week with 2 features - sales and pp. I want to cluster them based on these features ideally to group the ones with the most similar trading patterns throughout the year.

Is it possible to do this with the data I have. My understanding is the features are the columns and the rows are what will be assigned labels based on the clusters, but I have weeks to take into account so I don't know if these should be in the rows or the columns?

library(data.table)
dt <- data.table(weeks = rep(seq(1:5),5),
             store = c(rep("a", 5), rep("b", 5), rep("c", 5), 
                       rep("e", 5), rep("d", 5)),
             sales = rep(rnorm(5), 5),
             pp = rep(rnorm(5), 5))
dt <- dcast.data.table(dt, store ~ weeks, value.var = c("sales", "pp"))

thanks


Solution

  • Since you have more than 1000 stores, the below demo might not be applicable directly but hopefully point you in the right direction.

    You can analyze clustering of stores for a weekly basis or any another variant of weeks(N=4,8,...)

    Here we look at store clusters at weekly frequency:

    Test Data

    library(dplyr)     #data manipulation
    library(ggdendro)  #extracting clusters
    library(ggplot2)   #plotting
    library(gridExtra) #for arranging ggplot graphs in grids 
    
    set.seed(42)
    
    DF <- data.frame(weeks = rep(seq(1:5),5),
        store = c(rep("A", 5), rep("B", 5), rep("C", 5), 
                rep("E", 5), rep("D", 5)),
        sales = rnorm(25),
        pp = rnorm(25))
    
    
    weekInput = unique(DF$weeks)
    

    Choosing frequency of weeks:

    In the simple case, I have included discrete weeks i.e. 1,2,3,4 and 5.

    Suppose you wish to choose N = 4 as frequency of weeks i.e. weeks 1-to-4,5-to-8 etc, you can use following to create partitions of weekInput over the year and modify as per your requirements.

    #weekFreq = 4
    #StartWeek = 1
    #EndWeek = 52
    #startPoints=seq(StartWeek,EndWeek,weekFreq)    
    #endPoints= c(tail(startPoints,-1)-1,EndWeek)
    #
    #freqDF = data.frame(cbind(startPoints,endPoints))
    #weekInput = lapply(1:nrow(freqDF),function(x) { z= freqDF[x,]; z=as.vector(as.matrix(z)) } )
    #head(weekInput)
    #[[1]]
    #[1] 1 4
    #
    #[[2]]
    #[1] 5 8
    #
    #[[3]]
    #[1]  9 12
    #
    #[[4]]
    #[1] 13 16
    #
    #[[5]]
    #[1] 17 20
    #
    #[[6]]
    #[1] 21 24
    

    Plotting of clusters

    Good resources for plotting various dendrograms are here and here

    For each week, we calculate distance matrix over numeric data and create hierarchical clusters and use ggdendro package for plotting and output a list of plot objects

    plotList = lapply(weekInput,function(x) {
    
    subsetWeek=DF %>% 
      group_by(weeks) %>% 
      filter(weeks==x) %>%  #you could change this to `weeks %in% c(x[1],x[2])`
      as.data.frame()  %>%  # x[1] and x[2] are start and end points of weekInput
      select(-weeks) %>% 
      as.data.frame()
    
    #For numeric features of data, compute the distance matrix and form hierarchical cluster
    
    numericDF= subsetWeek[,sapply(subsetWeek,is.numeric)]
    
    clustDF = hclust(dist(numericDF))
    
    #You can choose to limit the clusters to N = n, as per your discretion
    #clustDF =  cutree(clustDF, 4)
    
    
    clustDF$labels = subsetWeek$store
    
    #Use functions from ggdendro package for extracting clusters for ease in plotting
    
    clustDendro = as.dendrogram(clustDF)
    
    dendroData = dendro_data(clustDendro,type="rectangle")
    
    Labels = label(dendroData)
    Labels$group <- c(rep("Area1", 2), rep("Area2", 2), rep("Area3", 1))
    
    
    gPlot = ggplot(segment(dendroData)) +
        geom_segment(aes(x=x,y=y,xend=xend,yend=yend)) + 
        geom_label(data=Labels,aes(label=label,x=x,y=0,label.size=5,colour=Labels$group,fontface="bold")) +
        ggtitle(paste0("Store Clusters for Week:",x)) +
        labs(color="Area Names\n")
    
    
    gPlot = gPlot + theme(legend.title = element_text(face = "bold"))
    
    return(gPlot)
    
    
    })
    

    Arranging Plots

    The list of plot objects from above can be arranged as per requirement. Useful link on more details are here

    grid::grid.newpage()
    grid::grid.draw(do.call(rbind,lapply(plotList,function(x) ggplotGrob(x))))
    

    Single Week: enter image description here

    All Weeks

    Due to single column arrangement, readability is slightly affected enter image description here