I am trying to cluster income trajectories using a large longitudinal dataset containing participants’ yearly reported incomes.
I have chosen to calculate distances between the trajectories using dynamic time warping (DTW), and have successfully parallelised this step using proxy::dist with the dtw_basic distance that the package dtwclust registers with proxy (see Stage 2 below).
I have also been able to calculate PAM clusters for k=2:40 using a regular for loop with no parallelisation (see Stage 3 below). However, if possible, I would like to parallelise this stage of my analysis as well to save time.
Does anyone have any suggestions for how I can parallelise this clustering process?
P.S. I have tried using tsclust from the package dtwclust. This does successfully parallelise the clustering; however, it also seems to crash my R session if I pass in too many separate values of k. If anyone is aware of a clustering function in dtwclust that will accept a pre-calculated distance matrix as input, that would be ideal, though of course any other solutions are also very welcome!
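For context, the kind of call that crashes looks roughly like the following. This is a reconstruction using the example data from Stage 1 below, not my exact code; tsclust accepts a vector of k values and fits every partition in a single call, which appears to be what overwhelms the session.

# Illustrative reconstruction of the crashing call: fitting all of
# k = 2:40 within one tsclust() call (arguments are indicative only)
fits = tsclust(inc_traj, k = 2L:40L, distance = "dtw_basic", centroid = "pam",
               seed = 3247, trace = TRUE)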
EXAMPLE CODE
Stage 1: Import libraries and format data
# Import required libraries
library(tidyverse)
library(dtwclust)
library(parallel)
library(cluster)
# Set seed for reproducible results
set.seed(123)
# Generate varying lengths for the sample income trajectories
lengths = sample(7:10, 500, replace = TRUE)
# Use rnorm to generate income trajectories of the lengths defined above
inc_traj = map(lengths, ~ rnorm(.x, 1588.647, 1484.186))
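A quick optional sanity check confirms the simulated data has the expected shape:

# Optional check: 500 trajectories, each of length 7 to 10
length(inc_traj)                  # should be 500
range(sapply(inc_traj, length))   # should be 7 10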
Stage 2: Calculate distance matrix (parallelised)
# Set up parallelisation
# Code taken from https://cran.r-project.org/web/packages/dtwclust/vignettes/parallelization-considerations.html
# create multi-process workers
workers <- makeCluster(detectCores())
# load dtwclust in each one, and make them use 1 thread per worker
invisible(clusterEvalQ(workers, {
library(dtwclust)
RcppParallel::setThreadOptions(1L)
}))
# register your workers, e.g. with doParallel
require(doParallel)
registerDoParallel(workers)
# Calculate distance matrix
distmat = proxy::dist(inc_traj, method = "dtw_basic")
Stage 3: Calculate PAM clusters (not parallelised)
# Create empty list to be populated with clusters
clusters = list()
# For loop which calculates partitions around medoids for k = 2:40
for (i in 2:40) {
  clusters[[i]] = pam(distmat, k = i, diss = TRUE)
  cat("\r", paste0(i, " of 40 clusters calculated."))
}
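For what it's worth, the most direct way to parallelise this loop would be foreach over k, since pam accepts the pre-computed distance matrix. A minimal sketch, assuming the doParallel backend from Stage 2 is still registered:

# Sketch: fit each k on a separate worker; distmat is exported to the
# workers automatically from the calling environment
library(foreach)
clusters = foreach(i = 2:40, .packages = "cluster") %dopar% {
  pam(distmat, k = i, diss = TRUE)
}
# Note the indexing shift: clusters[[1]] now holds the k = 2 solution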
OK, I’ve managed to successfully pass a pre-calculated distance matrix to the function tsclust, with help from @Alexis and code from the dtwclust GitHub page. My solution is below for anyone else who’s interested.
Stage 4: Calculate PAM clusters (parallelised)
# Define the candidate numbers of clusters as integers (1:40 is already an integer vector)
ks = 1:40
# Create empty list to be populated with clusters
clusters = list()
# For loop which calculates partitions around medoids for k = 2:40,
# passing the pre-calculated distance matrix via partitional_control()
for (i in 2:40) {
  clusters[[i]] = tsclust(inc_traj, k = ks[i], distance = "dtw_basic", centroid = "pam",
                          control = partitional_control(distmat = distmat),
                          seed = 3247, trace = TRUE)
  cat("\r", paste0(i, " of 40 clusters calculated."))
}
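Once everything is finished, it's good practice to shut down the workers created in Stage 2:

# Stop the workers and restore the sequential foreach backend
stopCluster(workers)
registerDoSEQ()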
I also compared the processing time of the tsclust method outlined in Stage 4 above with the pam method outlined in Stage 3 of the original question, for k=2:4. I did this on my full dataset of 34,591 trajectories and found the tsclust approach was substantially faster, so I will be using it moving forward. I’ve reported the processing times in the table below in case others are interested, though it’s probably worth noting that I have access to a machine with 28 CPU cores, so the time differences may be less dramatic on regular desktops.
| Method  | k | Time           |
|---------|---|----------------|
| tsclust | 2 | 18.584 seconds |
| tsclust | 3 | 25.746 seconds |
| tsclust | 4 | 15.37 seconds  |
| pam     | 2 | 6.195 minutes  |
| pam     | 3 | 6.231 minutes  |
| pam     | 4 | 9.658 minutes  |