r cluster-analysis

Two-step cluster


I have a dataset in this format:

structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26, 
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47, 
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43, 
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11, 
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA, 
-4L))

Every id is unique.

I would like to make two clusters:

First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before
Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after

I used two-step clustering in SPSS for this.

How can I do this in R?


Solution

  • This question is quite complex. Here is how I would approach the problem, in the hope that it helps and perhaps starts a discussion.

    Note:

    • this is only how I think the problem could be approached;
    • I am not familiar with two-step clustering, so I use k-means instead;
    • the code is tied to your data because that is simpler to explain, but you can easily generalize it.

    So, you create a first clustering on the before variables, then the values change (the after variables), and you want to see whether each id is still in the same cluster.

    This leads me to think that you only need the first set of clusters (for the before variables). There is no need for a second clustering: it is enough to check whether each id has moved from one of those clusters to the other.

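    # this assumes the data frame from the question has been assigned to df
    # first, fit the clustering model on the before variables (columns 2 to 6);
    # I'm using a simple k-means with 2 clusters
    set.seed(1234)
    model <- kmeans(df[, 2:6], centers = 2)
    
    # store the cluster assignments in the dataset
    df$before_cluster <- model$cluster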
    

    Now the idea is to calculate the Euclidean distance from each id's new values (the after variables) to the centroids that were computed on the before variables:

    # distance from each id's after values (columns 7 to 11) to the first centroid
    cl1 <- list()
    for (i in seq_len(nrow(df))) {
      cl1[[i]] <- dist(rbind(df[i, 7:11], model$centers[1, ]))
    }
    
    cl1 <- do.call(rbind, cl1)
    colnames(cl1) <- 'first'
    
    # distance from each id's after values to the second centroid
    cl2 <- list()
    for (i in seq_len(nrow(df))) {
      cl2[[i]] <- dist(rbind(df[i, 7:11], model$centers[2, ]))
    }
    
    cl2 <- do.call(rbind, cl2)
    colnames(cl2) <- 'second'
    
    # put them together 
    df <- cbind(df, cl1, cl2)
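
    The two loops above compute one distance at a time. As a minimal sketch of a vectorized alternative (not part of the original approach), the same Euclidean distances could be computed for all ids and both centroids in one step. It assumes the question's data is assigned to `df`; the names `before_cols`, `after_cols`, `after_mat` and `dists` are made up for illustration.

    ```r
    # Data from the question, assigned to `df` (the name the answer assumes)
    df <- data.frame(
      id = 1:4,
      var1_before = c(-0.16, -0.31, -0.26, -0.77),
      var2_before = c(-0.7, -1.06, -0.51, -0.81),
      var3_before = c(2.47, 2.97, 2.91, 3.01),
      var4_before = c(-1.08, -1.22, -0.92, -1.16),
      var5_before = c(0.54, 0.4, 0.46, 0.79),
      var1_after = c(-0.43, -0.18, -0.59, 0.64),
      var2_after = c(-0.69, -0.38, -1.19, -0.77),
      var3_after = c(2.97, 3.15, 3.35, 1.52),
      var4_after = c(-1.11, -0.99, -1.26, -0.39),
      var5_after = c(1.22, 0.41, 1.01, 0.24)
    )

    # Illustrative helper names: select columns by name instead of position
    before_cols <- paste0("var", 1:5, "_before")
    after_cols  <- paste0("var", 1:5, "_after")

    set.seed(1234)
    model <- kmeans(df[, before_cols], centers = 2)

    # For each centroid (row of model$centers), subtract it from every
    # "after" row and take the Euclidean norm of the differences.
    after_mat <- as.matrix(df[, after_cols])
    dists <- apply(model$centers, 1, function(center) {
      sqrt(rowSums(sweep(after_mat, 2, center)^2))
    })
    colnames(dists) <- c("first", "second")

    dists  # one row per id, one column per centroid
    ```

    Selecting columns by name rather than by position (`7:11`) keeps the sketch working if the column order of the data frame changes.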
    

    Now the last part: to decide whether an id has changed cluster, take the smaller of the two distances to the centroids. The nearest centroid gives the "new" cluster.

    df$new_cl <- ifelse(df$first < df$second, 1,2)
    df
      id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after before_cluster     first    second new_cl
    1  1       -0.16       -0.70        2.47       -1.08        0.54      -0.43      -0.69       2.97      -1.11       1.22              2 0.6852372 0.8151840      1
    2  2       -0.31       -1.06        2.97       -1.22        0.40      -0.18      -0.38       3.15      -0.99       0.41              1 0.7331098 0.5208887      2
    3  3       -0.26       -0.51        2.91       -0.92        0.46      -0.59      -1.19       3.35      -1.26       1.01              2 0.6117598 1.1180004      1
    4  4       -0.77       -0.81        3.01       -1.16        0.79       0.64      -0.77       1.52      -0.39       0.24              1 2.0848381 1.5994765      2
    

    It seems that all four ids have changed cluster.
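
    The comparison can also be written so that it works with more than two clusters. A minimal sketch, using the distances and before-clusters printed above (the names `dists`, `res` and `transitions` are illustrative, not from the original code): `which.min` picks the nearest centroid for each row, and `table` cross-tabulates the before cluster against the new one.

    ```r
    # Distances to each centroid, copied from the output printed above
    dists <- cbind(
      first  = c(0.6852372, 0.7331098, 0.6117598, 2.0848381),
      second = c(0.8151840, 0.5208887, 1.1180004, 1.5994765)
    )

    res <- data.frame(
      id = 1:4,
      before_cluster = c(2, 1, 2, 1)  # also copied from the output above
    )

    # Index of the smallest distance in each row = nearest centroid.
    # Unlike ifelse(), this works for any number of centroid columns.
    res$new_cl <- apply(dists, 1, which.min)

    # Flag the ids whose cluster differs between before and after
    res$changed <- res$before_cluster != res$new_cl

    # Rows: cluster before, columns: cluster after
    transitions <- table(before = res$before_cluster, after = res$new_cl)
    print(transitions)

    sum(res$changed)  # 4: every id moved to the other cluster
    ```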