I have a dataset in this format:
structure(list(id = 1:4, var1_before = c(-0.16, -0.31, -0.26,
-0.77), var2_before = c(-0.7, -1.06, -0.51, -0.81), var3_before = c(2.47,
2.97, 2.91, 3.01), var4_before = c(-1.08, -1.22, -0.92, -1.16
), var5_before = c(0.54, 0.4, 0.46, 0.79), var1_after = c(-0.43,
-0.18, -0.59, 0.64), var2_after = c(-0.69, -0.38, -1.19, -0.77
), var3_after = c(2.97, 3.15, 3.35, 1.52), var4_after = c(-1.11,
-0.99, -1.26, -0.39), var5_after = c(1.22, 0.41, 1.01, 0.24)), class = "data.frame", row.names = c(NA,
-4L))
Every id is unique.
I would like to make two clusters:
First cluster for variables: var1_before, var2_before, var3_before, var4_before, var5_before Second cluster for variables: var1_after, var2_after, var3_after, var4_after, var5_after
I used two-step cluster in spss for this.
How is it possible to make it in R?
This question is quite complex, this is how I'd approach the problem, hoping to help and maybe to start a discussion about it.
Note:
So, you create the first clustering with the before
variables, then the value of the variable changes (after
variables), and you want to see if the id are in the same cluster.
This leads me to think that you only need the first set of clusters (for the before
variables), then see if the ids have changed: no need to do a second clustering, but only see if they've changed from the one cluster to another.
# first, you make your model of clustering, I'm using a simple kmeans
set.seed(1234)
model <- kmeans(df[,2:6],2)
# you put the clusters in the dataset
df$before_cluster <- model$cluster
Now the idea is to calculate the Euclidean distance from the ids with the new variables (after
variables), to the centroids calculated on the before
variabiles:
# for the first cluster
cl1 <- list()
for (i in 1:nrow(df)) {
cl1[[i]] <- dist(rbind(df[i,7:11], model$centers[1,] ))
}
cl1 <- do.call(rbind, cl1)
colnames(cl1) <- 'first'
# for the second cluster
cl2 <- list()
for (i in 1:nrow(df)) {
cl2[[i]] <- dist(rbind(df[i,7:11], model$centers[2,] ))
}
cl2 <- do.call(rbind, cl2)
colnames(cl2) <- 'second'
# put them together
df <- cbind(df, cl1, cl2)
Now the last part, you can define if one has changed the cluster, getting the smallest distance from the centroids (smallest --> it's the new cluster), fetching the "new" cluster.
df$new_cl <- ifelse(df$first < df$second, 1,2)
df
id var1_before var2_before var3_before var4_before var5_before var1_after var2_after var3_after var4_after var5_after first second before_cluster first second new_cl
1 1 -0.16 -0.70 2.47 -1.08 0.54 -0.43 -0.69 2.97 -1.11 1.22 0.6852372 0.8151840 2 0.6852372 0.8151840 1
2 2 -0.31 -1.06 2.97 -1.22 0.40 -0.18 -0.38 3.15 -0.99 0.41 0.7331098 0.5208887 1 0.7331098 0.5208887 2
3 3 -0.26 -0.51 2.91 -0.92 0.46 -0.59 -1.19 3.35 -1.26 1.01 0.6117598 1.1180004 2 0.6117598 1.1180004 1
4 4 -0.77 -0.81 3.01 -1.16 0.79 0.64 -0.77 1.52 -0.39 0.24 2.0848381 1.5994765 1 2.0848381 1.5994765 2
Seems they all have changed cluster.