I performed a k-medoids clustering analysis in R using the cluster package from CRAN. The data is in a data.frame called df4 with 13111 obs. of 11 binary and ordinal variables. After clustering, I added the cluster assignments to the original data.frame, so each user ID has a corresponding cluster number.
How do I aggregate the binary and ordinal choices according to cluster?
For example, the Gender variable has male/female values, and Age has the levels "18-20", "21-24", "25-34", "35-44", "45-54", "55-64", and "65+". I want the number of male and female values per cluster for Gender, and likewise the count of each category per cluster for Age.
Here’s the head of my data.frame with cluster label column:
#12 variables because I added the clustering object to the data.frame
#I only included two variables from the R output
> str(df4)
'data.frame': 13111 obs. of 12 variables:
$ Age : Factor w/ 7 levels "18-20","21-24",..: 6 6 6 6 7 6 5 7 6 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 2 1 2 2 ...
#I only included two variables from the R output
> head(df4)
    Age Gender
1 55-64 Female
2 55-64 Female
3 55-64   Male
4 55-64   Male
5   65+   Male
6 55-64 Female
Here’s a reproducible example similar to my dataset:
age <- c("18-20", "21-24", "25-34", "35-44", "45-54", "55-64", "65+")
gender <- c("Female", "Female", "Male", "Male", "Male", "Male", "Female")
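#stringsAsFactors = TRUE keeps both columns as factors (the default before R 4.0)
smalldf <- data.frame(age, gender, stringsAsFactors = TRUE)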
#Import cluster package
library(cluster)
#Create dissimilarity matrix
#Gower coefficient for finding distance between mixed variables
smalldaisy4 <- daisy(smalldf, metric = "gower",
                     type = list(symm = c(2), ordratio = c(1)))
#Set randomization seed
set.seed(1)
#Pam algorithm with 3 clusters
smallk4answers <- pam(smalldaisy4, 3, diss = TRUE)
#Apply cluster IDs to original data frame
smalldf$cluster <- smallk4answers$cluster
Desired result of output (hypothetical):
  cluster female male 18-20 21-24 25-34 35-44 45-54 55-64 65+
1       1      1    1     1     2     1     0     3     1   0
2       2      2    1     1     1     0     1     2     0   0
3       3      0    1     1     1     1     1     0     2   3
Let me know if I can provide more information.
It looks like you want to combine two tabulations, cluster-by-gender and cluster-by-age, side by side in one matrix:
with( smalldf, cbind(table(cluster, gender), table(cluster, age) ) )
#----------------
  Female Male 18-20 21-24 25-34 35-44 45-54 55-64 65+
1      2    0     1     1     0     0     0     0   0
2      0    4     0     0     1     1     1     1   0
3      1    0     0     0     0     0     0     0   1
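Since the real data has 11 variables, typing one table() call per column gets tedious. The following is a minimal sketch of the same idea applied to every column except the cluster label. It assumes all remaining columns are categorical. The names vars, counts and result are only illustrative, and the cluster column is hard-coded from the table above instead of being recomputed with pam().

```r
# Rebuild the small example; cluster IDs are copied from the table above,
# not recomputed with pam().
smalldf <- data.frame(
  age     = c("18-20", "21-24", "25-34", "35-44", "45-54", "55-64", "65+"),
  gender  = c("Female", "Female", "Male", "Male", "Male", "Male", "Female"),
  cluster = c(1, 1, 2, 2, 2, 2, 3),
  stringsAsFactors = TRUE
)

# Every column except the cluster label (illustrative name)
vars <- setdiff(names(smalldf), "cluster")

# One cluster-by-level count table per variable, bound side by side
counts <- do.call(
  cbind,
  lapply(vars, function(v) table(smalldf$cluster, smalldf[[v]]))
)

# Optional: a data.frame with an explicit cluster column;
# check.names = FALSE keeps column names such as "18-20" unchanged
result <- data.frame(cluster = rownames(counts), counts, check.names = FALSE)
print(result)
```

The columns come out in the order of vars, so here the age levels appear before Female and Male. If two variables share a level name (for example several Yes/No columns), the column names will collide, so prefixing each table's column names with the variable name before binding may be worthwhile.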