I'm working with a CSV dataset called productQuality1.1, which contains 5 columns, with Median being my product quality performance used to determine the clustering results. I have already found out that the best k cluster number is 2. How can I get the clustering results for my data? I have pasted the dput of my data below:
structure(list(weld.type.ID = 1:33, weld.type = structure(c(29L,
11L, 16L, 4L, 28L, 17L, 19L, 5L, 24L, 27L, 21L, 32L, 12L, 20L,
26L, 25L, 3L, 7L, 13L, 22L, 33L, 1L, 9L, 10L, 18L, 15L, 31L,
8L, 23L, 2L, 14L, 6L, 30L), .Label = c("1,40,Material A", "1,40S,Material C",
"1,80,Material A", "1,STD,Material A", "1,XS,Material A", "10,10S,Material C",
"10,160,Material A", "10,40,Material A", "10,40S,Material C",
"10,80,Material A", "10,STD,Material A", "10,XS,Material A",
"13,40,Material A", "13,40S,Material C", "13,80,Material A",
"13,STD,Material A", "13,XS,Material A", "14,40,Material A",
"14,STD,Material A", "14,XS,Material A", "15,STD,Material A",
"15,XS,Material A", "2,10S,Material C", "2,160,Material A", "2,40,Material A",
"2,40S,Material C", "2,80,Material A", "2,STD,Material A", "2,XS,Material A",
"4,80,Material A", "4,STD,Material A", "6,STD,Material A", "6,XS,Material A"
), class = "factor"), alpha = c(281L, 196L, 59L, 96L, 442L, 98L,
66L, 30L, 68L, 43L, 35L, 44L, 23L, 14L, 24L, 38L, 8L, 8L, 5L,
19L, 37L, 38L, 6L, 11L, 29L, 6L, 16L, 6L, 16L, 3L, 4L, 9L, 12L
), beta = c(7194L, 4298L, 3457L, 2982L, 4280L, 3605L, 2229L,
1744L, 2234L, 1012L, 1096L, 1023L, 1461L, 1303L, 531L, 233L,
630L, 502L, 328L, 509L, 629L, 554L, 358L, 501L, 422L, 566L, 403L,
211L, 159L, 268L, 167L, 140L, 621L), Median = c(0.0375507383753025,
0.043546015959685, 0.0166888869351212, 0.0310875876067419, 0.0935470294716035,
0.0263798143584636, 0.0286213698125569, 0.0167296957822645, 0.029403369311426,
0.0404683392593359, 0.0306699148693358, 0.0409507113292405, 0.0152814823151512,
0.0103834693100336, 0.0426953962552843, 0.139335880048896, 0.0120333156133183,
0.0150573864235556, 0.0140547965388361, 0.0354001989345449, 0.0551110033888123,
0.0636987097619679, 0.0156058684578843, 0.0208640835981798, 0.0636580207464108,
0.00992440459162821, 0.0374531528739036, 0.0262100640799903,
0.0898729525910631, 0.00989157442426205, 0.0215577154517479,
0.0584418091169483, 0.0184528408043719)), class = "data.frame", row.names = c(NA,
-33L))
I am guessing you more or less know there's two clusters, and you want to see whether clustering gives you a good separation on the Median variable.
First we look at your data frame:
summary(productQuality1.1)
weld.type.ID weld.type alpha beta
Min. : 1 1,40,Material A : 1 Min. : 3.00 Min. : 140
1st Qu.: 9 1,40S,Material C : 1 1st Qu.: 9.00 1st Qu.: 403
Median :17 1,80,Material A : 1 Median : 24.00 Median : 621
Mean :17 1,STD,Material A : 1 Mean : 54.24 Mean :1383
3rd Qu.:25 1,XS,Material A : 1 3rd Qu.: 44.00 3rd Qu.:1744
Max. :33 10,10S,Material C: 1 Max. :442.00 Max. :7194
(Other) :27
Median
Min. :0.009892
1st Qu.:0.016689
Median :0.029403
Mean :0.036686
3rd Qu.:0.042695
Max. :0.139336
You can only use alpha and beta, since ID, weld.type are unique entries (like identifiers). We do:
clus = kmeans(productQuality1.1[,c("alpha","beta")],2)
productQuality1.1$cluster = factor(clus$cluster)
Note that I use your alpha and beta values are on very different scales to start with. And we can visualize the clustering:
ggplot(productQuality1.1,aes(x=alpha,y=beta,col=cluster)) + geom_point()
It's not going to be easy to cut these observations into 2 clusters just using kmeans because some of them have really high alpha / beta values. We can also look at how your median values are spread:
ggplot(productQuality1.1,aes(x=alpha,y=beta,col=Median)) + geom_point() + scale_color_viridis_c()
Lastly we look at median values:
ggplot(productQuality1.1,aes(x=Median,col=cluster)) + geom_density()
I would say there are some in cluster 2 with a higher median, but some which you don't separate that easily. Given what we see in the scatter plots, might have to think more about how to use the alpha and beta values you have.