Tags: r, cluster-analysis, k-means

Finding clustering results in R


I'm working with a CSV dataset called productQuality1.1, which contains 5 columns; Median is my product quality performance measure and should determine the clustering results. I have already found that the best number of clusters is k = 2. How can I get the clustering results for my data? I have pasted the dput of my data below:

structure(list(weld.type.ID = 1:33, weld.type = structure(c(29L, 
11L, 16L, 4L, 28L, 17L, 19L, 5L, 24L, 27L, 21L, 32L, 12L, 20L, 
26L, 25L, 3L, 7L, 13L, 22L, 33L, 1L, 9L, 10L, 18L, 15L, 31L, 
8L, 23L, 2L, 14L, 6L, 30L), .Label = c("1,40,Material A", "1,40S,Material C", 
"1,80,Material A", "1,STD,Material A", "1,XS,Material A", "10,10S,Material C", 
"10,160,Material A", "10,40,Material A", "10,40S,Material C", 
"10,80,Material A", "10,STD,Material A", "10,XS,Material A", 
"13,40,Material A", "13,40S,Material C", "13,80,Material A", 
"13,STD,Material A", "13,XS,Material A", "14,40,Material A", 
"14,STD,Material A", "14,XS,Material A", "15,STD,Material A", 
"15,XS,Material A", "2,10S,Material C", "2,160,Material A", "2,40,Material A", 
"2,40S,Material C", "2,80,Material A", "2,STD,Material A", "2,XS,Material A", 
"4,80,Material A", "4,STD,Material A", "6,STD,Material A", "6,XS,Material A"
), class = "factor"), alpha = c(281L, 196L, 59L, 96L, 442L, 98L, 
66L, 30L, 68L, 43L, 35L, 44L, 23L, 14L, 24L, 38L, 8L, 8L, 5L, 
19L, 37L, 38L, 6L, 11L, 29L, 6L, 16L, 6L, 16L, 3L, 4L, 9L, 12L
), beta = c(7194L, 4298L, 3457L, 2982L, 4280L, 3605L, 2229L, 
1744L, 2234L, 1012L, 1096L, 1023L, 1461L, 1303L, 531L, 233L, 
630L, 502L, 328L, 509L, 629L, 554L, 358L, 501L, 422L, 566L, 403L, 
211L, 159L, 268L, 167L, 140L, 621L), Median = c(0.0375507383753025, 
0.043546015959685, 0.0166888869351212, 0.0310875876067419, 0.0935470294716035, 
0.0263798143584636, 0.0286213698125569, 0.0167296957822645, 0.029403369311426, 
0.0404683392593359, 0.0306699148693358, 0.0409507113292405, 0.0152814823151512, 
0.0103834693100336, 0.0426953962552843, 0.139335880048896, 0.0120333156133183, 
0.0150573864235556, 0.0140547965388361, 0.0354001989345449, 0.0551110033888123, 
0.0636987097619679, 0.0156058684578843, 0.0208640835981798, 0.0636580207464108, 
0.00992440459162821, 0.0374531528739036, 0.0262100640799903, 
0.0898729525910631, 0.00989157442426205, 0.0215577154517479, 
0.0584418091169483, 0.0184528408043719)), class = "data.frame", row.names = c(NA, 
-33L))


Solution

  • I am guessing you more or less know there are two clusters, and you want to see whether clustering gives you a good separation on the Median variable.

    First we look at your data frame:

    summary(productQuality1.1)
      weld.type.ID             weld.type      alpha             beta     
     Min.   : 1    1,40,Material A  : 1   Min.   :  3.00   Min.   : 140  
     1st Qu.: 9    1,40S,Material C : 1   1st Qu.:  9.00   1st Qu.: 403  
     Median :17    1,80,Material A  : 1   Median : 24.00   Median : 621  
     Mean   :17    1,STD,Material A : 1   Mean   : 54.24   Mean   :1383  
     3rd Qu.:25    1,XS,Material A  : 1   3rd Qu.: 44.00   3rd Qu.:1744  
     Max.   :33    10,10S,Material C: 1   Max.   :442.00   Max.   :7194  
                   (Other)          :27                                  
         Median        
     Min.   :0.009892  
     1st Qu.:0.016689  
     Median :0.029403  
     Mean   :0.036686  
     3rd Qu.:0.042695  
     Max.   :0.139336  
    

    You can only use alpha and beta for the clustering, since weld.type.ID and weld.type are unique to each row (they act as identifiers). We do:

    # kmeans() picks random starting centers, so the cluster labels (1 or 2) can change between runs
    clus = kmeans(productQuality1.1[,c("alpha","beta")],2)
    productQuality1.1$cluster = factor(clus$cluster)
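
    To actually read off the clustering results, the object returned by kmeans() can be inspected directly. The following is a minimal sketch, not the only way to do it: the three columns are copied from the dput in the question, and the seed value is an arbitrary assumption added only so that the labels are reproducible.

    ```r
    # Columns copied from the dput in the question
    productQuality1.1 <- data.frame(
      alpha = c(281, 196, 59, 96, 442, 98, 66, 30, 68, 43, 35, 44, 23, 14, 24,
                38, 8, 8, 5, 19, 37, 38, 6, 11, 29, 6, 16, 6, 16, 3, 4, 9, 12),
      beta = c(7194, 4298, 3457, 2982, 4280, 3605, 2229, 1744, 2234, 1012, 1096,
               1023, 1461, 1303, 531, 233, 630, 502, 328, 509, 629, 554, 358,
               501, 422, 566, 403, 211, 159, 268, 167, 140, 621),
      Median = c(0.0375507383753025, 0.043546015959685, 0.0166888869351212,
                 0.0310875876067419, 0.0935470294716035, 0.0263798143584636,
                 0.0286213698125569, 0.0167296957822645, 0.029403369311426,
                 0.0404683392593359, 0.0306699148693358, 0.0409507113292405,
                 0.0152814823151512, 0.0103834693100336, 0.0426953962552843,
                 0.139335880048896, 0.0120333156133183, 0.0150573864235556,
                 0.0140547965388361, 0.0354001989345449, 0.0551110033888123,
                 0.0636987097619679, 0.0156058684578843, 0.0208640835981798,
                 0.0636580207464108, 0.00992440459162821, 0.0374531528739036,
                 0.0262100640799903, 0.0898729525910631, 0.00989157442426205,
                 0.0215577154517479, 0.0584418091169483, 0.0184528408043719)
    )

    set.seed(1)  # arbitrary seed (assumption), only for reproducible labels
    clus <- kmeans(productQuality1.1[, c("alpha", "beta")], centers = 2)
    productQuality1.1$cluster <- factor(clus$cluster)

    clus$size     # number of rows assigned to each cluster
    clus$centers  # mean alpha and beta of each cluster

    # Mean of Median within each cluster
    aggregate(Median ~ cluster, data = productQuality1.1, FUN = mean)
    ```

    The cluster column then holds the assignment for every row, so head(productQuality1.1) or a write.csv() of the data frame gives the per-row result.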
    

    Note that I use your alpha and beta values as they are, even though they are on very different scales to start with, so beta will dominate the distances kmeans computes. And we can visualize the clustering:

    library(ggplot2)
    ggplot(productQuality1.1,aes(x=alpha,y=beta,col=cluster)) + geom_point()

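
    On the point about alpha and beta being on very different scales: one common option, which is not what the code above does, is to standardize both columns with scale() before clustering. The sketch below is only an illustration under that assumption; the data frame name pq, the seed, and nstart = 25 are choices made here, and the two columns are copied from the dput in the question.

    ```r
    # alpha and beta copied from the dput in the question
    pq <- data.frame(
      alpha = c(281, 196, 59, 96, 442, 98, 66, 30, 68, 43, 35, 44, 23, 14, 24,
                38, 8, 8, 5, 19, 37, 38, 6, 11, 29, 6, 16, 6, 16, 3, 4, 9, 12),
      beta = c(7194, 4298, 3457, 2982, 4280, 3605, 2229, 1744, 2234, 1012, 1096,
               1023, 1461, 1303, 531, 233, 630, 502, 328, 509, 629, 554, 358,
               501, 422, 566, 403, 211, 159, 268, 167, 140, 621)
    )

    # Rescale each column to mean 0 and standard deviation 1
    scaled <- scale(pq)

    set.seed(1)  # arbitrary seed (assumption)
    clus_raw    <- kmeans(pq,     centers = 2, nstart = 25)
    clus_scaled <- kmeans(scaled, centers = 2, nstart = 25)

    # Cross-tabulate the two assignments to see how many rows move
    table(raw = clus_raw$cluster, scaled = clus_scaled$cluster)
    ```

    If most rows sit in two cells of the table, scaling made little difference to the grouping; the label numbers themselves are arbitrary and may be swapped between the two runs.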

    It's not going to be easy to cut these observations into 2 clusters just using kmeans because some of them have really high alpha / beta values. We can also look at how your median values are spread:

    ggplot(productQuality1.1,aes(x=alpha,y=beta,col=Median)) + geom_point() + scale_color_viridis_c()


    Lastly, we look at how the Median values are distributed within each cluster:

    ggplot(productQuality1.1,aes(x=Median,col=cluster)) + geom_density()


    I would say there are some observations in cluster 2 with a higher Median, but others that you don't separate that easily. Given what we see in the scatter plots, you might have to think more about how to use the alpha and beta values you have.
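
    As one possible way of using alpha and beta differently, the few very large values can be compressed with a log transform before clustering. This is only a sketch of that idea, not something the plots above are based on; the name pq, the seed, and nstart = 25 are assumptions, and whether a log scale makes sense for these counts is for you to judge.

    ```r
    # alpha and beta copied from the dput in the question
    pq <- data.frame(
      alpha = c(281, 196, 59, 96, 442, 98, 66, 30, 68, 43, 35, 44, 23, 14, 24,
                38, 8, 8, 5, 19, 37, 38, 6, 11, 29, 6, 16, 6, 16, 3, 4, 9, 12),
      beta = c(7194, 4298, 3457, 2982, 4280, 3605, 2229, 1744, 2234, 1012, 1096,
               1023, 1461, 1303, 531, 233, 630, 502, 328, 509, 629, 554, 358,
               501, 422, 566, 403, 211, 159, 268, 167, 140, 621)
    )

    # All values are positive, so log() is defined; then standardize the columns
    logged <- scale(log(pq))

    set.seed(1)  # arbitrary seed (assumption)
    clus_log <- kmeans(logged, centers = 2, nstart = 25)

    clus_log$size     # rows per cluster on the log scale
    clus_log$centers  # cluster centers in standardized log units
    ```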