Search code examples
routputcluster-analysisk-meanssummary

How to see the elements hiding under "Other" in the output of a summary in R?


I'm using the following data set to perform a cluster analysis on categorical data - link to data set - using the following packages in R:

library(cluster)
library(dplyr)
library(ggplot2)
library(readr)

With the following code, I get to observe what the profile of clients is within 5 clusters (NB: I'm picking 5 clusters instead of 7 or 8 to keep things more or less simple):

df.torun <- subset(df.bank, select = -c(loan, contact, day, month, duration, campaign, pdays, previous, poutcome, y))

gower_dist <- daisy(df.torun, metric = "gower")

gower_mat <- as.matrix(gower_dist)

sil_width <- c(NA)
for(i in 2:8){
  pam_fit <- pam(gower_dist, diss = TRUE, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

plot(1:8, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette width")
lines(1:8, sil_width)

k <- 5
pam_fit <- pam(gower_dist, diss = TRUE, k)
pam_results <- df.torun %>% 
  mutate(cluster = pam_fit$clustering) %>% 
  group_by(cluster) %>% 
  do(the_summary = summary(.))
pam_results$the_summary

As you'll be able to see if you run this script using the data I shared, you'll get a lot of information about the profile o clients in the following categories: age, job, marital, education, default, balance and housing. Here's a screenshot of the results I get for cluster 1:

Results I get for cluster 1

As can be seen in the image above, under the job column, some of the results are "hiding" under the category (Other).

My question: what code can I use to list all the words from the job column that are "hiding" under (Other)?

Thank you very much for your help!


Solution

  • You may use maxsum=. Example:

    d <- data.frame(x=gl(10, 5), y=rnorm(50))
    
    summary(d)
    #       x            y          
    # 1      : 5   Min.   :-1.7459  
    # 2      : 5   1st Qu.:-0.8480  
    # 3      : 5   Median :-0.2293  
    # 4      : 5   Mean   :-0.1439  
    # 5      : 5   3rd Qu.: 0.4109  
    # 6      : 5   Max.   : 2.5951  
    # (Other):20            
    
    summary(d, maxsum=11)
    #  x           y          
    # 1 :5   Min.   :-1.7459  
    # 2 :5   1st Qu.:-0.8480  
    # 3 :5   Median :-0.2293  
    # 4 :5   Mean   :-0.1439  
    # 5 :5   3rd Qu.: 0.4109  
    # 6 :5   Max.   : 2.5951  
    # 7 :5                    
    # 8 :5                    
    # 9 :5                    
    # 10:5