r cluster-analysis hierarchical-clustering

How can I find out how many samples have loss > the median loss, and how many do not, using hierarchical clustering in R?

How can I find out how many samples have loss > the median loss, and how many do not, using hierarchical clustering in R? I am using the Allstate Claims Severity dataset. I think the numeric attributes are normalized, since they have values between 0 and 1.

This is my code:

claims<-read.csv("train.csv")
idx<-sample(1:dim(claims)[1],10000) #10000 random samples
claimsSample<-claims[idx,118:131] #retrieve the numeric features
distances<-dist(claimsSample,method="euclidean")
clusterClaims<-hclust(distances, method = "ward.D")
plot(clusterClaims)
clusterGroups<- cutree(clusterClaims,k=9)

So, how do I find the median and count the samples?

Solution

You should provide an example dataset, or point other SO users to the dataset of interest. "loss" can mean a lot of things...

So we can try something like this:

#claims = read.csv("https://raw.githubusercontent.com/Architectshwet/Allstate-Claims-Severity-Data/master/Datasets/train.csv")
set.seed(111)
idx<-sample(nrow(claims),10000) 
claimsSample<-claims[idx,118:131] 
distances<-dist(claimsSample,method="euclidean")
clusterClaims<-hclust(distances, method = "ward.D")
clusterGroups<- cutree(clusterClaims,k=9)

The clusterGroups labels are in the same order as the rows of claimsSample. claimsSample itself holds only the numeric features, so I take loss from claims$loss[idx], build a TRUE/FALSE vector that says whether each sampled observation's loss is above the median loss of the sample, and tabulate it against the cluster labels:

results = table(clusterGroups,claims$loss[idx] > median(claims$loss[idx]))
results

clusterGroups FALSE TRUE
            1   816  621
            2   691  687
            3   405  382
            4   886 1055
            5   493  499
            6   249  256
            7   462  481
            8   530  502
            9   468  517
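
If you cannot download the full file, the same idea can be tried on a small synthetic data frame. This is only a sketch: the data frame `toy`, its columns `cont1` and `cont2`, and the simulated `loss` values are made up for illustration and are not the Allstate data. It also adds row and column totals and the share of each cluster that is above the median, which may be easier to read than raw counts.

```r
# Synthetic stand-in for the claims data (hypothetical values, not the real file)
set.seed(111)
n <- 200
toy <- data.frame(
  cont1 = runif(n),                 # numeric features in [0, 1]
  cont2 = runif(n),
  loss  = rexp(n, rate = 1 / 3000)  # simulated, right-skewed loss
)

# Cluster on the numeric features only, following the same steps as above
distances <- dist(toy[, c("cont1", "cont2")], method = "euclidean")
tree <- hclust(distances, method = "ward.D")
groups <- cutree(tree, k = 3)

# TRUE where a row's loss is strictly above the median loss of the sample
aboveMedian <- toy$loss > median(toy$loss)

# Counts per cluster, printed with row and column totals
counts <- table(cluster = groups, aboveMedian = aboveMedian)
print(addmargins(counts))

# Share of each cluster below/above the median (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(shares, 3))
```

The FALSE column counts rows with loss at or below the median, so with no ties the two column totals split the sample in half.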