How can I find out how many samples have a loss greater than the median loss, and how many do not, within the clusters produced by hierarchical clustering in R? I am using the Allstate Claims Severity dataset. I think the numeric attributes are normalized, since they take values between 0 and 1.
This is my code:
claims<-read.csv("train.csv")
idx<-sample(1:dim(claims)[1],10000) #10000 random samples
claimsSample<-claims[idx,118:131] #retrieve the numeric features
distances<-dist(claimsSample,method="euclidean")
clusterClaims<-hclust(distances, method = "ward.D")
plot(clusterClaims)
clusterGroups<- cutree(clusterClaims,k=9)
So, how do I find the median and count the samples in each cluster?
You should really provide an example dataset, or point other SO users to the dataset of interest. "loss" can mean a lot of things...
So we can try something like this:
#claims = read.csv("https://raw.githubusercontent.com/Architectshwet/Allstate-Claims-Severity-Data/master/Datasets/train.csv")
set.seed(111)
idx<-sample(nrow(claims),10000)
claimsSample<-claims[idx,118:131]
distances<-dist(claimsSample,method="euclidean")
clusterClaims<-hclust(distances, method = "ward.D")
clusterGroups<- cutree(clusterClaims,k=9)
The clusterGroups labels are returned in the same order as the rows of claimsSample. So below I build a logical vector (TRUE/FALSE) that records whether the loss of each sampled observation is greater than the median loss of the sample, and tabulate it against the cluster labels:
results = table(clusterGroups,claims$loss[idx] > median(claims$loss[idx]))
clusterGroups FALSE TRUE
            1   816  621
            2   691  687
            3   405  382
            4   886 1055
            5   493  499
            6   249  256
            7   462  481
            8   530  502
            9   468  517
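If you need the counts as vectors rather than a printed table, the tabulation above can be indexed by column name, and prop.table gives the share of each cluster above the median. Here is a minimal sketch on synthetic data, so it runs without train.csv. The toy data frame, its 14 uniform features, the exponential loss and the names toyGroups, aboveMedian, nAbove, nNotAbove and shares are all made up for illustration and are not part of the Allstate dataset.

```r
# Synthetic stand-in for the real data (assumption: 14 features in [0, 1]
# plus a skewed, positive loss column).
set.seed(111)
n <- 500
toy <- as.data.frame(matrix(runif(n * 14), nrow = n))
toy$loss <- rexp(n, rate = 1 / 3000)

# Same clustering steps as above, on the 14 feature columns only.
toyClusters <- hclust(dist(toy[, 1:14], method = "euclidean"), method = "ward.D")
toyGroups <- cutree(toyClusters, k = 9)

# Flag losses above the sample median, then cross-tabulate by cluster.
aboveMedian <- toy$loss > median(toy$loss)
counts <- table(cluster = toyGroups, aboveMedian = aboveMedian)

# Per-cluster counts as named vectors.
nAbove <- counts[, "TRUE"]      # loss >  median
nNotAbove <- counts[, "FALSE"]  # loss <= median

# Row-wise proportions: share of each cluster on either side of the median.
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

With your own data you would replace toy$loss with claims$loss[idx] and toyGroups with clusterGroups. Note that the FALSE column counts losses at or below the median, so observations exactly equal to the median land there.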