Tags: r, cluster-analysis, hierarchical-clustering

How can I find out how many samples have loss > the median loss, and how many have loss < the median loss, using hierarchical clustering in R?


After hierarchical clustering in R, how can I find out how many samples in each cluster have a loss greater than the median loss, and how many have a loss below it? I am using the Allstate Claims Severity dataset. I think the numeric attributes are normalized, since they have values between 0 and 1.

This is my code:

claims<-read.csv("train.csv")
idx<-sample(1:dim(claims)[1],10000) #10000 random samples
claimsSample<-claims[idx,118:131] #retrieve the numeric features
distances<-dist(claimsSample,method="euclidean")
clusterClaims<-hclust(distances, method = "ward.D")
plot(clusterClaims)
clusterGroups<- cutree(clusterClaims,k=9)

So, how do I find the median and count the samples?


Solution

  • You should provide an example dataset, or point other SO users to the dataset of interest. "loss" can mean a lot of things...

    So we can try something like this:

    #claims = read.csv("https://raw.githubusercontent.com/Architectshwet/Allstate-Claims-Severity-Data/master/Datasets/train.csv")
    set.seed(111)
    idx<-sample(nrow(claims),10000) 
    claimsSample<-claims[idx,118:131] 
    distances<-dist(claimsSample,method="euclidean")
    clusterClaims<-hclust(distances, method = "ward.D")
    clusterGroups<- cutree(clusterClaims,k=9)
    

    cutree() returns the clusterGroups labels in the same order as the rows of claimsSample, so below I build a TRUE/FALSE vector indicating whether the loss of each sampled observation (claims$loss[idx], which is not part of claimsSample) is greater than the median loss of the sample, and tabulate it against the cluster labels:

    results = table(clusterGroups,claims$loss[idx] > median(claims$loss[idx]))
    results
    
    clusterGroups FALSE TRUE
                1   816  621
                2   691  687
                3   405  382
                4   886 1055
                5   493  499
                6   249  256
                7   462  481
                8   530  502
                9   468  517
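
The same tabulation can be tried without downloading the dataset. The following is a minimal sketch on synthetic data, not the Allstate file: the 14 uniform features, the exponentially distributed `loss` vector, the sample size `n = 200`, and `k = 4` are all assumptions chosen for illustration. It also adds row-wise proportions via `prop.table()`, which may be easier to compare across clusters of different sizes.

```r
# Synthetic stand-in for the Allstate data (assumption: 14 numeric
# features in [0, 1] and a skewed, positive loss column).
set.seed(111)
n <- 200
features <- as.data.frame(matrix(runif(n * 14), nrow = n))
loss <- rexp(n, rate = 1 / 3000)

# Same pipeline as above: distances, Ward clustering, cut into k groups.
groups <- cutree(hclust(dist(features), method = "ward.D"), k = 4)

# cutree() returns one label per row, in row order, so the labels
# can be compared element-wise with the loss vector.
stopifnot(length(groups) == nrow(features))

# TRUE when a sample's loss exceeds the median; ties count as FALSE.
above <- loss > median(loss)

# Counts per cluster, below-or-equal (FALSE) vs above (TRUE) the median.
counts <- table(cluster = groups, above_median = above)
print(counts)

# Row-wise shares: fraction of each cluster above the median loss.
print(round(prop.table(counts, margin = 1), 2))
```

Because the comparison uses `>`, any sample whose loss equals the median lands in the FALSE column. If those should be counted separately, one option is to tabulate `sign(loss - median(loss))` instead of the logical vector.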