r cluster-analysis outliers

R cluster analysis with Ward's method: automatically deleting outliers


How can I reproduce in R cluster analyses that were done in SAS with method=Ward and the TRIM=10 option, which automatically deletes 10% of the cases as outliers? (The dataset has 45 variables, each with some outlying responses.)

When I searched for R cluster analysis using Ward's method, the trim option was described as something that shortens names rather than something that removes outliers.

If I don't trim the datasets before the cluster analysis, one big cluster emerges with lots of single-case "clusters" representing outlying individuals. With the outlying 10% of cases automatically removed, 3 or 4 meaningful clusters emerge. There are too many variables and cases for me to remove the outliers on a case-by-case basis.

Thanks!


Solution

  • You haven't said how you want to identify outliers. Assuming the simplest case, removing the top and bottom 5% of cases for every variable (i.e. on a variable-by-variable basis), you could do this with the quantile function.

    To illustrate with R's built-in faithful dataset, you could do something like:

    duration <- faithful$eruptions
    q <- quantile(duration, c(0.05, 0.95))  # 5th and 95th percentiles
    duration[duration > q[1] & duration <= q[2]]  # keep values between the cut-offs
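To extend the same idea to a whole data frame and then run Ward clustering, something along these lines might work. This is only a sketch under stated assumptions: `dat` is made-up random data standing in for the real 45-variable dataset, and `in_range`, `keep`, `trimmed` and `k = 3` are illustrative choices. The per-variable percentile rule is a stand-in, not a reimplementation of SAS's TRIM= option. `"ward.D2"` is the `hclust` method that R's documentation describes as implementing Ward's criterion, but I have not checked that it matches SAS's output exactly.

```r
# Hypothetical data: 200 cases, 5 numeric variables (stand-in for the real dataset)
set.seed(1)
dat <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5))

# Logical matrix: is each value inside its own variable's 5th-95th percentile range?
in_range <- sapply(dat, function(x) {
  q <- quantile(x, c(0.05, 0.95), na.rm = TRUE)
  x > q[1] & x <= q[2]
})

# Keep a case only if it is in range on every variable
keep <- rowSums(in_range) == ncol(dat)
trimmed <- dat[keep, ]

# Ward clustering on the standardized, trimmed data
hc <- hclust(dist(scale(trimmed)), method = "ward.D2")
groups <- cutree(hc, k = 3)  # k = 3 is an assumption; choose it from the dendrogram
table(groups)
```

Note that dropping a case whenever any single variable is out of range can remove far more than 10% of the cases when there are many variables, so with 45 variables you would probably want narrower cut-offs or a single per-case outlier score instead.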