r cluster-analysis

How to cluster by trend instead of by distance in R?


The k-medoids algorithm in the clara() function (from the cluster package) forms clusters based on the distance between rows, so I get this pattern:

library(cluster)  # provides clara()

a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1), byrow=TRUE, nrow=5)
cl <- clara(a,2)
matplot(t(a),type="b", pch=20, col=cl$clustering)

[Figure: clustering by clara()]

But I want to find a clustering method that assigns a cluster to each line according to its trend, so lines 1, 2 and 3 belong to one cluster and lines 4 and 5 to another.


Solution

  • This question might be better suited to stats.stackexchange.com, but here's a solution anyway.

    Your question is actually "How do I pick the right distance metric?". Instead of Euclidean distance between these vectors, you want a distance that measures similarity in trend.

    Here's one option:

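    # Standardize each row (mean 0, sd 1) so magnitude does not dominate
    a1 <- t(apply(a,1,scale))
    # Successive differences within each row capture the direction of change
    a2 <- t(apply(a1,1,diff))
    
    # Cluster the transformed rows, but plot the original data
    cl <- clara(a2,2)
    matplot(t(a),type="b", pch=20, col=cl$clustering)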
    

    Instead of defining a new distance metric, I've accomplished essentially the same thing by transforming the data. First, I scale each row so that we can compare relative trends without differences in scale throwing us off. Next, I convert each row to its successive differences.

    Warning: This is not necessarily going to work for all "trend" data. In particular, looking at successive differences only captures a single, limited facet of "trend". You may have to put some thought into more sophisticated metrics.
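
As one possible alternative metric, here is a minimal sketch that uses correlation between rows as the measure of similarity in trend. This is an assumption on my part rather than something the approach above uses: the dissimilarity is taken to be 1 minus the Pearson correlation, and it is passed to pam() (k-medoids from the same cluster package) because pam() accepts a precomputed dissimilarity object. The names d and cl_cor are only illustrative.

```r
library(cluster)  # provides pam()

# Same example data as in the question
a <- matrix(c(0,1,3,2,0,.32,1,.5,0,.35,1.2,.4,.5,.3,.2,.1,.5,.2,0,-.1),
            byrow = TRUE, nrow = 5)

# cor() correlates columns, so transpose to correlate the rows of a.
# 1 - correlation is near 0 for rows that move together and near 2
# for rows that move in opposite directions.
d <- as.dist(1 - cor(t(a)))

# k-medoids on the precomputed dissimilarities
cl_cor <- pam(d, k = 2)
print(cl_cor$clustering)
```

On this data, rows 1 to 3 end up in one cluster and rows 4 and 5 in the other. The result can be plotted the same way as before, with col=cl_cor$clustering. Note that correlation ignores both the level and the amplitude of each line, which may or may not match what "trend" means for your data.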