Tags: c#, f#, hierarchical-clustering, accord.net

Find partial membership with KMeans clustering algorithm


I can calculate cluster membership with KMeans pretty easily:

open System
open System.IO
open Utils
open Accord
open Accord.Math 
open Accord.MachineLearning

let vals = [|
    [|1.0; 2.0; 3.0; 2.0|]
    [|1.1; 1.9; 3.1; 4.0|]
    [|2.0; 3.0; 4.0; 4.0|]    
    [|3.0; 3.1; 2.0; 3.0|]
    [|2.0; 4.0; 3.0; 6.0|]
    [|1.0; 5.0; 5.0; 7.0|]
    [|4.0; 3.0; 6.0; 8.0|]
    [|5.0; 4.0; 3.0; 6.0|]
    [|6.0; 4.0; 8.0; 7.0|]
    [|5.0; 6.0; 5.0; 9.0|]
    [|4.0; 2.0; 7.0; 8.0|]
    [|8.0; 9.0; 3.1; 2.2|]
    [|8.0; 9.0; 2.0; 2.0|]
    [|10.0; 2.0; 3.0; 2.0|]
    [|10.1; 1.9; 3.1; 4.0|]
    [|20.0; 3.0; 4.0; 4.0|]
    [|22.0; 7.0; 2.0; 3.0|]
    [|21.0; 4.0; 3.0; 6.0|]
|]

let kmeans = new KMeans 5
let clusterModel = kmeans.Learn vals
let clusters = clusterModel.Decide vals

Can I calculate partial membership with the standard KMeans algorithm? A coworker suggested using the means and variances of the cluster members to determine proportional membership, and today I've been looking into fuzzy sets and their implementations for F#. For example, here is some documentation for the Accord.NET implementation of fuzzy sets. I can translate and run the example in F#, but at first glance I don't see an easy way to get the data from my KMeans run above into the format needed for assigning partial membership.

Questions:

  1. How would I use the mean/variance of cluster members to calculate partial membership?

  2. Is there an easy way to calculate partial membership with KMeans clustering using the Accord.NET library?

  3. The KMeans algorithm in Accord.NET is simple to implement; should I spend some time trying to learn this method of clustering/membership to suit my problem, rather than trying to force KMeans clustering to suit my needs?


Solution

  • As mentioned by Tomas, Accord.NET already gives you many of the building blocks. In particular, calling clusterModel.Scores gives you the (negative) distances to the cluster centroids (see the source code).

    From the negative distances, you can compute an approximate class membership score by taking exponentials, similar to what you would do to compute a Gaussian PDF. In F#, that would look like:

    // Scores returns the negative distances between each point
    // and the cluster centroid
    let negDistances = clusterModel.Scores vals
    // Compute an estimated cluster assignment score
    let clusterMembership =
        negDistances
        |> Array.map (fun distances ->
            // Take the Exponential of the (negative) distances,
            // as in computing a Gaussian pdf
            let expDist = distances |> Array.map Math.Exp
            let total = Array.sum expDist
            expDist
            |> Array.map (fun d -> d/total)
        )
    

    There are a couple of caveats here:

    • Standard KMeans in Accord uses the Euclidean distance, which means every direction carries the same weight. Depending on the nature of your data, this may not produce reasonable results (picture two clusters, each shaped like a long cigar).
    • The class membership calculation above does not take cluster covariance into account either. To be closer to the truth, you would have to compute the Bhattacharyya distance, exponentiate, and then scale by the inverse of the determinant of the covariance matrix. This will fail for singleton clusters.
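    To address your first question, a rough way to fold per-cluster means and variances into the membership estimate is to score each point under a diagonal-covariance Gaussian fitted to each cluster, then normalize. This is only a sketch, not part of the Accord API: softMembership and the variance floor are illustrative choices of my own, and the hard labels are assumed to come from clusterModel.Decide.

```fsharp
// Rough partial membership using each cluster's per-dimension mean and
// variance (a diagonal-covariance Gaussian), instead of raw Euclidean
// distance. Pure F#; "softMembership" is an illustrative name.
let softMembership (points: float[][]) (labels: int[]) =
    let k = 1 + Array.max labels
    let dims = points.[0].Length
    // Per-cluster, per-dimension (mean, variance)
    let stats =
        [| for c in 0 .. k - 1 ->
             let members =
                 Array.zip labels points
                 |> Array.filter (fun (l, _) -> l = c)
                 |> Array.map snd
             [| for j in 0 .. dims - 1 ->
                  let col = members |> Array.map (fun p -> p.[j])
                  let mean = Array.average col
                  let var =
                      col
                      |> Array.averageBy (fun x -> (x - mean) ** 2.0)
                      |> max 1e-6   // floor to guard against singleton clusters
                  mean, var |] |]
    // Unnormalized log-density under each cluster's diagonal Gaussian,
    // then normalize so the memberships of each point sum to 1
    points
    |> Array.map (fun p ->
        let logd =
            stats
            |> Array.map (fun dimStats ->
                dimStats
                |> Array.mapi (fun j (m, v) ->
                    -0.5 * ((p.[j] - m) ** 2.0 / v + log v))
                |> Array.sum)
        let mx = Array.max logd
        let w = logd |> Array.map (fun l -> exp (l - mx))
        let total = Array.sum w
        w |> Array.map (fun x -> x / total))
```

    Because each dimension is divided by that cluster's own variance, an elongated cluster no longer penalizes points along its long axis as heavily as plain Euclidean distance would.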

    Regarding your third question: I would not re-implement. It may seem easy initially, but there are usually plenty of corner cases and stability issues that you only run into after some time.
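    As for the equal-weight caveat above, one cheap mitigation is to standardize each feature to zero mean and unit variance before clustering, so that no single direction dominates the Euclidean distance. A minimal sketch (standardize is an illustrative helper, not an Accord function):

```fsharp
// Rescale each feature (column) to zero mean and unit variance.
// Columns with zero variance are mapped to 0 to avoid division by zero.
let standardize (data: float[][]) =
    let nCols = data.[0].Length
    let means =
        [| for j in 0 .. nCols - 1 ->
             data |> Array.averageBy (fun row -> row.[j]) |]
    let stds =
        [| for j in 0 .. nCols - 1 ->
             let m = means.[j]
             data
             |> Array.averageBy (fun row -> (row.[j] - m) ** 2.0)
             |> sqrt |]
    data
    |> Array.map (fun row ->
        row
        |> Array.mapi (fun j x ->
            if stds.[j] > 0.0 then (x - means.[j]) / stds.[j] else 0.0))
```

    You would then feed kmeans.Learn the standardized data, e.g. kmeans.Learn (standardize vals), instead of the raw values.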