Tags: python, scikit-learn, k-means, metrics, biopython

Correlation Distance Metric and Sum of Squared Errors


I couldn't find a way to use the correlation distance metric with K-Means in scikit-learn, which I need for my gene expression dataset.

But while searching the internet, I found this great library: biopython, whose K-Means implementation does support the correlation distance metric.

However, unlike scikit-learn, it does not expose the inertia / sum of squared errors (SSE), so I cannot choose the optimal number of clusters K with the 'Elbow Method'. The only available "error" value is the within-cluster sum of distances - not squared: https://biopython.org/docs/1.75/api/Bio.Cluster.html

How can I do both: use correlation distance metric and get the SSE?


Solution

  • The sum of squared errors is much easier to implement than the correlation distance metric, so I would advise you to use biopython together with the following helper function. It computes the SSE from the data (assumed to be a NumPy array) and the clusterid array returned by biopython.

    def SSE(data, clusterid):
        """
        Computes the sum of squared errors (SSE) of a clustering.

        Arguments:
            data: nrows x ncolumns array containing the data values.
            clusterid: array containing the index of the cluster to which
                each row was assigned by biopython.
        """
        number_of_clusters = int(clusterid.max()) + 1  # cluster indices start at 0

        sse = 0.0
        for i in range(number_of_clusters):
            cluster = data[clusterid == i]
            centroid = cluster.mean(axis=0)
            # Sum of squared distances of each row to its cluster centroid
            sse += ((cluster - centroid) ** 2).sum()
        return sse