Tags: python, scikit-learn, k-means, metrics, biopython

Correlation Distance Metric and Sum of Squared Errors


I couldn't find a way to use the correlation distance metric with K-Means in scikit-learn, which I need for my gene expression dataset.

But while searching the internet, I found this great library: biopython, whose K-Means implementation does support the correlation distance metric.

However, unlike scikit-learn, it does not expose the inertia / sum of squared errors (SSE), so I cannot choose the optimal number of clusters K with the 'Elbow Method'. The only available "error" value is the within-cluster sum of distances - not squared: https://biopython.org/docs/1.75/api/Bio.Cluster.html

How can I do both: use correlation distance metric and get the SSE?


Solution

  • The sum of squared errors is much easier to implement than the correlation distance metric, so I would advise you to use biopython together with the following helper function. It computes the SSE from the data (assumed to be a NumPy array) and the clusterid array returned by biopython.

    def SSE(data, clusterid):
        """
        Computes the sum of squared errors (SSE) of a clustering.

        Arguments:
            data: nrows x ncolumns array containing the data values.
            clusterid: array containing the index of the cluster to which
                each row was assigned by biopython.
        """
        number_of_clusters = int(clusterid.max()) + 1  # cluster indices start at 0

        sse = 0.0
        for i in range(number_of_clusters):
            cluster = data[clusterid == i]
            centroid = cluster.mean(axis=0)
            # Sum of squared distances of each row to its cluster centroid
            sse += ((cluster - centroid) ** 2).sum()
        return sse