I couldn't find a way to use the correlation distance metric with K-Means in scikit-learn, which I need for my gene expression dataset.
But while searching the internet I found this great library, biopython, which can run K-Means with the correlation distance metric.
However, unlike scikit-learn, it does not expose the inertia / sum of squared errors (SSE), so I cannot choose the optimal number of clusters K with the 'Elbow Method'. The only value available is the "error", which is the within-cluster sum of distances (not squared!): https://biopython.org/docs/1.75/api/Bio.Cluster.html
How can I do both: use the correlation distance metric and get the SSE?
The sum of squared errors is easier to implement yourself than the correlation distance metric, so I would advise using biopython together with the following helper function. It computes the SSE from the data (assumed to be a numpy array) and the clusterid array returned by biopython.
import numpy as np

def SSE(data, clusterid):
    """
    Computes the sum of squared errors of the clustering.

    Arguments:
    data: nrows x ncolumns array containing the data values.
    clusterid: array containing the number of the cluster to which
        each row was assigned by biopython.
    """
    number_of_clusters = int(clusterid.max()) + 1  # Python convention: first index is 0
    sse = 0.0
    for i in range(number_of_clusters):
        cluster = data[clusterid == i]
        centroid = cluster.mean(axis=0)  # per-column mean of the cluster's rows
        # squared Euclidean distance of every row to its cluster centroid
        sse += ((cluster - centroid) ** 2).sum()
    return sse
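As a quick sanity check, here is a small self-contained sketch on toy data, with the helper repeated so the snippet runs on its own and a hard-coded clusterid standing in for biopython's output (the data values and assignments are made up for illustration):

    import numpy as np

    def SSE(data, clusterid):
        """Sum of squared distances of each row to its cluster centroid."""
        sse = 0.0
        for i in range(int(clusterid.max()) + 1):
            cluster = data[clusterid == i]
            centroid = cluster.mean(axis=0)
            sse += ((cluster - centroid) ** 2).sum()
        return sse

    # Two well-separated 2-D clusters; clusterid mimics biopython's format.
    data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
    clusterid = np.array([0, 0, 1, 1])
    print(SSE(data, clusterid))  # each cluster contributes 2.0, so 4.0 total

In practice you would obtain clusterid from Bio.Cluster.kcluster (with dist='c' for correlation distance), compute SSE(data, clusterid) for each candidate K, and plot SSE against K to find the elbow.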