python scikit-learn cluster-analysis text-mining

Text data clustering with python

I am currently trying to cluster a list of sequences based on their similarity using Python.

ex:

DFKLKSLFD

DLFKFKDLD

LDPELDKSL
...

I preprocess my data by computing the pairwise distances between sequences, for example with the Levenshtein distance. Once I have all the pairwise distances in a distance matrix, I want to use that matrix as input for the clustering algorithm.
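For illustration, the preprocessing step described above could look roughly like this. It is a minimal sketch that uses a hand-written, pure-Python Levenshtein function so it runs without extra packages; the names levenshtein, distance_matrix and DISTANCES are assumptions made for this example, not part of any library.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix of pairwise Levenshtein distances (nested lists)."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The distance is symmetric, so compute each pair only once.
            matrix[i][j] = matrix[j][i] = levenshtein(sequences[i], sequences[j])
    return matrix


data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
DISTANCES = distance_matrix(data)
```

For long lists of sequences, a compiled implementation such as the python-Levenshtein package will be much faster than this pure-Python loop.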

I have already tried Affinity Propagation, but its convergence is a bit unpredictable and I would like to work around this problem.

Does anyone have any suggestions regarding other suitable clustering algorithms for this case?

Thank you!!

Solution

sklearn actually does show this example using DBSCAN, just like Luke once answered here.

This is based on that example, using !pip install python-Levenshtein. But if you have pre-calculated all distances, you could change the custom metric, as shown below.

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

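# dbscan returns the core sample indices and one label per sequence (-1 = noise)
core_samples, labels = dbscan(X, metric=lev_metric, eps=5, min_samples=2)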

And if you have pre-calculated the distances, you could define pre_lev_metric(x, y) along these lines, where DISTANCES is your distance matrix:

def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return DISTANCES[i,j]
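As a possible shortcut when the full matrix already exists, scikit-learn's DBSCAN also accepts metric="precomputed" and then takes the distance matrix itself as input, so no custom metric function is needed. The sketch below uses a made-up 3x3 matrix purely for illustration; the values of DISTANCES, eps and min_samples are assumptions, not results for the sequences above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy distance matrix (symmetric, zero diagonal).
# In practice this would hold your pre-calculated Levenshtein distances.
DISTANCES = np.array([
    [0, 1, 9],
    [1, 0, 9],
    [9, 9, 0],
], dtype=float)

# With metric="precomputed", fit_predict expects the square distance matrix.
model = DBSCAN(eps=2, min_samples=2, metric="precomputed")
labels = model.fit_predict(DISTANCES)

# Items 0 and 1 are within eps of each other and form a cluster;
# item 2 has no neighbours within eps and is labelled as noise (-1).
print(labels)
```

This avoids calling a Python function once per pair during clustering, which tends to matter as the number of sequences grows.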

An alternative is K-Medoids, using sklearn_extra.cluster.KMedoids. K-Medoids is not that well known yet, but like DBSCAN it only needs pairwise distances.

I had to install it like this:

!pip uninstall -y enum34
!pip install scikit-learn-extra

Then I was able to create clusters with:

from sklearn_extra.cluster import KMedoids
import numpy as np

from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels and cluster centers are available in:

kmedoids.labels_
kmedoids.cluster_centers_
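To turn the labels back into readable clusters, one option is to group the original sequences by label. This is a small sketch, not part of scikit-learn: group_by_label is a hypothetical helper, and the labels list is a hard-coded stand-in for kmedoids.labels_ (or the labels returned by dbscan) so that the snippet runs on its own.

```python
from collections import defaultdict


def group_by_label(sequences, labels):
    """Map each cluster label to the list of sequences assigned to it."""
    clusters = defaultdict(list)
    for sequence, label in zip(sequences, labels):
        clusters[int(label)].append(sequence)  # int() also handles NumPy integers
    return dict(clusters)


data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
labels = [0, 0, 1]  # hypothetical labels; use kmedoids.labels_ in practice

clusters = group_by_label(data, labels)
for label, members in sorted(clusters.items()):
    print(label, members)
```

Because X holds row indices rather than the strings themselves, the entries of kmedoids.cluster_centers_ should likewise be indices into data, so the medoid sequences can presumably be looked up the same way.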