
Text data clustering with Python


I am trying to cluster a list of sequences by their similarity using Python.

For example:

DFKLKSLFD

DLFKFKDLD

LDPELDKSL
...

I preprocess the data by computing the pairwise distances, for example with the Levenshtein distance. Once all pairwise distances are collected in a distance matrix, I want to use that matrix as input to the clustering algorithm.
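
The preprocessing step described above could be sketched roughly as follows. The helper names `levenshtein` and `distance_matrix` are made up for illustration, and the pure-Python implementation is only an assumption to keep the sketch free of dependencies; a library implementation would normally be faster.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix of pairwise Levenshtein distances (list of lists)."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
DISTANCES = distance_matrix(data)
```

The nested list can be wrapped in `numpy.array(DISTANCES)` if the clustering routine expects an array.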

I have already tried Affinity Propagation, but its convergence is somewhat unpredictable, and I would like to work around this problem.

Does anyone have any suggestions regarding other suitable clustering algorithms for this case?

Thank you!!


Solution

  • The scikit-learn documentation shows this very use case with DBSCAN, as Luke once answered here.

    The code below is based on that example and uses the python-Levenshtein package (!pip install python-Levenshtein). If you have already pre-calculated all distances, you can change the custom metric, as shown further down.

    from Levenshtein import distance
    
    import numpy as np
    from sklearn.cluster import dbscan
    
    data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
    
    def lev_metric(x, y):
        i, j = int(x[0]), int(y[0])     # extract indices
        return distance(data[i], data[j])
    
    X = np.arange(len(data)).reshape(-1, 1)
    
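    # algorithm='brute' is needed because the default assumes a continuous feature space
    dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')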
    

    If you have pre-calculated the distances, you could define pre_lev_metric(x, y) along these lines, where DISTANCES is your pre-calculated distance matrix:

    def pre_lev_metric(x, y):
        i, j = int(x[0]), int(y[0])     # extract indices
        return DISTANCES[i,j]
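
If the full distance matrix already exists, a custom metric may not be needed at all: scikit-learn's DBSCAN accepts metric="precomputed" and then treats its input as a square distance matrix. A minimal sketch, where the values in DISTANCES are illustrative placeholders and not real Levenshtein distances:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder distance matrix (symmetric, zero diagonal); in practice this
# would hold the pre-calculated pairwise Levenshtein distances.
DISTANCES = np.array([
    [0.0, 1.0, 9.0],
    [1.0, 0.0, 9.0],
    [9.0, 9.0, 0.0],
])

# With metric="precomputed", fit() receives the distance matrix itself
# instead of feature vectors or index rows.
model = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit(DISTANCES)
labels = model.labels_  # -1 marks points treated as noise
```

Here eps and min_samples are arbitrary choices for the toy matrix; for real data they would need tuning against the actual distance distribution.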
    

    An alternative is K-Medoids, available as sklearn_extra.cluster.KMedoids. K-Medoids is less well known, but it likewise needs only pairwise distances.

    I had to install it like this:

    !pip uninstall -y enum34
    !pip install scikit-learn-extra
    

    Then I was able to create clusters with:

    from sklearn_extra.cluster import KMedoids
    import numpy as np
    
    from Levenshtein import distance
    
    data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
    
    def lev_metric(x, y):
        i, j = int(x[0]), int(y[0])     # extract indices
        return distance(data[i], data[j])
    
    X = np.arange(len(data)).reshape(-1, 1)
    
    kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)
    

    The cluster labels and centers are then available in:

    kmedoids.labels_
    kmedoids.cluster_centers_
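
Because X holds row indices rather than the strings themselves, the fitted labels and centers still have to be mapped back to data. A small sketch of that step; the helper names group_by_label and center_strings are hypothetical, and the labels and centers values below are stand-ins for kmedoids.labels_ and kmedoids.cluster_centers_, not actual output:

```python
from collections import defaultdict

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]


def group_by_label(sequences, labels):
    """Collect the sequences that share a cluster label."""
    clusters = defaultdict(list)
    for sequence, label in zip(sequences, labels):
        clusters[int(label)].append(sequence)
    return dict(clusters)


def center_strings(sequences, centers):
    """Turn index-valued centers (rows like [i]) back into sequences."""
    return [sequences[int(row[0])] for row in centers]


# Stand-in values with the same shapes as kmedoids.labels_ and
# kmedoids.cluster_centers_; they are NOT real clustering output.
labels = [0, 0, 1]
centers = [[0], [2]]

clusters = group_by_label(data, labels)
medoids = center_strings(data, centers)
```

With a fitted model, the same helpers would be called as group_by_label(data, kmedoids.labels_) and center_strings(data, kmedoids.cluster_centers_).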