Tags: python, machine-learning, k-means

How to KMeans Cluster strings


I'm a data engineer with a limited understanding of ML methods, and I'm trying to work out a good strategy that I understand before I start coding. What I'm trying to do is create clusters out of key-value pairs, with the key being a name and the value being a list of strings.

The goal is to create natural clusters of the names, based on the similarity of their respective lists of strings.

import pandas as pd
df = pd.DataFrame({"name":['lion','leopard','racoon','possum'], 
"features":[
    ['mane', 'teeth', 'tail', 'carnivore'], 
    ['spots', 'teeth', 'tail', 'carnivore'], 
    ['stripes', 'teeth', 'omnivore', 'small'], 
    ['teeth', 'omnivore', 'small']]})
df

For example, in this dataset the natural grouping I would expect would be something like lion/leopard and racoon/possum, because of the similarity of the words teeth, tail and carnivore.

One approach I've already tried is to take one entry, e.g. lion, iterate through each of the other lists, and add to a similarity score whenever a value is found in another list. However, I would love to use something like a k-means clustering algorithm, partly to learn how to use it, but also because I think/hope it will provide a more meaningful series of clusters.
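
For reference, here is a rough sketch of that pairwise approach written as Jaccard similarity (shared features over total distinct features); my actual scoring was a plain overlap count, so the exact metric here is just illustrative:

from itertools import combinations

def jaccard(a, b):
    # similarity = shared features / total distinct features
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

for (n1, f1), (n2, f2) in combinations(zip(df['name'], df['features']), 2):
    print(n1, n2, round(jaccard(f1, f2), 2))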

I think where I'm getting hung up is the conversion of the text to numerical representations and how best to do that. After that, I feel like there are a few KMeans tutorials I could follow, but if anyone has any advice on how to approach this problem I'd be very interested.


Solution

  • There are a number of ways you could solve this problem, but in general the approach breaks down into a few straightforward steps -

    1. Vector representation - Use one-hot encoding or TF-IDF to represent sentences
    2. Feature extraction (optional) - In the case of large complex sentences, you may want to use a topic model to extract topic-level features.
    3. Clustering - Any clustering method can be used such as K-means

    Here is some sample code.

    1. Imports & Data

    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans
    import pandas as pd
    
    df = pd.DataFrame({"name":['lion','leopard','racoon','possum'], 
    "features":[
        ['mane', 'teeth', 'tail', 'carnivore'], 
        ['spots', 'teeth', 'tail', 'carnivore'], 
        ['stripes', 'teeth', 'omnivore', 'small'], 
        ['teeth', 'omnivore', 'small']]})
    
    print(df)
    
          name                           features
    0     lion     [mane, teeth, tail, carnivore]
    1  leopard    [spots, teeth, tail, carnivore]
    2   racoon  [stripes, teeth, omnivore, small]
    3   possum           [teeth, omnivore, small]
    

    2. Vector representation

    You can one-hot encode the feature lists using the MultiLabelBinarizer from sklearn

    mlb = MultiLabelBinarizer()
    vec = mlb.fit_transform(df['features'])
    vectors = pd.DataFrame(vec, columns=mlb.classes_)
    vectors
    
       carnivore  mane  omnivore  small  spots  stripes  tail  teeth
    0          1     1         0      0      0        0     1      1
    1          1     0         0      0      1        0     1      1
    2          0     0         1      1      0        1     0      1
    3          0     0         1      1      0        0     0      1
    

    OR you can use the TfidfVectorizer from sklearn

    tfidf = TfidfVectorizer()
    vec = tfidf.fit_transform(df['features'].apply(' '.join).to_list())
    vectors = pd.DataFrame(vec.toarray(), columns=tfidf.get_feature_names_out())
    print(vectors)
    
       carnivore      mane  omnivore     small     spots   stripes      tail  \
    0   0.497096  0.630504  0.000000  0.000000  0.000000  0.000000  0.497096   
    1   0.497096  0.000000  0.000000  0.000000  0.630504  0.000000  0.497096   
    2   0.000000  0.000000  0.497096  0.497096  0.000000  0.630504  0.000000   
    3   0.000000  0.000000  0.640434  0.640434  0.000000  0.000000  0.000000   
    
          teeth  
    0  0.329023  
    1  0.329023  
    2  0.329023  
    3  0.423897  
    

    3. Feature extraction (optional)

    Next, we can optionally use LDA from sklearn to create topic-level features for clustering in the next step. Note that you can use other dimensionality-reduction or decomposition methods here, but LDA is designed specifically for topic modelling and is highly interpretable (as shown below), so I am using it.

    Let's assume the data has 2 topics.

    #Using LDA to create topic level features
    
    lda = LatentDirichletAllocation(n_components=2, verbose=0)
    lda_features = lda.fit_transform(vec)
    lda_features
    
    array([[0.19035075, 0.80964925],
           [0.19035062, 0.80964938],
           [0.81496776, 0.18503224],
           [0.79598858, 0.20401142]])
    

    To see how LDA decided the topics, it's useful to inspect the topic-word matrix to understand their composition.

    #Topic-word matrix
    
    pd.DataFrame(lda.components_, 
                 index=['topic1', 'topic2'], 
                 columns=tfidf.get_feature_names_out()).round(1)
    
            carnivore  mane  omnivore  small  spots  stripes  tail  teeth
    topic1        0.5   0.5       1.6    1.6    0.5      1.1   0.5    1.3
    topic2        1.5   1.1       0.5    0.5    1.1      0.5   1.5    1.1
    

    As you can see, the words that represent the "carnivore" animals make up the second topic, while the words that represent the "omnivore" animals make up the first. Depending on your data and the number of latent topics it actually contains, it would be better to use a grid search to find the optimal number of topics for your model, as sketched below. Or you can simply make an assumption, as I have.
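
    A minimal sketch of such a grid search, using LDA's built-in score (an approximate log-likelihood) to pick the number of topics; the candidate values and the cv setting are illustrative assumptions for this toy dataset:

    from sklearn.model_selection import GridSearchCV
    
    #Search a few candidate topic counts; GridSearchCV uses LDA's
    #default score (approximate log-likelihood) to select the best
    search = GridSearchCV(LatentDirichletAllocation(random_state=0),
                          param_grid={'n_components': [2, 3, 4]},
                          cv=2) #only 4 samples in this toy dataset
    search.fit(vec)
    print(search.best_params_)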

    4. Clustering

    Finally, let's use k-means clustering to bucket the sentences by similarity in features.

    First, let's cluster WITHOUT using LDA.

    #Using k-means directly on the one-hot vectors OR Tfidf Vectors
    
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(vec)
    df['pred'] = kmeans.predict(vec)
    print(df)
    
          name                           features  pred
    0     lion     [mane, teeth, tail, carnivore]     0
    1  leopard    [spots, teeth, tail, carnivore]     0
    2   racoon  [stripes, teeth, omnivore, small]     1
    3   possum           [teeth, omnivore, small]     1
    

    Next, we do the same but this time using LDA features.

    # clustering the topic level features
    
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(lda_features)
    df['pred'] = kmeans.predict(lda_features)
    print(df)
    
          name                           features  pred
    0     lion     [mane, teeth, tail, carnivore]     0
    1  leopard    [spots, teeth, tail, carnivore]     0
    2   racoon  [stripes, teeth, omnivore, small]     1
    3   possum           [teeth, omnivore, small]     1
    

    NOTE: With any clustering, the cluster labels may change each time you rerun, but the clusters themselves don't change unless the data or parameters change. That is, you may sometimes see cluster 0 labeled as cluster 1, and vice versa.
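
    If you want to confirm that two runs found the same grouping despite flipped labels, one option is adjusted_rand_score, which is invariant to label permutation. A small sketch (the random_state values here are arbitrary):

    from sklearn.metrics import adjusted_rand_score
    
    #Two runs may swap the 0/1 labels, but the adjusted Rand index
    #is 1.0 whenever the underlying partitions match
    pred_a = KMeans(n_clusters=2, random_state=0).fit_predict(vec)
    pred_b = KMeans(n_clusters=2, random_state=1).fit_predict(vec)
    print(adjusted_rand_score(pred_a, pred_b))

    Passing a fixed random_state to KMeans also makes an individual run reproducible.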