Tags: python, machine-learning, k-means

How to KMeans Cluster strings


I'm a data engineer with a limited understanding of ML methods, and I'm trying to work out a good strategy that I understand before I start coding. What I'm trying to do is create clusters out of key-value pairs, with the key being a name and the value being a list of strings.

The goal is to create natural clusters of the names, based on the similarity of their respective lists of strings.

import pandas as pd
df = pd.DataFrame({"name":['lion','leopard','racoon','possum'], 
"features":[
    ['mane', 'teeth', 'tail', 'carnivore'], 
    ['spots', 'teeth', 'tail', 'carnivore'], 
    ['stripes', 'teeth', 'omnivore', 'small'], 
    ['teeth', 'omnivore', 'small']]})
df

For example, in this dataset the natural grouping I would expect would be something like lion/leopard and racoon/possum, because of the similarity of the words teeth, tail and carnivore.

One approach I've already tried is to take one entry, e.g. lion, iterate through each of the other lists, and add to a similarity score whenever a value is found in another list. However, I would love to use something like a k-means clustering algorithm, partly to learn how to use it, but also because I think/hope it will provide a more meaningful series of clusters.
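
For reference, here is a rough sketch of that pairwise approach written as Jaccard similarity (shared features over total distinct features); my actual scoring was a plain overlap count, so the exact metric here is just illustrative:

from itertools import combinations

def jaccard(a, b):
    # similarity = shared features / total distinct features
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

for (n1, f1), (n2, f2) in combinations(zip(df['name'], df['features']), 2):
    print(n1, n2, round(jaccard(f1, f2), 2))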

I think where I'm getting hung up is the conversion of the text to numerical representations and how best to do that. After that, I feel like there are a few KMeans tutorials I could follow, but if anyone has any advice on how to approach this problem I'd be very interested.


Solution

  • There are a number of ways you could solve this problem, but in general the approach breaks down into a few straightforward steps -

    1. Vector representation - Use one-hot encoding or TF-IDF to represent sentences
    2. Feature extraction (optional) - In the case of large complex sentences, you may want to use a topic model to extract topic-level features.
    3. Clustering - Any clustering method can be used such as K-means

    Here is some sample code.

    1. Imports & Data

    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans
    import pandas as pd
    
    df = pd.DataFrame({"name":['lion','leopard','racoon','possum'], 
    "features":[
        ['mane', 'teeth', 'tail', 'carnivore'], 
        ['spots', 'teeth', 'tail', 'carnivore'], 
        ['stripes', 'teeth', 'omnivore', 'small'], 
        ['teeth', 'omnivore', 'small']]})
    
    print(df)
    
          name                           features
    0     lion     [mane, teeth, tail, carnivore]
    1  leopard    [spots, teeth, tail, carnivore]
    2   racoon  [stripes, teeth, omnivore, small]
    3   possum           [teeth, omnivore, small]
    

    2. Vector representation

    You can one-hot encode the feature lists using the MultiLabelBinarizer from sklearn

    mlb = MultiLabelBinarizer()
    vec = mlb.fit_transform(df['features'])
    vectors = pd.DataFrame(vec, columns=mlb.classes_)
    vectors
    
       carnivore  mane  omnivore  small  spots  stripes  tail  teeth
    0          1     1         0      0      0        0     1      1
    1          1     0         0      0      1        0     1      1
    2          0     0         1      1      0        1     0      1
    3          0     0         1      1      0        0     0      1
    

    OR you can use the TfidfVectorizer from sklearn

    tfidf = TfidfVectorizer()
    vec = tfidf.fit_transform(df['features'].apply(' '.join).to_list())
    vectors = pd.DataFrame(vec.toarray(), columns=tfidf.get_feature_names_out())
    print(vectors)
    
       carnivore      mane  omnivore     small     spots   stripes      tail  \
    0   0.497096  0.630504  0.000000  0.000000  0.000000  0.000000  0.497096   
    1   0.497096  0.000000  0.000000  0.000000  0.630504  0.000000  0.497096   
    2   0.000000  0.000000  0.497096  0.497096  0.000000  0.630504  0.000000   
    3   0.000000  0.000000  0.640434  0.640434  0.000000  0.000000  0.000000   
    
          teeth  
    0  0.329023  
    1  0.329023  
    2  0.329023  
    3  0.423897  
    

    3. Feature extraction (optional)

    Next, we can optionally use LDA from sklearn to create topic-level features for clustering in the next step. Note that you can use other dimensionality-reduction or decomposition methods here, but LDA is designed specifically for topic modelling and is highly interpretable (as shown below), so I am using it.

    Let's assume the data has 2 topics.

    #Using LDA to create topic level features
    
    lda = LatentDirichletAllocation(n_components=2, verbose=0)
    lda_features = lda.fit_transform(vec)
    lda_features
    
    array([[0.19035075, 0.80964925],
           [0.19035062, 0.80964938],
           [0.81496776, 0.18503224],
           [0.79598858, 0.20401142]])
    

    To see how LDA decided the topics, it's useful to inspect the topic-word matrix to understand their composition.

    #Topic-word matrix
    
    pd.DataFrame(lda.components_, 
                 index=['topic1', 'topic2'], 
                 columns=tfidf.get_feature_names_out()).round(1)
    
            carnivore  mane  omnivore  small  spots  stripes  tail  teeth
    topic1        0.5   0.5       1.6    1.6    0.5      1.1   0.5    1.3
    topic2        1.5   1.1       0.5    0.5    1.1      0.5   1.5    1.1
    

    As you can see, the words that represent the "carnivore" animals make up the second topic, while the words that represent the "omnivore" animals make up the first. Depending on your data and the number of latent topics it actually contains, it would be better to use a grid search to find the optimal number of topics for your model, as sketched below. Or you can simply make an assumption, as I have.
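
    A minimal sketch of such a grid search, using LDA's built-in score (an approximate log-likelihood) to pick the number of topics; the candidate values and the cv setting are illustrative assumptions for this toy dataset:

    from sklearn.model_selection import GridSearchCV
    
    #Search a few candidate topic counts; GridSearchCV uses LDA's
    #default score (approximate log-likelihood) to select the best
    search = GridSearchCV(LatentDirichletAllocation(random_state=0),
                          param_grid={'n_components': [2, 3, 4]},
                          cv=2) #only 4 samples in this toy dataset
    search.fit(vec)
    print(search.best_params_)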

    4. Clustering

    Finally, let's use k-means clustering to bucket the sentences by similarity in features.

    First, let's cluster WITHOUT using LDA.

    #Using k-means directly on the one-hot vectors OR Tfidf Vectors
    
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(vec)
    df['pred'] = kmeans.predict(vec)
    print(df)
    
          name                           features  pred
    0     lion     [mane, teeth, tail, carnivore]     0
    1  leopard    [spots, teeth, tail, carnivore]     0
    2   racoon  [stripes, teeth, omnivore, small]     1
    3   possum           [teeth, omnivore, small]     1
    

    Next, we do the same but this time using LDA features.

    # clustering the topic level features
    
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(lda_features)
    df['pred'] = kmeans.predict(lda_features)
    print(df)
    
          name                           features  pred
    0     lion     [mane, teeth, tail, carnivore]     0
    1  leopard    [spots, teeth, tail, carnivore]     0
    2   racoon  [stripes, teeth, omnivore, small]     1
    3   possum           [teeth, omnivore, small]     1
    

    NOTE: With any clustering, the cluster labels may change each time you rerun, but the clusters themselves don't change unless the data or parameters change. That is, you may sometimes see cluster 0 labeled as cluster 1, and vice versa.
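
    If you want to confirm that two runs found the same grouping despite flipped labels, one option is adjusted_rand_score, which is invariant to label permutation. A small sketch (the random_state values here are arbitrary):

    from sklearn.metrics import adjusted_rand_score
    
    #Two runs may swap the 0/1 labels, but the adjusted Rand index
    #is 1.0 whenever the underlying partitions match
    pred_a = KMeans(n_clusters=2, random_state=0).fit_predict(vec)
    pred_b = KMeans(n_clusters=2, random_state=1).fit_predict(vec)
    print(adjusted_rand_score(pred_a, pred_b))

    Passing a fixed random_state to KMeans also makes an individual run reproducible.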