I'm a data engineer with a limited understanding of ML methods, and I'm trying to settle on a strategy I understand before I start coding. What I'm trying to do is create clusters out of key-value pairs, with the key being a name and the value being a list of strings.
The goal is to create natural clusters of the names, based on the similarity of their respective lists of strings.
import pandas as pd

df = pd.DataFrame({
    "name": ['lion', 'leopard', 'racoon', 'possum'],
    "features": [
        ['mane', 'teeth', 'tail', 'carnivore'],
        ['spots', 'teeth', 'tail', 'carnivore'],
        ['stripes', 'teeth', 'omnivore', 'small'],
        ['teeth', 'omnivore', 'small'],
    ],
})
df
For example, in this dataset the natural grouping I would expect would be something like lion/leopard and racoon/possum, because of the similarity of the words teeth, tail, and carnivore.
One way I've done this already is to compare one entry (e.g. lion) against the others: iterate through each list, and whenever a value is found in another list, add to a similarity score (roughly the sketch below). However, I would love to use something like a k-means clustering algorithm, partly just to learn how to use it, but also because I think/hope it will provide a more meaningful set of clusters.
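Roughly, the scoring I have now looks something like this (a minimal sketch using Jaccard similarity over the feature sets; the jaccard helper is my own):

def jaccard(a, b):
    # Overlap of two feature lists: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Score every pair of names by the overlap of their feature lists
for i, row_i in df.iterrows():
    for j, row_j in df.iterrows():
        if i < j:
            print(row_i['name'], row_j['name'],
                  round(jaccard(row_i['features'], row_j['features']), 2))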
I think where I'm getting hung up is the conversion of the text to numerical representations, and how best to do that. After that, I feel like there are a few KMeans tutorials I could probably follow, but if anyone has any advice on how to approach this problem I'd be very interested.
There are a number of ways you could solve this problem, but in general here are a few straightforward alternatives. Here is some sample code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
import pandas as pd
df = pd.DataFrame({
    "name": ['lion', 'leopard', 'racoon', 'possum'],
    "features": [
        ['mane', 'teeth', 'tail', 'carnivore'],
        ['spots', 'teeth', 'tail', 'carnivore'],
        ['stripes', 'teeth', 'omnivore', 'small'],
        ['teeth', 'omnivore', 'small'],
    ],
})
print(df)
name features
0 lion [mane, teeth, tail, carnivore]
1 leopard [spots, teeth, tail, carnivore]
2 racoon [stripes, teeth, omnivore, small]
3 possum [teeth, omnivore, small]
You can one-hot encode the feature lists using MultiLabelBinarizer from sklearn:
# One-hot encode each feature list; columns are the unique feature words
mlb = MultiLabelBinarizer()
vec = mlb.fit_transform(df['features'])
vectors = pd.DataFrame(vec, columns=mlb.classes_)
vectors
carnivore mane omnivore small spots stripes tail teeth
0 1 1 0 0 0 0 1 1
1 1 0 0 0 1 0 1 1
2 0 0 1 1 0 1 0 1
3 0 0 1 1 0 0 0 1
Or you can use TfidfVectorizer from sklearn:
# Join each feature list into one string, then tf-idf vectorize it
tfidf = TfidfVectorizer()
vec = tfidf.fit_transform(df['features'].apply(' '.join).to_list())
vectors = pd.DataFrame(vec.toarray(), columns=tfidf.get_feature_names_out())
print(vectors)
   carnivore      mane  omnivore     small     spots   stripes      tail     teeth
0   0.497096  0.630504  0.000000  0.000000  0.000000  0.000000  0.497096  0.329023
1   0.497096  0.000000  0.000000  0.000000  0.630504  0.000000  0.497096  0.329023
2   0.000000  0.000000  0.497096  0.497096  0.000000  0.630504  0.000000  0.329023
3   0.000000  0.000000  0.640434  0.640434  0.000000  0.000000  0.000000  0.423897
Next, we can optionally use LDA from sklearn to create topic-level features for clustering in the next step. Note that you could use other dimensionality reduction or decomposition methods here (see the TruncatedSVD sketch after the LDA output below), but LDA is specifically designed for topic modelling and is highly interpretable (as shown below), so I am using that.
Let's assume the data has 2 topics.
# Using LDA to create topic-level features
lda = LatentDirichletAllocation(n_components=2, verbose=0)
lda_features = lda.fit_transform(vec)
lda_features
array([[0.19035075, 0.80964925],
[0.19035062, 0.80964938],
[0.81496776, 0.18503224],
[0.79598858, 0.20401142]])
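As an aside, if you want to try a different decomposition at this step, a minimal sketch with TruncatedSVD (LSA) on the same vectors could look like this (the variable names are my own):

from sklearn.decomposition import TruncatedSVD

# Alternative to LDA: latent semantic analysis over the same vectors.
# n_components=2 mirrors the two assumed topics; random_state pins the result.
svd = TruncatedSVD(n_components=2, random_state=0)
svd_features = svd.fit_transform(vec)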
To see how LDA decided the topics, it's useful to check the topic-word matrix to understand the composition of the topics.
# Topic-word matrix
pd.DataFrame(lda.components_,
             index=['topic1', 'topic2'],
             columns=tfidf.get_feature_names_out()).round(1)
carnivore mane omnivore small spots stripes tail teeth
topic1 0.5 0.5 1.6 1.6 0.5 1.1 0.5 1.3
topic2 1.5 1.1 0.5 0.5 1.1 0.5 1.5 1.1
As you can see, the words that represent the "carnivore" animals make up the second topic, while the words that represent the "omnivore" animals make up the first. Depending on your data and the complexity of its content (i.e. how many latent topics it actually contains), it would be better to use a grid search to find the optimal number of topics for your model. Or, you can just make an assumption as I have.
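A minimal sketch of such a grid search (assuming you rank candidates by LDA's built-in approximate log-likelihood score; the parameter grid and cv value are illustrative):

from sklearn.model_selection import GridSearchCV

# Search over the number of topics; LDA's score method
# (approximate log-likelihood) ranks the candidates by default.
search = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={'n_components': [2, 3, 4]},
    cv=2,  # this toy dataset has only 4 rows, so keep the folds small
)
search.fit(vec)
print(search.best_params_)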
Finally, let's use k-means clustering to bucket the names by the similarity of their features.
First, let's cluster WITHOUT using LDA.
# Using k-means directly on the one-hot vectors or the tf-idf vectors
kmeans = KMeans(n_clusters=2)
kmeans.fit(vec)
df['pred'] = kmeans.predict(vec)
print(df)
name features pred
0 lion [mane, teeth, tail, carnivore] 0
1 leopard [spots, teeth, tail, carnivore] 0
2 racoon [stripes, teeth, omnivore, small] 1
3 possum [teeth, omnivore, small] 1
Next, we do the same but this time using LDA features.
# Clustering the topic-level features
kmeans = KMeans(n_clusters=2)
kmeans.fit(lda_features)
df['pred'] = kmeans.predict(lda_features)
print(df)
name features pred
0 lion [mane, teeth, tail, carnivore] 0
1 leopard [spots, teeth, tail, carnivore] 0
2 racoon [stripes, teeth, omnivore, small] 1
3 possum [teeth, omnivore, small] 1
NOTE: With any clustering, the label values tend to change each time you rerun, but the cluster memberships themselves don't change unless the data or parameters do. That is, you may sometimes see cluster 0 labeled as cluster 1 and vice versa.
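If you want to verify that two runs agree up to that relabeling, one option (a small sketch) is the adjusted Rand index, which is invariant to label permutation:

from sklearn.metrics import adjusted_rand_score

# Two runs may swap the 0/1 labels, but if the grouping is identical
# the adjusted Rand index between the two label vectors is 1.0.
labels_a = KMeans(n_clusters=2, n_init=10).fit_predict(vec)
labels_b = KMeans(n_clusters=2, n_init=10).fit_predict(vec)
print(adjusted_rand_score(labels_a, labels_b))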