python scikit-learn nlp cluster-analysis dimensionality-reduction

Clustering of Tags

I have a dataset (~80k rows) that contains a comma-separated list of tags (skills), for example:

python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...

Some lists contain a single skill, while others contain 50 or more. I would like to cluster groups of skills together (intuitively, people in the same cluster would have a very similar set of skills).

First, I use CountVectorizer from sklearn to vectorise the lists of tags and perform dimensionality reduction using SVD, reducing the data to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: the skills clustered together seem to be largely unrelated.

How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.

Solution

I would start with the following approaches:

If you have enough data, try something like word2vec to get an embedding for each tag. You can use pre-trained models, but it is probably better to train on your own data, since it has its own semantics. Make sure you have an out-of-vocabulary (OOV) embedding for tags that don't appear often enough. Then use K-means, agglomerative hierarchical clustering, or another standard clustering method.
I would construct a weighted undirected graph, where each tag is a node and each edge weight is the number of times two tags appeared in the same list. Once the graph is constructed, I would use a community detection algorithm for clustering. NetworkX is a very nice Python library that lets you do that.
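
The graph approach could be sketched roughly as follows. This is only an illustration on toy data: the `tag_lists` values are made up, and `greedy_modularity_communities` is just one of several community detection algorithms that NetworkX offers, chosen here as an assumption rather than a recommendation.

```python
from collections import Counter
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in for the real data: one list of tags per row.
tag_lists = [
    ["python", "java", "javascript"],
    ["marketing", "communications", "leadership"],
    ["web development", "node.js", "react"],
    ["python", "javascript", "react"],
    ["marketing", "leadership"],
    ["java", "node.js", "web development"],
]

# Count how often each unordered pair of tags appears in the same list.
pair_counts = Counter()
for tags in tag_lists:
    for a, b in combinations(sorted(set(tags)), 2):
        pair_counts[(a, b)] += 1

# Each tag is a node; the edge weight is the co-occurrence count.
G = nx.Graph()
for (a, b), count in pair_counts.items():
    G.add_edge(a, b, weight=count)

# Modularity-based community detection that takes edge weights into account.
communities = greedy_modularity_communities(G, weight="weight")

# Map each tag to the index of its community.
tag_to_cluster = {
    tag: i for i, community in enumerate(communities) for tag in community
}
for i, community in enumerate(communities):
    print(i, sorted(community))
```

Note that this clusters tags rather than rows; a row could then be assigned to, for example, the community that most of its tags belong to.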

For any approach (including yours), don't give up before you do some hyper-parameter tuning. Maybe all you need is a smaller representation, or another K (for the KMeans).
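
As one possible way to tune K, you could scan a range of values and compare a cluster-quality score. The sketch below uses the silhouette score on synthetic data from `make_blobs`; both the metric and the data are assumptions for illustration, and in practice `X` would be your reduced feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the reduced feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit KMeans for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best K by silhouette:", best_k)
```

The same loop can be wrapped around the number of SVD components to check whether a smaller representation helps.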

Good luck!