I have a dataset (~80k rows) in which each row contains a comma-separated list of tags (skills), for example:
python, java, javascript,
marketing, communications, leadership,
web development, node.js, react
...
Some rows contain just 1 skill, others 50 or more. I would like to cluster groups of skills together (intuitively, people in the same cluster would have a very similar set of skills).
First, I use CountVectorizer from sklearn to vectorise the list of words, then perform dimensionality reduction using SVD, reducing it to 50 dimensions (from 500+). Finally, I perform KMeans clustering with n=50, but the results are not optimal: the skills clustered together seem to be largely unrelated.
How should I go about improving the results? I'm also not sure if SVD is the most appropriate form of dimension reduction for this use case.
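For reference, the pipeline described in the question might look roughly like the sketch below. The question does not show any code, so the details are assumptions: the helper names `split_tags` and `cluster_skills` are hypothetical, tags are split on commas (so multi-word tags such as "web development" stay intact), `binary=True` records only presence, and scikit-learn's `TruncatedSVD` stands in for "SVD". The toy rows use 2 components and 2 clusters only so the example runs on six rows.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


def split_tags(row):
    """Split one comma-separated row into clean, lower-cased tags (assumed format)."""
    return [tag.strip().lower() for tag in row.split(",") if tag.strip()]


def cluster_skills(rows, n_components=50, n_clusters=50, seed=0):
    # One column per distinct tag; binary=True keeps presence/absence only.
    vectorizer = CountVectorizer(
        tokenizer=split_tags, token_pattern=None, lowercase=False, binary=True
    )
    counts = vectorizer.fit_transform(rows)

    # TruncatedSVD accepts the sparse matrix directly.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(counts)

    # KMeans on the reduced representation; one cluster label per row.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)
    return vectorizer, reduced, labels


# Toy data in the same shape as the example rows in the question.
rows = [
    "python, java, javascript,",
    "marketing, communications, leadership,",
    "web development, node.js, react",
    "python, javascript, react",
    "marketing, leadership",
    "java, python",
]
vectorizer, reduced, labels = cluster_skills(rows, n_components=2, n_clusters=2)
```

Splitting on commas rather than relying on the default word tokenizer matters here: the default would break "web development" into two unrelated tokens and "node.js" into "node" and "js".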
I would start with the following approaches:
For any approach (including yours), don't give up before you do some hyper-parameter tuning. Maybe all you need is a smaller representation, or another K (for the KMeans).
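As a rough illustration of tuning K, the sketch below tries several values and keeps the one with the highest silhouette score. The silhouette score is only one possible selection criterion and is an assumption here, not something the answer prescribes; the helper name `pick_k` and the synthetic three-blob data are hypothetical and exist only to make the example runnable. The same loop could be repeated over the number of SVD components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(features, candidate_ks, seed=0):
    """Fit KMeans for each candidate K and return the K with the best silhouette."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        cluster_labels = model.fit_predict(features)
        # Silhouette is in [-1, 1]; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(features, cluster_labels)
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic stand-in for the reduced skill vectors: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = pick_k(features, candidate_ks=[2, 3, 4, 5])
print(best_k, scores)
```

On 80k rows, computing the silhouette on the full dataset can be slow; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.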
Good luck!