Search code examples
pythonscikit-learnnlpk-means

How to perform K- means clustering over text in python?


I have a thousand of thousands of elements like these:

[ "business_id_a", [ "Food", "Restaurant","Wine & Pizza"] ] 
[ "business_id_b", ["Mexican", "Burgers", "Gastropubs & Wine" ] ]
... 

[ "business_id_k", ["Automotive", "Delivery","Whatever"] ]

I want to cluster the business_id using k-means grouping theme by category.

Maybe it not the best option. My idea is to create a kind of Dictionary of Categories, and do it by grouping first all possible categories in any way and then, using the model, grouping the samples as group of business_id by cluster of categories.

Can this work? Which is the best way to do that in Python?


Solution

  • the best option would be to tokenize and vectorize the text first. You can tokenize with NLTK's word tokenizer https://www.nltk.org/api/nltk.tokenize.html

    then you can vectorize using something like sklearn's CountVectorizer or TFIDFVectorizer

    from there, you can apply k-means