I have many thousands of elements like these:
["business_id_a", ["Food", "Restaurant", "Wine & Pizza"]]
["business_id_b", ["Mexican", "Burgers", "Gastropubs & Wine"]]
...
["business_id_k", ["Automotive", "Delivery", "Whatever"]]
I want to cluster the business_ids using k-means, grouping them by category.
Maybe it's not the best option. My idea is to build a kind of dictionary of categories: first group all possible categories in some way, and then use the model to group the samples, i.e. sets of business_ids per cluster of categories.
Can this work? What is the best way to do this in Python?
The best option would be to tokenize and vectorize the text first. You can tokenize with NLTK's word tokenizer: https://www.nltk.org/api/nltk.tokenize.html
Then you can vectorize using something like sklearn's CountVectorizer or TfidfVectorizer.
From there, you can apply k-means.
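A minimal sketch of that vectorize-then-cluster pipeline, using scikit-learn's TfidfVectorizer and KMeans. The sample data and the choice of n_clusters=2 are just illustrative assumptions; in practice you'd tune the cluster count (e.g. with silhouette scores):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy data in the shape described in the question
data = [
    ("business_id_a", ["Food", "Restaurant", "Wine & Pizza"]),
    ("business_id_b", ["Mexican", "Burgers", "Gastropubs & Wine"]),
    ("business_id_k", ["Automotive", "Delivery", "Whatever"]),
    ("business_id_m", ["Food", "Burgers", "Restaurant"]),
]

ids = [business_id for business_id, _ in data]
# Join each category list into one string so the vectorizer can tokenize it
docs = [" ".join(categories) for _, categories in data]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: businesses x category terms

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Group business_ids by their assigned cluster
clusters = {}
for business_id, label in zip(ids, labels):
    clusters.setdefault(label, []).append(business_id)
print(clusters)
```

Note that TfidfVectorizer does its own word-level tokenization by default, so for simple category strings like these you may not need NLTK at all; reach for NLTK's tokenizers if you need finer control over how terms are split.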