Need advice on a clustering model. There is a list of customers -> each customer has a list of products -> each product is described by several words. I want to cluster the customers into groups by type of activity, i.e. by the general topics of what they buy.
How would you turn this data into vectors for a clustering model such as K-means?
My hypothesis so far: turn every word into a fastText vector, select each customer's top 100 words by TF-IDF, and concatenate those 100 vectors of size 100 (the fastText dimension), giving 10,000 columns. Is there anything more computationally economical?
This is closely related to recommender systems. I'd recommend reading about content-based vs. collaborative-filtering recommenders; this blog post is a decent introduction.
You can cluster on many kinds of properties, and your proposed idea might work. But if you have domain knowledge about the products, appeal to that before reaching for word vectors. For example, suppose all the products are shelves. You could vectorize each product directly, say:
vec = [
    width,
    depth,
    height,
    width * depth,           # footprint/surface area is important on its own
    width * depth * height,  # volume
    color,                   # numeric representation
    popularity,              # possibly using a metric like sales
]
This is just an example, but it shows how you can directly vectorize your products without resorting to NLP.
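Feeding such hand-crafted vectors to K-means is then straightforward. A minimal sketch, with invented shelf data and a scaling step so no single feature (e.g. volume) dominates the distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: width, depth, height, color code, popularity (sales).
# Rows 0-1 are tall narrow shelves, rows 2-3 wide short ones.
shelves = np.array([
    [80, 30, 180, 1, 500],
    [75, 28, 175, 2, 450],
    [200, 60, 90, 1, 120],
    [210, 55, 85, 3, 100],
], dtype=float)

# Derived features, as in the vector above.
footprint = (shelves[:, 0] * shelves[:, 1]).reshape(-1, 1)
volume = (shelves[:, 0] * shelves[:, 1] * shelves[:, 2]).reshape(-1, 1)
X = np.hstack([shelves, footprint, volume])

# Standardize so every feature contributes comparably to the distance.
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

With this toy data the two shelf types land in separate clusters.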
If you can't think of a way to vectorize your products directly, and you don't/can't use collaborative filtering (because of the cold-start problem, perhaps), then you might look at vectorizing the entire product description with the Universal Sentence Encoder, which outputs 512-dimensional vectors regardless of input length.