Search code examples
python-3.xnlpnltkcluster-analysis

Python: Clustering Search Keywords


I have a lot of "Search Keywords" for every product in the dataset. I try to cluster products according to their "Search Keywords".

What I'm looking to do is cluster these keywords into clusters of "similar meaning", and create a hierarchy of the clusters (structured in order of summed total number of searches per cluster).

An example cluster - "women's clothing" - would ideally contain keywords along these lines: women's clothing, 1000 ladies wear, 300 women's clothes, 50 ladies' clothing, 6 women wear, 2.

I'm a beginner in NLP. Do you have any suggestions of NLP techniques for this task? Any help will be highly appreciated :-)


Solution

  • I suggest to use some pretrained word vectors,fastText for example, so you don't have to worry with training and training data. What you would need to do:

    • Preproces your labels: Tokenize your labels: women's clothing -> ["women's", "clothing"]. see here
    • Lemmatize: ["women's", "clothing"] -> ["woman", "clothing"] see here
    • Calculate vector for each word: vec1 = model.get_word_vector("woman")
    • Average all the vectors for a given Label: avg= (vec1 + vec2)/2 These average vectors should represent your label. The average vectors of woman and clothing should be in the same region as the average of woman and wear. on the other hand the average vector of man and clothing should be in a different region in the vector space, so your preferred clustering algorithm shall catch it.