python, nlp, spacy

Unsure how to get started using NLP to analyze user feedback


I have ~138k records of user feedback that I'd like to analyze to understand broad patterns in what our users are most often saying. Each record has a rating of 1-5 stars, so I don't need to do any sentiment analysis. I'm mostly interested in splitting the dataset into >= 4 stars to see what we're doing well and <= 3 stars to see what we need to improve upon.

One key problem I'm running into is that I expect to see a lot of n-grams. Some of these I know, like "HOV lane", "carpool lane", "detour time", "out of my way", etc. But I also want to detect common bi- and tri-grams programmatically. I've been playing around with spaCy a bit, but it doesn't seem to have any capability to do analysis at the corpus level, only at the document level.
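
For example, the closest I've gotten is aggregating per-document counts by hand. A rough sketch (`texts` is just a placeholder for my list of feedback strings):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

bigram_counts = Counter()
# nlp.pipe streams documents through the pipeline efficiently
for doc in nlp.pipe(texts):
    words = [t.lower_ for t in doc if t.is_alpha and not t.is_stop]
    # count adjacent word pairs within each document
    bigram_counts.update(zip(words, words[1:]))

print(bigram_counts.most_common(20))
```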

Ideally my pipeline would look something like this (I think); a rough code sketch follows the list:

  1. Import a list of known n-grams into the tokenizer

  2. Process each string into a tokenized document, removing punctuation, stopwords, etc., while respecting the known n-grams during tokenization (i.e., "HOV lane" should be a single noun token)

  3. Identify the most common bi- and tri-grams in the corpus that I missed

  4. Re-tokenize using the found n-grams

  5. Split by rating (>=4 and <=3)

  6. Find the most common topics for each split of data in the corpus
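
In code, I imagine something like the sketch below, cobbled together from the spaCy and gensim docs, though I'm not confident these are the right tools. It assumes spaCy v3 and gensim 4, and `records` is a hypothetical list of `(text, stars)` pairs standing in for my data:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.phrases import Phrases, Phraser

nlp = spacy.load("en_core_web_sm")

# Step 1: register the known n-grams with a PhraseMatcher.
known_ngrams = ["HOV lane", "carpool lane", "detour time", "out of my way"]
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("KNOWN", [nlp.make_doc(p) for p in known_ngrams])

# Step 2: merge matched phrases into single tokens, then drop
# punctuation and stopwords.
def tokenize(doc):
    spans = filter_spans(doc[start:end] for _, start, end in matcher(doc))
    with doc.retokenize() as retok:
        for span in spans:
            retok.merge(span)
    return [t.lower_.replace(" ", "_") for t in doc
            if not (t.is_punct or t.is_stop or t.is_space)]

docs = [tokenize(doc) for doc in nlp.pipe(text for text, _ in records)]

# Steps 3-4: learn frequent bi-/tri-grams from the corpus itself and
# re-apply them, so e.g. "toll road" becomes the single token "toll_road".
bigram = Phraser(Phrases(docs, min_count=20, threshold=10))
docs = [bigram[d] for d in docs]
trigram = Phraser(Phrases(docs, min_count=20, threshold=10))
docs = [trigram[d] for d in docs]

# Step 5: split by rating.
positive = [d for d, (_, stars) in zip(docs, records) if stars >= 4]
negative = [d for d, (_, stars) in zip(docs, records) if stars <= 3]

# Step 6: fit an LDA topic model to each split.
def top_topics(token_lists, num_topics=10):
    dictionary = Dictionary(token_lists)
    bow = [dictionary.doc2bow(d) for d in token_lists]
    lda = LdaModel(bow, num_topics=num_topics, id2word=dictionary, passes=5)
    return lda.print_topics()

print(top_topics(positive))
print(top_topics(negative))
```

(The `min_count`/`threshold` values for `Phrases` and the number of topics are guesses that would need tuning on real data.)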

I can't seem to find a single tool, or even a collection of tools, that will allow me to do what I want here. Am I approaching this the wrong way somehow? Any pointers on how to get started would be greatly appreciated!


Solution

  • Bingo! There are state-of-the-art results for your problem.

    It's called zero-shot learning: state-of-the-art NLP models can classify text without any annotated data.

    For code and details, read the blog post: https://joeddav.github.io/blog/2020/05/29/ZSL.html

    Let me know if it works for you, or if you need any other help.
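
    As a concrete starting point, here is a minimal sketch of the approach the post describes, using the Hugging Face transformers zero-shot pipeline. The model choice and candidate labels below are illustrative assumptions, not taken from the post:

    ```python
    from transformers import pipeline

    # Zero-shot classification: an NLI model scores each candidate label
    # without any task-specific training data.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    feedback = "The app routed me into the HOV lane even though I was alone."
    labels = ["routing", "traffic", "user interface", "pricing"]

    result = classifier(feedback, candidate_labels=labels)
    print(result["labels"][0], result["scores"][0])  # best label and its score
    ```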