Tags: nlp, datasets, spacy

How to identify words with the same meaning in order to reduce number of tags/categories/classes in a dataset


So here is an example of a column in my dataset:

"industries": ["Gaming", "fitness and wellness"]

The industries column has hundreds of different tags, some of which can have the same meaning, for example, some rows have: "Gaming" and some have "video games" and others "Games & consoles".

I'd like to "lemmatize" these tags so I could query the data and not worry about minute differences in the presentation (if they are basically the same).

What is the standard solution in this case?


Solution

  • I don't know that there is a "standard" solution, but I can suggest a few approaches, ranked by increasing depth of linguistic knowledge, going from the surface form of the words to their meaning.

    1. String matching
    2. Lemmatisation/stemming
    3. Word embedding vector distance

    String matching is based on calculating the difference between strings, measured by how many characters they share or how many editing steps it takes to transform one into the other. Levenshtein distance is one of the most common such metrics. However, depending on the size of your data, computing it pairwise over all tags can be inefficient; there are dedicated indexing approaches for efficiently finding the most similar strings in a large dataset.

    However, it might not be the most suitable one for your particular data set, as your similarities seem more semantic and less bound to the surface form of the words.
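To make the string-matching idea concrete, here is a minimal sketch using only the standard library. Note that `difflib.SequenceMatcher.ratio()` is a matching-blocks similarity, not true Levenshtein distance; for proper edit distance at scale you would reach for a library such as `rapidfuzz` or `python-Levenshtein`. The tag list below is taken from the question:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on longest matching blocks of characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

tags = ["Gaming", "video games", "Games & consoles", "fitness and wellness"]
query = "games"

# Rank existing tags by surface similarity to the query
ranked = sorted(tags, key=lambda t: similarity(query, t), reverse=True)
```

Here `ranked[0]` is `"video games"`, because it shares the whole substring "games", while a semantically close tag like "Gaming" scores lower: exactly the limitation discussed above.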

    Lemmatisation/stemming goes beyond the surface form by analysing the morphology of the words. In your example, gaming and games both have the same stem game, so you could base your similarity measure on matching stems. This can be better than pure string matching, since lemmatisation can also recognise that *go* and *went* are forms of the same word.
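A toy sketch of stem-based matching follows. The suffix rules and the tiny irregular-form table are deliberate simplifications made up for illustration; in practice you would use a real stemmer (e.g. NLTK's PorterStemmer) or a lemmatiser (e.g. spaCy's), which handle far more cases correctly:

```python
# Tiny lookup for irregular forms that suffix-stripping cannot relate
IRREGULAR = {"went": "go", "gone": "go"}

def toy_stem(word: str) -> str:
    """Very crude stemmer: strips -ing/-s and restores a final 'e' after -ing.
    Wrong for many English words (e.g. "running"); for illustration only."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3] + "e"   # "gaming" -> "game"
    elif w.endswith("s") and len(w) > 3:
        w = w[:-1]         # "games" -> "game"
    return w

# Two tags "match" if their stems agree
tags_match = toy_stem("Gaming") == toy_stem("games")
```

With these rules, "Gaming" and "games" both reduce to the stem "game" and therefore match, while "went" maps to "go" via the lookup table.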

    Word embeddings go beyond the surface form by encoding the meaning of a word through the contexts in which it appears, and so might find a semantic similarity between *health* and *fitness* that is not apparent from the surface at all! The similarity is measured as the cosine similarity between two word vectors, i.e. the cosine of the angle between them.
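The cosine computation itself is simple; what matters is where the vectors come from. The 3-d vectors below are invented stand-ins purely to show the arithmetic; in practice you would load pretrained embeddings, e.g. from spaCy's `en_core_web_md` model or a word2vec/GloVe file, where each word maps to a vector of a few hundred dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors standing in for real pretrained embeddings
vectors = {
    "health":  [0.9, 0.1, 0.2],
    "fitness": [0.8, 0.2, 0.1],
    "gaming":  [0.1, 0.9, 0.7],
}

sim_related   = cosine_similarity(vectors["health"], vectors["fitness"])  # high
sim_unrelated = cosine_similarity(vectors["health"], vectors["gaming"])   # low
```

A value near 1 means the vectors point in nearly the same direction (similar meaning), near 0 means they are roughly orthogonal (unrelated); with real embeddings you would merge tags whose similarity exceeds a chosen threshold.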

    It seems to me that the third approach might be most suitable for your data.