nlp data-science topic-modeling cosine-similarity word-embedding

How to measure how distinct a document is based on predefined linguistic categories?

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number of words in each category, and calculating a proportion score for each category by converting the raw word counts into a percentage based on total words used in the text.

                 n-power   n-achieve  n-affiliation
Document1        0.010      0.025      0.100  
Document2        0.045      0.010      0.050
:                :          :          :
:                :          :          :
Document100000   0.100      0.020      0.010

For each document, I would like to get a measure of distinctiveness that indicates the degree to which the content of a document on the three psychological categories differs from the average content of all documents (i.e., the prototypical document in my sample). Is there a way to do this?

Solution

Essentially what you have is a clustering problem. You have already represented each document with 3 numbers; let's call that a vector (in effect, you have cooked up some embeddings). To do what you want, you can:

1) Calculate an average vector for the whole set: add up all the numbers in each column and divide by the number of documents.

2) Pick a metric you like that reflects how closely each document vector aligns with the average. You can use Euclidean distance (sklearn.metrics.pairwise.euclidean_distances) or cosine distance (sklearn.metrics.pairwise.cosine_distances). X will be your list of document vectors, and Y will be a list containing the single average vector.

This is a good place to start.
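The average-vector approach described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the three rows are the example values from the question's table, and the variable names (X, mean_vec, euclid, cosine) are my own choices.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Rows are documents; columns are n-power, n-achieve, n-affiliation.
# Only the three example rows from the question, for illustration.
X = np.array([
    [0.010, 0.025, 0.100],
    [0.045, 0.010, 0.050],
    [0.100, 0.020, 0.010],
])

# Step 1: the average ("prototypical") document, kept 2-D with shape (1, 3)
# because the pairwise functions expect 2-D inputs for both X and Y.
mean_vec = X.mean(axis=0, keepdims=True)

# Step 2: distance of every document to the average vector.
# Each call returns shape (n_documents, 1); ravel() flattens it to 1-D.
euclid = euclidean_distances(X, mean_vec).ravel()
cosine = cosine_distances(X, mean_vec).ravel()

for i, (e, c) in enumerate(zip(euclid, cosine), start=1):
    print(f"Document{i}: euclidean={e:.4f}  cosine={c:.4f}")
```

A larger value means the document sits further from the prototypical document. Note that the two metrics answer slightly different questions: Euclidean distance is sensitive to the overall magnitude of the proportions, while cosine distance only looks at the relative mix of the three categories.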

If I were doing this, I would skip the average-vector approach, since you are in fact dealing with a clustering problem. I would use KMeans instead; see the guide for more details.
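One possible way to turn KMeans into a distinctiveness score is to measure how far each document lies from its nearest cluster centre. The sketch below assumes that interpretation, which is my reading rather than something spelled out above. The data is synthetic stand-in data, and n_clusters=3 is an arbitrary assumption that you would need to tune for your own sample.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the real (n_documents, 3) proportion matrix.
rng = np.random.default_rng(0)
X = rng.random((200, 3)) * 0.1

# n_clusters=3 is an assumption for illustration, not a recommendation.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() gives the distance from each document to every cluster centre,
# with shape (n_documents, n_clusters).
dist_to_centroids = km.transform(X)

# Distance to the nearest centre: larger means less typical of any cluster.
dist_to_nearest = dist_to_centroids.min(axis=1)

# Indices of the five most distinctive documents under this definition.
most_distinct = np.argsort(dist_to_nearest)[::-1][:5]
print(most_distinct, dist_to_nearest[most_distinct])
```

Compared with a single average vector, this allows for several "prototypical" documents, so a document is only flagged as distinctive if it is far from all of them.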

Hope this helps!