Tags: python, numpy, scipy, term-document-matrix

Term document matrix and cosine similarity in Python


I have the following situation that I want to address using Python (preferably using numpy and scipy):

  1. A collection of documents that I want to convert to a sparse term-document matrix.
  2. Extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a certain subset of documents (documents are labelled with categories, and I want to find similar documents within the same category).

How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.


Solution

  • May I recommend you take a look at scikit-learn? It is a very well-regarded library in the Python community, with a simple and consistent API, and it also implements a cosine similarity metric. This is an example, taken from here, of how you could do it in 3 lines of code. Because TfidfVectorizer L2-normalises each row by default, multiplying the matrix by its transpose yields the pairwise cosine similarities:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    
    >>> vect = TfidfVectorizer(min_df=1)
    >>> tfidf = vect.fit_transform(["I'd like an apple",
    ...                             "An apple a day keeps the doctor away",
    ...                             "Never compare an apple to an orange",
    ...                             "I prefer scikit-learn to Orange"])
    >>> (tfidf @ tfidf.T).toarray()  # @ and .toarray() also work on newer SciPy sparse types
    array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
           [ 0.25082859,  1.        ,  0.22057609,  0.        ],
           [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
           [ 0.        ,  0.        ,  0.26264139,  1.        ]])
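
The example above computes every pairwise similarity. For the second part of the question (top 10 similar documents within the same category, and single rows as COO vectors), a minimal sketch might look like the following. The `categories` array, its toy labels, and the helper `top_k_in_category` are hypothetical names introduced here for illustration; they are not part of scikit-learn. The sketch assumes the category labels are held in a NumPy array aligned with the document order.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus from the example above; the category labels are
# hypothetical and exist only to illustrate the filtering step.
docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
categories = np.array(["fruit", "fruit", "fruit", "software"])

# fit_transform returns a SciPy sparse matrix (documents x terms).
# Rows are L2-normalised by default (norm="l2"), so the dot product
# of two rows is their cosine similarity.
tfidf = TfidfVectorizer().fit_transform(docs)


def top_k_in_category(query_idx, k=10):
    """Return (doc_index, similarity) pairs for the k documents most
    similar to docs[query_idx] among those sharing its category."""
    # Indices of the *other* documents in the same category.
    candidates = np.flatnonzero(categories == categories[query_idx])
    candidates = candidates[candidates != query_idx]
    if candidates.size == 0:
        return []
    # (1 x n_terms) @ (n_terms x n_candidates), densified to 1-D.
    sims = (tfidf[[query_idx]] @ tfidf[candidates].T).toarray().ravel()
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [(int(candidates[i]), float(sims[i])) for i in order]


# A single document as a sparse 1 x n_terms COO row vector.
row = tfidf[[0]].tocoo()

result = top_k_in_category(0, k=10)
```

Only the rows of the candidate subset are multiplied, so the full document-by-document similarity matrix is never materialised. For a large corpus, `np.argpartition` could replace the full `argsort` if sorting all candidates becomes a bottleneck.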