Tags: python, dictionary, scikit-learn, sparse-matrix, cosine-similarity

Dynamically assign similarity matrices per document to array for export to JSON


I'm pretty new to Python, so I'm sure it's something simple that I'm not doing, but I can't figure it out. I've created similarity matrices for each of the documents in my corpus, and I want to assign them back to a dictionary with keys of the document names, to keep track of similarities between each document.

However, it keeps assigning the same matrix to every key, rather than each document's own similarity matrix.

import pandas as pd
import numpy as np
import nltk
import string
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import os

path = "stories/"
token_dict = {}
stemmer = PorterStemmer()

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

def stem_tokens(tokens, stemmer):
    stemmed_words = []
    for token in tokens:
        stemmed_words.append(stemmer.stem(token))
    return stemmed_words


for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        with open(file_path, "r", encoding="utf-8") as story:
            text = story.read()
            lowers = text.lower()
            # strip punctuation before tokenizing
            table = str.maketrans('', '', string.punctuation)
            no_punctuation = lowers.translate(table)
            # key by bare file name (portable version of the backslash split)
            token_dict[os.path.basename(file_path)] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

termarray = tfs.toarray()
nparray = np.array(termarray)
rows, cols = nparray.shape

similarity = []
docdict = dict.fromkeys(token_dict)   # same document names as keys
for document in docdict:
    for row in range(0, rows-1):
        similarity = cosine_similarity(tfs[row:row+1], tfs)
        docdict[document] = similarity

Everything works as expected until the matrices are assigned back to the dictionary.

This produces a dictionary of:

{'98ststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory1.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]]),
 'alfredststory2.txt': array([[ 0.10586559,  0.04742287,  0.02478352,  0.06587952,  0.12907377,
      0.07661095,  0.06941533,  0.05443182,  0.06616549,  0.0266565 ,
      0.04640984,  0.03356339,  0.02529364,  0.08210173,  0.16172138,
      0.05594719,  0.10231466,  0.03556236,  0.18374215,  0.0588386 ,
      0.16857304,  0.08866461,  0.12510476,  0.07107058,  0.0751615 ,
      0.06371055,  0.16820855,  0.07926561,  0.02590006,  0.03690054,
      0.01513446,  0.04677632,  0.11693509,  1.        ,  0.06086615]])}

Each key ends up with the similarity row for the second-to-last document. While that part is just a simple off-by-one in the loop bounds, the real issue is that every key is assigned the same matrix.
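What is going wrong is easier to see in a stripped-down sketch with hypothetical values: the inner loop runs to completion for every key, so the key is overwritten on each pass and only the final row survives:

docdict = {'story1': None, 'story2': None}
rows = 3
for document in docdict:
    for row in range(0, rows - 1):      # stops at rows - 2: the off-by-one
        similarity = row                # recomputed on every pass
        docdict[document] = similarity  # overwrites the key every pass

print(docdict)  # {'story1': 1, 'story2': 1} -- every key holds the last value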

The matrix that I get for one document is as follows:

array([[ 1.        ,  0.07015725,  0.01593837,  0.05618977,  0.03892873,
         0.02434279,  0.06029888,  0.02261425,  0.03531677,  0.02975444,
         0.01835854,  0.02145624,  0.00985163,  0.03645598,  0.0497407 ,
         0.04482995,  0.06677013,  0.03153055,  0.10919878,  0.12029462,
         0.07255828,  0.05499581,  0.06330188,  0.04719668,  0.08909685,
         0.04484428,  0.06725359,  0.04453039,  0.02381673,  0.02639529,
         0.01012012,  0.0218679 ,  0.09989828,  0.10586559,  0.01535069]])

These are the similarities of every document to the very first document. What I want instead is a dictionary that looks something like this:

{
    story1: {
        story1: 1.,
        story2: 0.07015725,
        story3: 0.01593837,
        story4: 0.05618977...
    },
    story2: {
        story1: ...
    }
}

...and so on.

A sample data set looks something like this:

story1 = """Four other streets were renamed in Cork at the turn of the last century to celebrate this event: Wolfe Tone St. (Previously Fair Lane), John Philpot Curran St. (Philpot’s Lane), Emmet (Nelson’s) Place and Sheare’s (Nile) St."""
story2 = """Oliver Plunkett Street was originally named George's Street after George I, the then reigning King of Great Britain and Ireland. In 1920, during the Burning of Cork, large parts of the street were destroyed by British troops."""
story3 = """Alfred Street is a connecting Street between Kent Train Station and MacCurtain Street. Present Cork city centre signage uses letters inspired by the book of Kells. This has been an inspiration for many typefaces in the past, including the Petrie's 'B' typeface, and Monotype's 'Column Cille', which was widely used for school textbooks."""

Run through the script, this produces one row of similarities per document:

[[ 1.          0.05814422  0.06032458]]
[[ 0.05814422  1.          0.21323354]]
[[ 0.06032458  0.21323354  1.        ]]

Each of these is a 1×n matrix of one document's similarities to every document (itself included). What I want is to turn this into a dictionary that lets me look up the similarity between any two specific documents, like this:

{
    story1: {
                story1: 1.,
                story2: 0.05814422,
                story3: 0.06032458
            },
    story2: {
                story1: 0.05814422,
                story2: 1.,
                story3: 0.21323354
            },
    story3: {
                story1: 0.06032458,
                story2: 0.21323354,
                story3: 1.
            }
}

I'm sure this is a basic issue, but my knowledge of Python's data structures is lacking, and any help would be greatly appreciated!


Solution

  • Assuming that you have the following matrix of similarities:

    sim = cosine_similarity(tfs)
    
    In [261]: sim
    Out[261]:
    array([[ 1.        ,  0.09933054,  0.08911641],
           [ 0.09933054,  1.        ,  0.27252107],
           [ 0.08911641,  0.27252107,  1.        ]])
    

    NOTE: we don't need loops to calculate the matrix of similarities; cosine_similarity(tfs) computes all pairwise similarities in a single call
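
    To see the correspondence with the question's row-by-row calls, a quick check (reusing the variables above):

    sim = cosine_similarity(tfs)             # n x n pairwise matrix
    row0 = cosine_similarity(tfs[0:1], tfs)  # 1 x n -- the question's per-row call
    assert np.allclose(row0[0], sim[0])      # identical values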

    Using the Pandas module, we can do the following:

    In [262]: df = pd.DataFrame(sim,
                                columns=list(token_dict.keys()),
                                index=list(token_dict.keys()))
    

    DataFrame:

    In [263]: df
    Out[263]:
              story1    story2    story3
    story1  1.000000  0.099331  0.089116
    story2  0.099331  1.000000  0.272521
    story3  0.089116  0.272521  1.000000
    

    Now we can easily convert the DataFrame to a dict:

    In [264]: df.to_dict()
    Out[264]:
    {'story1': {'story1': 1.0000000000000009,
      'story2': 0.099330538266243495,
      'story3': 0.089116410701360893},
     'story2': {'story1': 0.099330538266243495,
      'story2': 0.99999999999999911,
      'story3': 0.27252107037687257},
     'story3': {'story1': 0.089116410701360893,
      'story2': 0.27252107037687257,
      'story3': 1.0}}
    
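
    The same nested dict can also be built without Pandas. A minimal sketch, assuming the keys of token_dict are in the same order as the rows of sim (true for a plain dict on Python 3.7+, since fit_transform consumed token_dict.values() in insertion order):

    names = list(token_dict.keys())   # row order assumed to match the matrix
    nested = {name: dict(zip(names, sim[i].tolist()))
              for i, name in enumerate(names)}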

    Or, export the DataFrame directly to JSON:

    df.to_json('/path/to/file.json')
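
    If the floating-point noise on the diagonal (e.g. 1.0000000000000009) is unwanted in the export, the DataFrame can be rounded first. A sketch, with 'similarities.json' as a hypothetical path:

    df.round(6).to_json('similarities.json')   # hypothetical output path

    # reading it back restores the nested name -> name -> score mapping
    import json
    with open('similarities.json') as f:
        sims = json.load(f)
    print(sims['story2']['story3'])             # 0.272521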