Cosine similarity for already known pairs of duplicates

I have a list of duplicate document pairs saved in a CSV file. Each ID in column 1 is a duplicate of the corresponding ID in column 2. The file looks something like this:

Document_ID1    Document_ID2
12345           87565
34546           45633
56453           78645
35667           67856
13636           67845

Each document ID is associated with text that is stored somewhere else. I pulled this text and saved each column of IDs, with the associated texts, into two lsm databases.
So db1 has all the IDs from Document_ID1 as keys and their corresponding texts as values, like a dictionary. db2 holds the same for all the IDs from Document_ID2.
For example, db1[12345] returns the text associated with ID 12345.

Now I want to get the cosine similarity score for each of these pairs to determine their duplicate-ness. So far I have used a tf-idf model for this: I built a tf-idf matrix with all the documents in db1 as the corpus, then measured the cosine similarity of each tf-idf vector from db2 against that matrix. For security reasons I cannot provide the complete code, but it goes like this:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Generator function to yield the text of one key (document) at a time
def corpus_gen(db):
    for key in db.keys():
        text = db[key]
        yield text

# Use spaCy to preprocess each text from the generator function
# (named spacy_proc so that it does not shadow the spacy module)
nlp = spacy.load('en')
def spacy_proc(generator_object):
    for doc in generator_object:
        words = <code to make words lower case, remove stop words, spaces and punctuations>
        yield u' '.join(words)

# TF-IDF Vectorizer
tfidf = TfidfVectorizer(min_df=2)

# Fit the tf-idf vectorizer on all db1 texts, streamed one key at a time from the generator.
tfidf_matrix = tfidf.fit_transform(spacy_proc(corpus_gen(db1)))

# Function to calculate cosine similarity values between the tfidf matrix and the tfidf vector of a new key
def similarity(tfidf_vector, tfidf_matrix, keys):    
    sim_vec = <code to get cosine similarity>
    return sim_vec.sort_values(ascending=False)

# Apply the fitted tf-idf vectorizer to each db2 key in a loop and get the cosine similarity scores for that key.
for key in db2.keys():
    # Create a new temporary db for each key from db2 to enter into generator function
    new = <code to create a temporary new lsm database>
    text = db2[key]
    new[key] = text
    new_key = <code to get next key from the temporary new lsm database>
    tfidf_vector = tfidf.transform(spacy_proc(corpus_gen(new)))
    similarity_values = similarity(tfidf_vector, tfidf_matrix, list(db1.keys()))
    for idx, i in similarity_values.iteritems():
        print new_key, idx, i
    del new[key]

But this gives me cosine similarity scores against all keys in db1 for each key in db2. For example, if there are 5 keys in db1 and 5 keys in db2, I get 25 rows as a result with this code.
What I want is the cosine similarity score between each key in db2 and only its corresponding key in db1. So with 5 keys each in db1 and db2, I should get only 5 rows: one cosine similarity score per duplicate pair.

How should I tweak my code to get that?
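
One possible way to score only the known pairs is sketched below. It is not the original code: db1, db2 and pairs are small hypothetical stand-ins (plain dicts and a list of tuples in place of the lsm databases and the CSV), the spaCy preprocessing step is left out, and paired_similarities is an illustrative helper name. It relies on TfidfVectorizer L2-normalizing each row by default, so the row-wise dot product of two aligned tf-idf matrices is the cosine similarity of each pair.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the two lsm databases and the pairs CSV.
db1 = {
    "12345": "the quick brown fox jumps over the lazy dog",
    "34546": "cosine similarity compares two term vectors",
    "56453": "duplicate documents share most of their words",
}
db2 = {
    "87565": "the quick brown fox jumped over a lazy dog",
    "45633": "cosine similarity compares a pair of term vectors",
    "78645": "completely unrelated text about cooking pasta",
}
pairs = [("12345", "87565"), ("34546", "45633"), ("56453", "78645")]


def paired_similarities(pairs, db1, db2, vectorizer):
    """Return (id1, id2, score) for each known pair only."""
    # Build two matrices whose rows are aligned: row i of `left` and
    # row i of `right` belong to the same duplicate pair.
    left = vectorizer.transform([db1[a] for a, _ in pairs])
    right = vectorizer.transform([db2[b] for _, b in pairs])
    # Rows are L2-normalized by default (norm='l2'), so the element-wise
    # product summed per row is the cosine similarity of row i with row i.
    scores = np.asarray(left.multiply(right).sum(axis=1)).ravel()
    return [(a, b, float(s)) for (a, b), s in zip(pairs, scores)]


# min_df=1 only because this toy corpus is tiny; the question uses min_df=2.
tfidf = TfidfVectorizer(min_df=1)
tfidf.fit(list(db1.values()))  # fit on db1 only, as in the question

results = paired_similarities(pairs, db1, db2, tfidf)
for id1, id2, score in results:
    print(id1, id2, round(score, 3))
```

This yields one row per pair instead of one row per combination. For a large pairs file, the two list comprehensions could be fed in batches so that only one batch of texts is held in memory at a time.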


  • Since there's no definitive answer yet, I'm building the dataframe with all the rows (the 25 rows of results in the example above) and inner-joining it with a dataframe that holds the list of duplicate pairs (i.e. the 5 rows of output that I need). The resulting dataframe then has the similarity scores for just the duplicate document pairs. This is a temporary solution. If anyone comes up with a cleaner solution that works, I'll accept it as the answer.
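
The merge workaround described in the comment above might look roughly like this. The column names follow the CSV header from the question, but the dataframes and the similarity values are made-up placeholders for illustration, not real results.

```python
import pandas as pd

# Hypothetical all-vs-all output: one row per (db2 key, db1 key) combination.
# The similarity values are invented placeholders.
all_scores = pd.DataFrame({
    "Document_ID2": [87565, 87565, 45633, 45633],
    "Document_ID1": [12345, 34546, 12345, 34546],
    "similarity": [0.91, 0.02, 0.05, 0.88],
})

# The known duplicate pairs, as they would be read from the CSV file.
pairs = pd.DataFrame({
    "Document_ID1": [12345, 34546],
    "Document_ID2": [87565, 45633],
})

# The inner join on both ID columns keeps only rows that match a known pair.
paired_scores = pairs.merge(
    all_scores, on=["Document_ID1", "Document_ID2"], how="inner"
)
print(paired_scores)
```

This still computes every combination before discarding most of them, so it only stays practical while the all-vs-all table fits in memory.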