I have a list of duplicate document pairs saved in a CSV file. Each ID in column 1 is a duplicate of the corresponding ID in column 2. The file looks something like this:
Document_ID1 Document_ID2
12345 87565
34546 45633
56453 78645
35667 67856
13636 67845
Each document ID is associated with text that is stored somewhere else. I pulled this text and saved each column of IDs, together with the associated texts, into its own LSM database, giving two databases in total.
So I have db1, which has all the IDs from Document_ID1 as keys and their corresponding texts as the values for the respective keys, much like a dictionary. Similarly, db2 holds all the IDs from Document_ID2.
So, when I say db1[12345], I get the text associated with the ID 12345.
Now, I want to get the cosine similarity score for each of these pairs to determine their duplicate-ness. So far I have used a TF-IDF model for this: I built a TF-IDF matrix with all the documents in db1 as the corpus, and measured the cosine similarity of each TF-IDF vector from db2 against that matrix. For security reasons, I cannot provide the complete code, but it goes like this:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Generator function that yields the text of one key (document) at a time
def corpus_gen(db):
    for key in db.keys():
        text = db[key]
        yield text

# Use spaCy to preprocess each text coming from the generator function
nlp = spacy.load('en')
def spacy_proc(generator_object):
    for doc in generator_object:
        words = <code to make words lower case, remove stop words, spaces and punctuations>
        yield u' '.join(words)

# TF-IDF vectorizer
tfidf = TfidfVectorizer(min_df=2)
# Fit the vectorizer on every document in db1 (one at a time via the generator) and build the tf-idf matrix
tfidf_matrix = tfidf.fit_transform(spacy_proc(corpus_gen(db1)))

# Function to calculate cosine similarity values between the tfidf matrix and the tfidf vector of a new key
def similarity(tfidf_vector, tfidf_matrix, keys):
    sim_vec = <code to get cosine similarity>
    return sim_vec.sort_values(ascending=False)

# Transform each db2 key in a loop and get the cosine similarity scores for that key
for key in db2.keys():
    # Create a new temporary db holding only this db2 key so it can go through the generator function
    new = <code to create a temporary new lsm database>
    text = db2[key]
    new[key] = text
    new_key = <code to get next key from the temporary new lsm database>
    tfidf_vector = tfidf.transform(spacy_proc(corpus_gen(new)))
    similarity_values = similarity(tfidf_vector, tfidf_matrix, list(db1.keys()))
    # One row per db1 key: (db2 key, db1 key, similarity score)
    for idx, i in similarity_values.items():
        print(new_key, idx, i)
    del new[key]
But this gives me cosine similarity scores against all keys in db1 for each key in db2. For example, if there are 5 keys in db1 and 5 keys in db2, I get 25 rows as a result with this code.
What I want is the cosine similarity score for just the corresponding key from db1 for each key in db2. That is, if there are 5 keys each in db1 and db2, I should get only 5 rows as a result: the cosine similarity score for each pair of duplicates only.
How should I tweak my code to get that?
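To make the target concrete, here is a minimal sketch of the pair-only computation I am after. It assumes the duplicate pairs are available as a list of (Document_ID1, Document_ID2) tuples, uses plain dicts as hypothetical stand-ins for db1 and db2, skips the spaCy preprocessing, and uses the default min_df so the toy corpus is not filtered out. The helper name pairwise_cosine is made up for illustration. It relies on TfidfVectorizer's default norm='l2', under which the row-wise dot product of two tf-idf vectors equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def pairwise_cosine(tfidf, texts_a, texts_b):
    """Cosine similarity of texts_a[i] against texts_b[i] only (hypothetical helper)."""
    a = tfidf.transform(texts_a)
    b = tfidf.transform(texts_b)
    # With the default norm='l2', every non-empty row has unit length,
    # so the element-wise product summed per row is the cosine similarity.
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


# Hypothetical stand-ins for db1, db2 and the duplicate-pair list
db1 = {12345: "the quick brown fox", 34546: "a lazy dog sleeps"}
db2 = {87565: "the quick brown fox", 45633: "the dog sleeps all day"}
pairs = [(12345, 87565), (34546, 45633)]

tfidf = TfidfVectorizer()   # assumption: default min_df for this toy corpus
tfidf.fit(db1.values())     # fit on db1 only, as in the code above

scores = pairwise_cosine(
    tfidf,
    [db1[id1] for id1, _ in pairs],
    [db2[id2] for _, id2 in pairs],
)

# One row per duplicate pair instead of one row per (db2 key, db1 key) combination
for (id1, id2), score in zip(pairs, scores):
    print(id1, id2, float(score))
```

This produces exactly len(pairs) rows. For a large number of pairs, the two text lists could probably be transformed in batches rather than all at once.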
Since there's no definitive answer yet, I'm building the dataframe with all the rows (the 25 rows of the example above) and inner-joining it with a dataframe that holds the list of duplicate pairs (i.e. the 5 rows of output that I need). That way, the resulting dataframe has the similarity scores for the duplicate document pairs only. This is a temporary solution. If anyone can come up with a cleaner one that works, I'll accept it as the answer.
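Roughly, the workaround looks like the sketch below. The column names and the score values are made up for illustration (the real frame comes from the loop above), and all_scores is a hypothetical name for the all-combinations result.

```python
import pandas as pd

# Hypothetical all-combinations result: one row per (db2 key, db1 key) with a made-up score
all_scores = pd.DataFrame({
    "Document_ID2": [87565, 87565, 45633, 45633],
    "Document_ID1": [12345, 34546, 12345, 34546],
    "score": [0.91, 0.10, 0.05, 0.77],
})

# The known duplicate pairs, as read from the CSV file
pairs = pd.DataFrame({
    "Document_ID1": [12345, 34546],
    "Document_ID2": [87565, 45633],
})

# The inner join keeps only the rows whose (Document_ID1, Document_ID2) is a listed pair
pair_scores = pairs.merge(all_scores, on=["Document_ID1", "Document_ID2"], how="inner")
print(pair_scores)
```

It works, but it still computes every combination first, which is the part I would like to avoid.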