Tags: python, nlp, doc2vec, open-semantic-search

Finding similarity of 1 paragraph in different documents with Doc2vec


How can I find, from a list of documents, the ones that are semantically similar to one target paragraph or document?

import os
import sys
import gensim
import smart_open
from nltk.tokenize import word_tokenize

# Set file names for the training and target data
test_data_dir = 'C:\\Users\\hamza\\Desktop\\'
train_file = os.path.join(test_data_dir, 'read-me.txt')
target_file = os.path.join(test_data_dir, 'read-me2.txt')

def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_data = list(read_corpus(train_file))
target_data = word_tokenize(read_file(target_file))

# print(target_data)
# print(train_data)
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_data)
# print(f"Word 'noise' appeared {model.wv.get_vecattr('noise', 'count')} times in the training corpus.")
model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)
inferred_vector = model.infer_vector(target_data)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
print(sims)

Output

[(1, 0.20419511198997498), (2, 0.1924923211336136), (0, 0.10696495324373245)]

Now, how can I match the target data to the training data, and how will I know how similar they are? Is there any way to scale the similarity into a percentage?


Solution

  • Despite the class name Doc2Vec, and the fact that it is based on an algorithm called 'Paragraph Vectors', this algorithm for modeling text has no inherent idea what 'paragraphs' or 'documents' are.

    It simply takes whatever texts you give it – where each text is a list-of-words – & learns a way to plot those texts into a vector-space for comparisons.
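    For instance, the training items below are nothing more than word-lists with arbitrary tags; whether each one is a sentence, a paragraph, or a whole document is entirely up to you (the texts here are made-up placeholders):

        from gensim.models.doc2vec import TaggedDocument

        # Each training item is just a list of word-tokens plus a tag; Doc2Vec
        # sees no paragraph/document structure beyond what you choose to feed it.
        docs = [
            TaggedDocument(words=['the', 'cat', 'sat', 'on', 'the', 'mat'], tags=[0]),
            TaggedDocument(words=['dogs', 'bark', 'at', 'passing', 'cars'], tags=[1]),
        ]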

    So, using it to "find, from a list of documents, the ones most semantically similar to one target paragraph or document" is one possible application:

    • Train a Doc2Vec model with your full set of texts.
    • At the end of training, the model will both (a) have learned a vector for each text; (b) have learned how to infer (via the .infer_vector() method) vectors for new texts that use the same words as the model already knows.
    • Look up the exact vector for one of your training texts with model.dv[tag]. Get a vector for a new text with model.infer_vector(list_of_words). Compare those vectors using any vector operations you'd like.
    • Get a ranked list of known texts (from the training set) that are closest to some target vector with model.dv.most_similar() - you can either supply a tag (to name one of the training documents) or a raw vector (via the positive argument) as the target point. (See the sketch just after this list.)
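    Putting those four steps together, a minimal sketch against the model from the question might look like this (the word-list passed to .infer_vector() is just a placeholder; the integer tags are the ones assigned by read_corpus() above):

        import numpy as np

        # (a) the exact vector learned during training for the text tagged 0
        trained_vec = model.dv[0]

        # (b) an inferred vector for a brand-new list of words
        new_vec = model.infer_vector(['some', 'new', 'words'])

        # (c) compare any two vectors directly, e.g. with cosine similarity
        cos_sim = np.dot(trained_vec, new_vec) / (
            np.linalg.norm(trained_vec) * np.linalg.norm(new_vec))

        # (d) ranked list of all training texts closest to the inferred vector
        ranked = model.dv.most_similar(positive=[new_vec], topn=len(model.dv))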

    If this & the tutorial aren't enough to make progress, you should explain more about what your data is, what you've tried so far, and where things haven't yet worked – with as much of your code, and as precise info about what has and hasn't been achieved yet, as possible.

    (It's nearly impossible to give a helpful answer to "guide me through this generic underspecified project". But if you say instead – "I have data D & want to achieve well-described goal G. I've tried X, but only had result or error Y so far, when my ideal result would be more like Z. What would help me get from my progress Y so far, to my desired result Z?" – then it is possible to give tangible tips/pointers/explanations.)
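
    (On the final question about percentages: the scores returned by .most_similar() are cosine similarities, which range from -1.0 to 1.0. There is no single standard way to present them as percentages, but one simple – if somewhat arbitrary – convention is to rescale that range linearly onto 0–100%, as in this sketch using the sims list from the question:)

        def similarity_as_percentage(cos_sim):
            """Linearly rescale a cosine similarity from [-1.0, 1.0] to [0, 100]."""
            return (cos_sim + 1.0) / 2.0 * 100.0

        for tag, sim in sims:
            print(f"doc {tag}: {similarity_as_percentage(sim):.1f}%")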