python gensim doc2vec

Doc2Vec: get text of the label


I've trained a Doc2Vec model and I'm trying to get predictions.

I use

test_data = word_tokenize("Филип Моррис Продактс С.А.".lower())
model = Doc2Vec.load(model_path)
v1 = model.infer_vector(test_data)
sims = model.docvecs.most_similar([v1])
print(sims)

returns

[('624319', 0.7534812092781067), ('566511', 0.7333904504776001), ('517382', 0.7264763116836548), ('523368', 0.7254455089569092), ('494248', 0.7212602496147156), ('382920', 0.7092794179916382), ('530910', 0.7086726427078247), ('513421', 0.6893941760063171), ('196931', 0.6776881814002991), ('196947', 0.6705600023269653)]

Next I tried to find out which text this label corresponds to:

model.docvecs['624319']

But it returns only the vector representation:

array([ 0.36298314, -0.8048847 , -1.4890883 , -0.3737898 , -0.00292279,
   -0.6606688 , -0.12611026, -0.14547637,  0.78830665,  0.6172428 ,
   -0.04928801,  0.36754376, -0.54034036,  0.04631123,  0.24066721,
    0.22503968,  0.02870891,  0.28329515,  0.05591608,  0.00457001],
  dtype=float32)

So, is there any way to get the text for this label from the model? Loading the training dataset takes a lot of time, so I'm trying to find another way.


Solution

  • There is no way to convert a doc vector directly back into the original text: information such as word order is lost when the text is reduced to a vector.

    However, you can retrieve the original text by tagging each document with its index in your corpus list when you are creating your TaggedDocuments for Doc2Vec(). Let's say you had a corpus of sentences/documents that are contained in a list called texts. Use enumerate() like this to generate a unique index i for each sentence, and pass that as the tags argument for TaggedDocument:

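    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from nltk.tokenize import word_tokenize  # assumed: the tokenizer used in the question

    # Tag each document with its position in `texts`, stored as a string.
    tagged_data = []
    for i, t in enumerate(texts):
        tagged_data.append(TaggedDocument(words=word_tokenize(t.lower()), tags=[str(i)]))

    model = Doc2Vec(vector_size=VEC_SIZE,
                    window=WINDOW_SIZE,
                    min_count=MIN_COUNT,
                    workers=NUM_WORKERS)

    model.build_vocab(tagged_data)
    model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)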
    

    Then after training, when you get the results from model.docvecs.most_similar(), the first element of each tuple is the tag you assigned, i.e. the index (as a string) into your original list of corpus texts. So for example, if you run model.docvecs.most_similar([some_vector]) and get:

    [('624319', 0.7534812092781067), ('566511', 0.7333904504776001), ('517382', 0.7264763116836548), ('523368', 0.7254455089569092), ('494248', 0.7212602496147156), ('382920', 0.7092794179916382), ('530910', 0.7086726427078247), ('513421', 0.6893941760063171), ('196931', 0.6776881814002991), ('196947', 0.6705600023269653)]

    ... then you could retrieve the original document for the first result, ('624319', 0.7534812092781067), by converting the tag back to an integer and indexing into your initial corpus list: texts[int('624319')].

    Or if you wanted to loop through and get all of the most similar texts, you could do something like:

    most_similar_docs = []
    for tag, score in model.docvecs.most_similar([some_vector]):
        # Tags were stored as strings, so convert back to an integer index.
        most_similar_docs.append(texts[int(tag)])
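
Since the question mentions that loading the training dataset is slow, one possible workaround is to write the tag-to-text mapping to a small on-disk store once, at training time, and query only the tags you need afterwards. The following is a minimal sketch using Python's standard sqlite3 module. The helper names build_tag_store and texts_for, the table layout, and the toy data are all hypothetical and not part of gensim; the sims list only imitates the shape of most_similar() output.

```python
import os
import sqlite3
import tempfile


def build_tag_store(texts, path):
    """Store each text under the same str(i) tag used for its TaggedDocument."""
    con = sqlite3.connect(path)
    with con:  # commits the transaction on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS docs (tag TEXT PRIMARY KEY, text TEXT)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO docs VALUES (?, ?)",
            ((str(i), t) for i, t in enumerate(texts)),
        )
    con.close()


def texts_for(sims, path):
    """Map (tag, score) pairs to (text, score) pairs without loading the corpus."""
    con = sqlite3.connect(path)
    try:
        results = []
        for tag, score in sims:
            row = con.execute(
                "SELECT text FROM docs WHERE tag = ?", (tag,)
            ).fetchone()
            # Unknown tags map to None instead of raising.
            results.append((row[0] if row else None, score))
        return results
    finally:
        con.close()


# Toy stand-ins for the real corpus and for most_similar() output.
texts = ["first document", "second document", "third document"]
sims = [("2", 0.75), ("0", 0.73)]

store_path = os.path.join(tempfile.mkdtemp(), "doc_tags.sqlite")
build_tag_store(texts, store_path)
results = texts_for(sims, store_path)
print(results)
```

With a store like this, only the rows for the returned tags are read, so the full corpus never has to be loaded just to display the matches. The store has to be built from the same texts list, in the same order, as the one used to create the tags.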