Tags: python, gensim, doc2vec

gensim docvecs.doctags incorrect indices


I'm working with a large dataset of Yelp reviews for a machine learning research project. Gensim has worked well so far; however, when I build the vocabulary with doc2vec.build_vocab() on my 5,000,000+ documents, the tags all appear to be collected into a 64-key dictionary (which should certainly not be the case).

Below is the script I made for tagging the documents, building the vocabulary, and training the model.

import os
import time
import pandas as pd
import numpy as np
from collections import namedtuple
from gensim.models.doc2vec import Doc2Vec
from keras.preprocessing.text import text_to_word_sequence

# keras helper function
def text2_word_seq(review):
  return text_to_word_sequence(review, 
       filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
       lower=True, split=" ")

# instantiate the model
d2v = Doc2Vec(vector_size=300, 
  window=6, min_count=5, workers=os.cpu_count()-1)

chunksize = 5000
train_data = pd.read_json("dataset/review.json",
    chunksize=chunksize,
    lines=True)

Review = namedtuple('Review', 'words tags')
documents = list()
for i, data in enumerate(train_data):
    print("Looked at %d chunks, %d documents" % 
       (i, i*chunksize), end='\r', flush=True)
    users = data.user_id.values
    for j, review in enumerate(data.text):
        documents.append(Review(text2_word_seq(review), users[j]))

# build the vocabulary 
d2v.build_vocab(documents.__iter__(), update=False,
   progress_per=100000, keep_raw_vocab=False, trim_rule=None)

# train the model
d2v.train(documents, total_examples=len(documents), epochs=10)
d2v.save('d2v-model-v001')

After saving the model and loading it with gensim.models.Doc2Vec.load(), the model's docvecs.doctags is of length 64. Each tag I use when building the vocabulary is a user id. The ids are not necessarily unique, but there are thousands of users (not 64). Also, the tags appear as single characters, which is not expected...

>>> len(x.docvecs.doctags)

64

>>> x.docvecs.doctags

{'Y': Doctag(offset=27, word_count=195151634, doc_count=1727798), 
'j': Doctag(offset=47, word_count=198241878, doc_count=1739169), 
'4': Doctag(offset=17, word_count=195902251, doc_count=1728095), 
'J': Doctag(offset=50, word_count=197884244, doc_count=1741666), 
'W': Doctag(offset=41, word_count=198804200, doc_count=1741269), 
'O': Doctag(offset=23, word_count=196212468, doc_count=1728735), 
'o': Doctag(offset=9, word_count=194177928, doc_count=1709768), 
'n': Doctag(offset=3, word_count=193799059, doc_count=1714620), 
'3': Doctag(offset=34, word_count=197320036, doc_count=1725467), 
'F': Doctag(offset=10, word_count=195614702, doc_count=1729058) ...

What am I doing wrong here?


Solution

  • The tags property of your text examples should be a list-of-tags. (It can be a list containing just a single tag, but it must be a list.)

    If you provide a string instead, it will look like a list-of-one-character-strings to the code expecting a list. Thus you'll train just a small number of single-character tags, one per unique character appearing across all the tag strings you provided.
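    A minimal sketch of the difference, using the asker's `Review` namedtuple (the tag value `'u1a2b3c4'` is a made-up user id for illustration):

    ```python
    from collections import namedtuple

    Review = namedtuple('Review', 'words tags')

    # Wrong: tags is a bare string. Code that iterates over the tags
    # field sees each character as a separate tag.
    bad = Review(words=['great', 'food'], tags='u1a2b3c4')
    print(list(bad.tags))    # ['u', '1', 'a', '2', 'b', '3', 'c', '4']

    # Right: wrap the tag in a list, so iteration yields the whole id.
    good = Review(words=['great', 'food'], tags=['u1a2b3c4'])
    print(list(good.tags))   # ['u1a2b3c4']
    ```

    Applied to the script above, the fix is one character pair in the tagging loop: `documents.append(Review(text2_word_seq(review), [users[j]]))`.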