machine-learning, nlp, word2vec, doc2vec

Build vocab in doc2vec


I have a list of about 500 abstracts and articles in a CSV file; each paragraph contains roughly 800 to 1000 words. Whenever I build the vocab and print it, I get None. How can I improve the results?

    import string
    import gensim
    from nltk.tokenize import word_tokenize

    # strip punctuation from a document and tokenize it
    lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
    target_data = word_tokenize(lst_doc)

    # read_data() (not shown) supplies the training corpus
    train_data = list(read_data())

    model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

    train_vocab = model.build_vocab(train_data)
    print(train_vocab)

    model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs)

Output: None


Solution

  • A call to build_vocab() only builds the vocabulary inside the model, for further use. That function call doesn't return anything, so your train_vocab variable will be Python's None.

    So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.

    If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is usually a good idea when working to learn a new library: even if the copious info shown is hard to understand at first, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc., that hint at whether things are going well or poorly.
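    For example, a minimal sketch of turning on that logging with Python's standard logging module (run it before the build_vocab()/train() calls):

        import logging

        # emit gensim's step-by-step INFO messages (vocab scan, training progress, etc.)
        logging.basicConfig(
            format='%(asctime)s : %(levelname)s : %(message)s',
            level=logging.INFO,
        )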

    You can also examine the state of the model and its various internal properties after the code has run.

    For example, after build_vocab() the model.wv property contains a Gensim KeyedVectors structure holding all the allocated, not-yet-trained word vectors. You can ask for its length (len(model.wv)) or examine the discovered list of vocabulary words (model.wv.index_to_key).
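    For instance, a short post-build_vocab() inspection along those lines (assuming the model from the question's code):

        # after model.build_vocab(train_data)
        print(len(model.wv))               # number of surviving vocabulary words
        print(model.wv.index_to_key[:20])  # a sample of the retained words
        print(model.corpus_count)          # number of documents seen in the vocab scan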

    Other comments:

    • It's not clear that your first two lines (assigning into lst_doc and target_data) affect anything further, since it's unclear what read_data() might be doing to fill train_data. (A hedged sketch of the kind of read_data() that Doc2Vec expects appears after this list.)

    • Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.

    • Only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which typically uses tens of thousands of documents (if not millions). So, keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus, in terms of quality, optimal parameters, etc.
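    Since Doc2Vec expects its training corpus to be an iterable of TaggedDocument objects, a hypothetical read_data() might look like the sketch below. The file name 'articles.csv' and the assumption that each row holds one article's text in its first column are illustrative guesses, not taken from the question:

        import csv
        import string
        from nltk.tokenize import word_tokenize
        from gensim.models.doc2vec import TaggedDocument

        def read_data(path='articles.csv'):
            # hypothetical layout: one abstract/article per row, text in the first column
            with open(path, newline='', encoding='utf-8') as f:
                for i, row in enumerate(csv.reader(f)):
                    text = row[0].translate(str.maketrans('', '', string.punctuation))
                    # each training item must be a TaggedDocument(words=..., tags=[...])
                    yield TaggedDocument(words=word_tokenize(text.lower()), tags=[i])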