Tags: gensim, word2vec, word-embedding, corpus

Structure of Gensim Word Embedding corpus


I want to train a word2vec model using Gensim on a corpus of hundreds of thousands of articles from a specific newspaper. I preprocessed the articles (lowercasing, lemmatizing, removing stop words and punctuation, etc.) and then built a list of lists, in which each element is a list of words:

corpus = [['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
          ['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement']]
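
(For reference, a minimal sketch of preprocessing along these lines, using Gensim's own helpers; raw_articles is a hypothetical list of article strings, and lemmatizing would need an extra tool such as spaCy, so it is skipped here.)

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

raw_articles = ["The first sentence, and the second dictum.",
                "The first adage, and the second judgment."]

# simple_preprocess lowercases, strips punctuation, and tokenizes;
# tokens found in Gensim's built-in STOPWORDS set are then dropped.
corpus = [[token for token in simple_preprocess(article)
           if token not in STOPWORDS]
          for article in raw_articles]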

I want to know whether this list-of-lists structure is the right way, or whether it should instead look like the following:

corpus = [['first', 'sentence'], ['second', 'dictum'], ['third', 'saying'],
          ['last', 'claim'], ['first', 'adage'], ['second', 'sentence'],
          ['third', 'judgment'], ['last', 'pronouncement']]

Solution

  • Both would minimally work.

    But in the second, no matter how large your window parameter is, the fact that all texts are no more than 2 tokens long means each word will only affect its single immediate neighbor. That's probably not what you want.

    There's no real harm in longer texts, except to note that:

    • Tokens all in the same list will appear in each other's window-sized neighborhood, so don't run together words that shouldn't imply any realistic use alongside each other. (But in a large-enough corpus, even the noise of some run-together unrelated texts won't make much difference, swamped by the real relationships in the bulk of the texts.)
    • Each text shouldn't be more than 10,000 tokens long, as an internal implementation limit causes any tokens beyond that point to be silently ignored. (A chunking workaround is sketched below.)
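
As a minimal sketch, assuming Gensim 4.x and the list-of-lists corpus from the question: split any over-long text into 10,000-token pieces, then train. The chunk helper and MAX_TOKENS name are just illustrative, not part of Gensim's API.

from gensim.models import Word2Vec

MAX_TOKENS = 10_000  # Gensim silently ignores tokens past this limit per text

def chunk(tokens, size=MAX_TOKENS):
    # Split one token list into consecutive pieces no longer than `size`.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

texts = [piece for tokens in corpus for piece in chunk(tokens)]

# min_count=1 only so the toy corpus above survives; raise it for real data.
model = Word2Vec(texts, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('sentence'))

With window=5, each token trains against up to five neighbors on each side within its own list, which is why the longer texts in the first layout give words more context than the 2-token texts in the second.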