Tags: gensim, word2vec, word-embedding, corpus

Structure of Gensim Word Embedding corpus


I want to train a word2vec model using Gensim on a corpus of hundreds of thousands of articles from a specific newspaper. I preprocessed the articles (lowercasing, lemmatizing, removing stop words and punctuation, etc.) and then built a list of lists, in which each element is a list of words:

corpus = [['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
          ['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement']]
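
(For reference, a minimal sketch of preprocessing along these lines, using Gensim's own helpers; raw_articles is a hypothetical list of article strings, and lemmatizing would need an extra tool such as spaCy, so it is skipped here.)

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

raw_articles = ["The first sentence, and the second dictum.",
                "The first adage, and the second judgment."]

# simple_preprocess lowercases, strips punctuation, and tokenizes;
# tokens found in Gensim's built-in STOPWORDS set are then dropped.
corpus = [[token for token in simple_preprocess(article)
           if token not in STOPWORDS]
          for article in raw_articles]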

I want to know whether this list-of-lists structure is the right way, or whether it should instead look like the following:

corpus = [['first', 'sentence'], ['second', 'dictum'], ['third', 'saying'],
          ['last', 'claim'], ['first', 'adage'], ['second', 'sentence'],
          ['third', 'judgment'], ['last', 'pronouncement']]

Solution

  • Both would minimally work.

    But in the second, no matter how large your window parameter is, the fact that all texts are no more than 2 tokens long means each word will only affect its single immediate neighbor. That's probably not what you want.

    There's no real harm in longer texts, except to note that:

    • Tokens all in the same list will appear in each other's window-sized neighborhood, so don't run together words that shouldn't imply any realistic use alongside each other. (But in a large-enough corpus, even the noise of some run-together unrelated texts won't make much difference, swamped by the real relationships in the bulk of the texts.)
    • Each text shouldn't be more than 10,000 tokens long, as an internal implementation limit causes any tokens beyond that point to be silently ignored. (A chunking workaround is sketched below.)
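
As a minimal sketch, assuming Gensim 4.x and the list-of-lists corpus from the question: split any over-long text into 10,000-token pieces, then train. The chunk helper and MAX_TOKENS name are just illustrative, not part of Gensim's API.

from gensim.models import Word2Vec

MAX_TOKENS = 10_000  # Gensim silently ignores tokens past this limit per text

def chunk(tokens, size=MAX_TOKENS):
    # Split one token list into consecutive pieces no longer than `size`.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

texts = [piece for tokens in corpus for piece in chunk(tokens)]

# min_count=1 only so the toy corpus above survives; raise it for real data.
model = Word2Vec(texts, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('sentence'))

With window=5, each token trains against up to five neighbors on each side within its own list, which is why the longer texts in the first layout give words more context than the 2-token texts in the second.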