I want to train a word2vec model using Gensim. My corpus consists of hundreds of thousands of articles from a specific newspaper. I preprocessed them (lowercasing, lemmatizing, removing stop words and punctuation, etc.) and then built a list of lists, in which each element is a list of words:
corpus = [['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement']]
I wanted to know whether this is the right way to structure it, or whether it should instead be like the following:
corpus = [['first', 'sentence'], ['second', 'dictum'], ['third', 'saying'], ['last', 'claim'], ['first', 'adage'], ['second', 'sentence'], ['third', 'judgment'], ['last', 'pronouncement']]
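For context, this is roughly how I plan to feed it to Gensim (a minimal sketch; the parameter values are just placeholders, not tuned):

from gensim.models import Word2Vec

# corpus is the list of token lists shown above (first format)
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('sentence'))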
Both would minimally work.
But in the second, no matter how big your window parameter is, the fact that all texts are no more than 2 tokens long means words will only affect their immediate neighbors. That's probably not what you want.
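For example, here's a quick sketch (toy data) showing why: context windows are clipped at each text's boundaries, so with 2-token texts even an oversized window buys you nothing:

from gensim.models import Word2Vec

tiny = [['first', 'sentence'], ['second', 'dictum']]
# window=10 changes nothing here: windows never cross text boundaries,
# so 'first' only ever co-occurs with 'sentence', and never with
# 'second' or 'dictum' from the other text.
model = Word2Vec(sentences=tiny, window=10, min_count=1, vector_size=50)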
There's no real harm in longer texts, except to note that words only influence each other within the same text, inside a window-sized neighborhood - so don't run together words that shouldn't imply any realistic use alongside each other. (But, in large-enough corpuses, even the noise of some run-together unrelated texts won't make much difference, swamped by the real relationships in the bulk of the texts.)
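So the natural unit is one article (or one sentence) per inner list. A hedged sketch of the contrast, reusing your example tokens:

from gensim.models import Word2Vec

# Good: one token-list per article, so windows stay within one article.
articles = [
    ['first', 'sentence', 'second', 'dictum', 'third', 'saying', 'last', 'claim'],
    ['first', 'adage', 'second', 'sentence', 'third', 'judgment', 'last', 'pronouncement'],
]
# Risky: running all articles together makes unrelated words at the seams
# look like real neighbors (and note Gensim's Word2Vec silently truncates
# any single text longer than 10,000 tokens, so a giant run-together list
# would also lose data).
run_together = [[tok for art in articles for tok in art]]
model = Word2Vec(sentences=articles, window=5, min_count=1, vector_size=100)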