Tags: nlp, word-embedding

Preprocessing a corpus for different Word Embedding Algorithms


For my bachelor's thesis I need to train different word embedding algorithms on the same corpus and benchmark them. I am trying to settle on preprocessing steps, but I am not sure which ones are useful and which ones might be less so.

I have already looked at some studies, but I also wanted to ask whether anyone here has experience with this.

My objective is to train Word2Vec, FastText and GloVe embeddings on the same corpus. I have not decided on the corpus yet, but I am thinking of Wikipedia or something similar.

In my opinion:

  • POS tagging
  • removing non-alphabetic characters with a regex or similar
  • stopword removal
  • lemmatization
  • phrase detection ("catching" frequent phrases)

are the logical options (a sketch of such a pipeline follows below).
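A minimal sketch of what such a pipeline could look like, assuming spaCy with the common en_core_web_sm model for POS tags, lemmas and stopwords, and gensim's Phrases for phrase detection; the helper name preprocess and the toy corpus are purely illustrative:

    import spacy
    from gensim.models.phrases import Phrases, Phraser

    # Keep the tagger (needed for POS tags and lemmas), drop the slower components.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def preprocess(text):
        """Lowercased lemmas of alphabetic, non-stopword tokens."""
        doc = nlp(text)
        return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]

    # Toy corpus; in practice these would be Wikipedia articles read from disk.
    raw_corpus = [
        "Word embeddings map words to dense vectors.",
        "GloVe and Word2Vec are usually trained on large corpora such as Wikipedia.",
    ]
    sentences = [preprocess(t) for t in raw_corpus]

    # "Catching phrases": merge frequent bigrams such as new_york into single tokens.
    bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
    sentences = [bigram[s] for s in sentences]
    print(sentences)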

But I have heard that stopword removal can be tricky: there is a chance that some embeddings will still contain stopwords, because an off-the-shelf stopword list does not necessarily fit every model or corpus.
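The mismatch is easy to see by comparing stopword lists directly. A small sketch (assuming NLTK with its "stopwords" data and spaCy are both installed) showing that the two libraries disagree on what counts as a stopword, so tokens filtered by one list can survive the other:

    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords
    from spacy.lang.en.stop_words import STOP_WORDS

    nltk_stops = set(stopwords.words("english"))
    spacy_stops = set(STOP_WORDS)

    only_spacy = spacy_stops - nltk_stops
    only_nltk = nltk_stops - spacy_stops
    print(len(only_spacy), "stopwords only in spaCy, e.g.", sorted(only_spacy)[:5])
    print(len(only_nltk), "stopwords only in NLTK, e.g.", sorted(only_nltk)[:5])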

I have also not decided whether to use spaCy or NLTK as the library; spaCy is more powerful, but NLTK is what is mainly used at the chair where I am writing my thesis.
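If it helps the decision, here is a quick sketch comparing the two tokenizers on the same sentence (assuming both libraries, the NLTK "punkt" data and the spaCy en_core_web_sm model are installed):

    import nltk
    nltk.download("punkt", quiet=True)
    from nltk.tokenize import word_tokenize
    import spacy

    sentence = "Word2Vec doesn't handle out-of-vocabulary words; FastText does."
    nlp = spacy.load("en_core_web_sm")

    print("NLTK :", word_tokenize(sentence))
    print("spaCy:", [tok.text for tok in nlp(sentence)])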


Solution

Preprocessing is like hyperparameter optimization or neural architecture search: there is no theoretical answer to "which one should I use". The applied side of this field (NLP) is far ahead of the theory, so you just run different combinations until you find the one that works best according to your chosen metric. A toy sketch of that loop is below.
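The sketch uses gensim's Word2Vec on a toy corpus; the preprocessing flags, the variant helper and the vocabulary-size "metric" are placeholders to be swapped for your real corpus and evaluation (analogy accuracy, similarity correlation, a downstream task, ...):

    from itertools import product
    from gensim.models import Word2Vec
    from gensim.parsing.preprocessing import STOPWORDS

    toy_corpus = [
        "Word embeddings map words to dense vectors of real numbers .".split(),
        "GloVe FastText and Word2Vec are usually trained on Wikipedia .".split(),
    ]

    def variant(tokens, lowercase, remove_stopwords):
        """Apply one combination of preprocessing choices to a token list."""
        out = [t.lower() if lowercase else t for t in tokens]
        if remove_stopwords:
            out = [t for t in out if t.lower() not in STOPWORDS]
        return [t for t in out if t.isalpha()]

    results = {}
    for lowercase, remove_stopwords in product([True, False], repeat=2):
        sentences = [variant(s, lowercase, remove_stopwords) for s in toy_corpus]
        model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=1, seed=1)
        # Placeholder metric: vocabulary size. Replace with the evaluation you actually care about.
        results[(lowercase, remove_stopwords)] = len(model.wv)

    for combo, score in sorted(results.items()):
        print(combo, "->", score)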

Yes, Wikipedia is great, and almost everyone uses it (plus other datasets). I have tried spaCy and it is powerful, but I think I made a mistake with it and ended up writing my own tokenizer, which worked better; your mileage may vary. Again, you just have to jump in and try almost everything. Check with your advisor that you have enough time and computing resources.
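The answer does not show that tokenizer; purely as an illustration, a hand-rolled tokenizer can be as small as a single regex:

    import re

    # Words, keeping internal apostrophes (don't, it's); everything else is dropped.
    TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

    def tokenize(text):
        return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

    print(tokenize("Don't over-engineer the pipeline: start simple, then iterate."))
    # ['don't', 'over', 'engineer', 'the', 'pipeline', 'start', 'simple', 'then', 'iterate']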