I'm trying to load a wiki dump (.gz) and use it with gensim word2vec. I converted it to bz2 using bzip2 in the terminal, but the WikiCorpus class seems to refuse the file. Can someone please explain to me how to get the text from a wiki dump in an easy way? Thanks.
The WikiCorpus utility class in Gensim expects the pages-articles dumps, not different dumps containing only abstracts. To read another format, you'll need to write your own code.
Some things you could try: use jq or a similar tool to dump just the relevant text from the XML element(s) of interest into a plain-text file, which you could then read line by line in Python (either to preprocess/tokenize it further or to feed it directly to Gensim's LineSentence helper class).
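If you would rather stay in Python than use a command-line tool, here is a minimal sketch of the same idea using only the standard library. It assumes the abstracts dump wraps each abstract in an `<abstract>` element, so check that against your file. The function name `extract_abstracts` and the file names are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET


def extract_abstracts(dump_path, out_path, tag="abstract"):
    """Write the text of every <tag> element to out_path, one per line.

    Hypothetical helper: the default tag name is an assumption about the
    layout of the abstracts dump, not something Gensim defines.
    """
    # Read .gz files directly; fall back to plain open() for unpacked XML.
    opener = gzip.open if str(dump_path).endswith(".gz") else open
    count = 0
    with opener(dump_path, "rb") as src, open(out_path, "w", encoding="utf-8") as dst:
        # iterparse streams the file instead of loading the whole dump.
        context = ET.iterparse(src, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                # Collapse whitespace so each abstract stays on a single line.
                text = " ".join("".join(elem.itertext()).split())
                if text:
                    dst.write(text + "\n")
                    count += 1
                # Drop already-processed records so memory use stays bounded.
                root.clear()
    return count
```

The resulting file has one abstract per line, so it should be usable with `gensim.models.word2vec.LineSentence("abstracts.txt")`, which splits each line on whitespace. If you need lowercasing or punctuation stripping, do that before writing each line.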