python, nlp, gensim, wikipedia, dump

Does gensim.corpora WikiCorpus work only with bz2 files?


I'm trying to load a Wikipedia dump (.gz) and use it with Gensim's word2vec. I converted it to bz2 using bzip2 in the terminal, but the WikiCorpus class seems to refuse the file. Can someone please explain how to get the text from a wiki dump in an easy way? Thanks.


Solution

  • The WikiCorpus utility class in Gensim expects the pages-articles dumps, not the other dumps, such as those containing only abstracts.

    To read another format, you'll need to write your own code.

    Some things you could try:

    • Study the source for the WikiCorpus class and use it as a model for your own code, adapting it to read the different elements out of your other dump.
    • Use some other utility, for example a command-line XML tool such as xmlstarlet or xmllint (jq works on JSON rather than XML), to dump just the relevant text from the XML element(s) of interest into a plain-text file, which you could then read line by line in Python (either to preprocess/tokenize it further or to feed it directly to Gensim's LineSentence helper class).
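
As a rough illustration of the "write your own code" route, here is a minimal sketch that uses only the Python standard library. It assumes the dump is a gzip-compressed XML file whose text of interest sits in `<abstract>` elements without an XML namespace; check your own dump and adjust the `tag` argument if it differs. The function names `iter_element_text` and `dump_to_text` are made up for this example and are not part of Gensim.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_element_text(dump_path, tag="abstract"):
    """Yield the stripped text of every <tag> element in a gzip-compressed XML dump.

    Assumption: elements carry no XML namespace, so elem.tag equals the bare name.
    """
    with gzip.open(dump_path, "rb") as stream:
        # iterparse reads the file incrementally instead of loading it all at once.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                text = (elem.text or "").strip()
                if text:
                    yield text
            # Drop the element's content once it has been seen to limit memory use.
            elem.clear()


def dump_to_text(dump_path, out_path, tag="abstract"):
    """Write one element text per line to out_path and return the number of lines."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in iter_element_text(dump_path, tag):
            # Collapse internal whitespace so each text stays on a single line.
            out.write(" ".join(text.split()) + "\n")
            count += 1
    return count
```

Calling `dump_to_text("abstracts.xml.gz", "abstracts.txt")` (both paths are placeholders) should leave you with a plain-text file with one abstract per line, which is the kind of input Gensim's LineSentence helper reads. Tokens are split on whitespace only, so you may still want to lowercase the text or strip punctuation first.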