python, nlp, gensim

Loading Wikipedia XML files into Gensim


I'm a complete novice to NLP and would like to load a compressed XML file of the Hungarian Wikipedia corpus (807 MB). I downloaded the dump file and started parsing it in Python with Gensim, but after 4 hours my laptop crashed because it had run out of RAM. I have a fairly old laptop (4 GB RAM) and was wondering whether there is any way I could solve this problem by

  • (1) either tinkering with my code, e.g., by reducing the corpus and taking, say, a random 1/10th sample of it;
  • (2) or using some cloud platform to get more computing power. I read in this SO post that AWS can be used for such purposes, but I am unsure which service I should select (Amazon EC2?). I also checked Google Colab, but got confused because it lists hardware acceleration options (GPU and TPU) in the context of TensorFlow, and I am not sure whether that is suitable for NLP. I didn't find any posts about that.

Here's the Jupyter Notebook code I've tried after downloading the Wikipedia dump from here:

! pip install gensim 
from nltk.stem import SnowballStemmer
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec

hun_stem = SnowballStemmer(language='hungarian')

%%time
hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')
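# materializes every tokenized article in memory as one big list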
hun_articles = list(hun_wiki.get_texts())
len(hun_articles)

Any guidance would be much appreciated.


Solution

  • 807 MB compressed will likely expand to well over 4 GB uncompressed, so you won't be able to load the whole dataset into memory on your machine.

    But, lots of NLP tasks don't require the full dataset in memory: they can just stream the data repeatedly from the disk as necessary.

    For example, whatever your ultimate goal is, you can often just iterate over the hun_wiki.get_texts() sequence, article by article, instead of loading it all into a single in-memory list with list(). (A streaming sketch follows at the end of this answer.)

    (If you really wanted to load just a subset as a list, you could take the first n articles from that iterator, or take a random subset via one of the ideas at an answer like this one; a sampling sketch also follows below.)

    Or, you could rent a cloud machine with more memory. Almost anything you choose with more memory will be suitable for running Python-based text-processing code, so just follow each service's respective tutorials to learn how to set up and log into a new rented instance.

    (4GB is quite small for modern serious work, but if you're just tinkering/learning, you can work with smaller datasets and be efficient about not loading everything into memory when not necessary.)
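As a concrete illustration of the streaming approach, here is a minimal sketch using your same huwiki-latest-pages-articles.xml.bz2 file. The Word2Vec parameters and the dictionary={} shortcut are illustrative choices, not requirements; the point is that the corpus is wrapped in a restartable iterable so Word2Vec can make its several passes over the dump without ever holding all articles in memory.

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec

class WikiSentences:
    """Restartable iterable over the dump: Word2Vec iterates it more than once
    (once to build the vocabulary, then once per training epoch)."""
    def __init__(self, path):
        # dictionary={} skips building a gensim Dictionary, which Word2Vec doesn't need
        self.corpus = WikiCorpus(path, dictionary={})
    def __iter__(self):
        # get_texts() yields one tokenized article (a list of words) at a time
        for tokens in self.corpus.get_texts():
            yield tokens

sentences = WikiSentences(r'huwiki-latest-pages-articles.xml.bz2')
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=2)
model.save('huwiki_word2vec.model')

(In gensim 4.x the old size parameter is called vector_size; if pip happens to install an older 3.x release, use size=100 instead.)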
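And if you do want a subset materialized as a list, here is a sketch of both ideas mentioned above; the figure of 10,000 articles is arbitrary. Option 1 takes the first n articles with itertools.islice; option 2 keeps a uniform random sample via reservoir sampling, still reading the dump in a single streaming pass.

import itertools
import random
from gensim.corpora import WikiCorpus

hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')

# Option 1: just the first 10,000 articles
first_articles = list(itertools.islice(hun_wiki.get_texts(), 10000))

# Option 2: a uniform random sample of 10,000 articles (reservoir sampling),
# keeping at most 10,000 articles in memory at any time
sample = []
for i, article in enumerate(hun_wiki.get_texts()):
    if i < 10000:
        sample.append(article)
    else:
        j = random.randrange(i + 1)
        if j < 10000:
            sample[j] = article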