I am a bit new here. I have a project where I have to download and use Wikipedia for NLP. The questions I am facing are as follows: I only have 12 GB of RAM, but the English wiki dump is over 15 GB compressed. Does this limit my ability to process the dump? I do not need any pictures from the wiki. Do I need to uncompress the dump before processing? Can someone tell me the steps required, or point me to related content? Thanks in advance.
The easiest way to process a Wikipedia dump is to rely on the kiwix.org ZIM dumps, which you can find at: https://wiki.kiwix.org/wiki/Content_in_all_languages
Then, using Python, you can do the following:
% wget http://download.kiwix.org/zim/wiktionary_eo_all_nopic.zim
...
% pip install --user libzim
% ipython
In [2]: from libzim.reader import File

In [3]: total = 0
   ...:
   ...: # Iterate over every article in the ZIM file and sum the size of its content.
   ...: with File("wiktionary_eo_all_nopic.zim") as reader:
   ...:     for uid in range(0, reader.article_count):
   ...:         page = reader.get_article_by_id(uid)
   ...:         total += len(page.content)
   ...:
   ...: print(total)
This is simplistic processing, but it should give you enough to get started. In particular, as of 2020, the raw Wikipedia dumps that use wikimarkup are very difficult to process, in the sense that you cannot convert wikimarkup to HTML (including infoboxes) without a full MediaWiki setup. There is also the REST API, but why struggle when the work is already done :)
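Since the ZIM articles are already rendered HTML, one common way to get plain text for NLP is to strip the tags with an HTML parser. Here is a minimal sketch, assuming BeautifulSoup is installed (pip install beautifulsoup4) and that page.content comes from the loop above; the helper name is just for illustration:

from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip tags from one ZIM article and return plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts and styles so they do not pollute the extracted text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# e.g. inside the loop above:
# text = html_to_text(bytes(page.content))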
Regarding where to store the data AFTER processing, I think the industry standard is PostgreSQL or Elasticsearch (which also requires a lot of memory), but I really like hoply, and more generally OKVS.
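If you do go the PostgreSQL route, a minimal sketch of storing one processed article with psycopg2 could look like this (the connection string, table name and columns are assumptions, not part of the original setup):

import psycopg2

# Assumed connection string and schema; adapt to your installation.
conn = psycopg2.connect("dbname=wiki user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            path text PRIMARY KEY,
            body text NOT NULL
        )
    """)
    # Placeholder path and text: in practice, insert the article path
    # and the plain text extracted from page.content.
    cur.execute(
        "INSERT INTO articles (path, body) VALUES (%s, %s) ON CONFLICT DO NOTHING",
        ("placeholder/article/path", "plain text extracted from the article"),
    )
conn.close()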