Tags: python, json, wikidata, bz2

How to parse Wikidata JSON (.bz2) file using Python?


I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (the .bz2 file from here, size ~18 GB).

However, I cannot open the file; it is just too big for my computer.

Is there a way to look into the file without extracting the full .bz2 file, ideally using Python? I know that there is a PHP dump reader (here), but I can't use it.


Solution

  • You can use the BZ2File interface (from Python's bz2 module) to read the compressed file without extracting it. You cannot, however, load the whole dump with the json module in one call: the parsed result would take far too much memory. Instead, index the file: read it line by line and save the position and length of each object of interest in a dictionary (hash table). You can then extract a given object and load it with the json module.
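
The approach above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation. It assumes the dump is a single JSON array with one entity per line (a line holding "[", then one object per line with a trailing comma, then "]"), which is how the Wikidata JSON dumps are laid out. The function names iter_entities, build_index and load_entity are made up for this example, and the offsets stored in the index refer to the uncompressed byte stream.

```python
import bz2
import json


def iter_entities(path):
    """Yield one entity dict at a time, never holding the whole dump in memory."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            # Drop the newline and the trailing comma that separates array items.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue  # skip the array brackets and blank lines
            yield json.loads(line)


def build_index(path):
    """Map entity id -> (offset, length) within the uncompressed stream."""
    index = {}
    offset = 0
    with bz2.open(path, mode="rb") as f:
        for raw in f:
            text = raw.strip().rstrip(b",")
            if text not in (b"", b"[", b"]"):
                # Parse only to read the id; the parsed object is discarded.
                index[json.loads(text)["id"]] = (offset, len(raw))
            offset += len(raw)
    return index


def load_entity(path, index, entity_id):
    """Seek to a previously indexed entity and parse just that one line."""
    offset, length = index[entity_id]
    with bz2.open(path, mode="rb") as f:
        f.seek(offset)  # offset is in uncompressed bytes
        raw = f.read(length)
    return json.loads(raw.strip().rstrip(b","))
```

If you only need a single pass over the data, iter_entities alone is enough. The index is useful when you want to come back to specific entities later, but note that seeking in a BZ2File is emulated by decompressing up to the requested position, so lookups far into an 18 GB dump can still be slow.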