python, json, gzip, wikidata

Random indexing of a large JSON file compressed with gzip


I have a large JSON file (a Wikidata dump, to be more specific) compressed with gzip. What I want to achieve is to build an index so that I can do random access and retrieve the line/entity I want. The brute-force way to find a line (entity) of interest would be:

from gzip import GzipFile

with GzipFile("path-to-wikidata/latest-all.json.gz", "r") as dump:
    for line in dump:
        # ....

An alternative that I know of is to use HDF5: do one pass over the dump and store everything of interest in an HDF5 file. However, the issue with this approach is that even a single pass over Wikidata is very slow, and writing millions of entries to the HDF5 file takes a while.
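For reference, that one-pass HDF5 approach would look roughly like the following sketch (assuming h5py; the file and dataset names are made up, and only the raw JSON of each entity is stored):

import gzip

import h5py  # assumption: h5py is available

BATCH = 10_000  # write to disk in batches rather than one entity at a time

with gzip.open("path-to-wikidata/latest-all.json.gz", "rt") as dump, \
        h5py.File("path-to-wikidata/wikidata.h5", "w") as h5:
    str_dt = h5py.string_dtype()  # variable-length UTF-8 strings
    entities = h5.create_dataset("entities", shape=(0,), maxshape=(None,),
                                 dtype=str_dt, chunks=(BATCH,))
    buffer = []

    def flush():
        # Append the buffered JSON strings to the resizable dataset
        start = entities.shape[0]
        entities.resize(start + len(buffer), axis=0)
        entities[start:] = buffer
        buffer.clear()

    for line in dump:
        payload = line.strip().rstrip(",")  # entity lines end with ","
        if not payload or payload in ("[", "]"):
            continue  # skip the surrounding JSON array brackets
        buffer.append(payload)
        if len(buffer) >= BATCH:
            flush()
    if buffer:
        flush()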

Finally, I looked into indexed_gzip, which lets me seek to a random location in the file and then read a sequence of bytes from it:

import indexed_gzip as igzip

wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
# Offset towards the end of the file
offset = 10000000000
# Seek to the desired location
wikidata.seek(offset)
# Read a sequence of bytes
length_of_sequence = 100000
data_bytes = wikidata.read(length_of_sequence)

However, the seek takes extremely long in certain cases, e.g., when seeking to locations far from the start of the file. Note that this happens only the first time I seek to a given location; every subsequent seek to it is as fast as seeking to offset 0. Evidence below:

import json
from typing import OrderedDict, Tuple

import indexed_gzip as igzip

# Example of entity2index mapping: Q31 --> [offset, length]
# The file is ordered the same way the dump is iterated, e.g.,
# the first entity in the dictionary is the first one in Wikidata
entity2index: OrderedDict[str, Tuple[int, int]] = json.load(open("path-to-wikidata/wikidata_index.json"))

# Wikidata dump
wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")

# List of entities
entities = list(entity2index.keys())

# Testing starts
entity = entities[0]
offset, _ = entity2index[entity]
# 367 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)

entity = entities[1000000]
offset, _ = entity2index[entity]
# The slowest run took 92861.95 times longer than the fastest.
# This could mean that an intermediate result is being cached.
# 2.18 s ± 5.33 s per loop (mean ± std. dev. of 7 runs, 2 loops each)
%timeit -n 2 wikidata.seek(offset)

With that said, I am interested in (1) overcoming the issue of the first seek being significantly slower than every subsequent one, or (2) any alternatives that could work better.


Solution

  • Thanks to the comment by Mark Adler, I was able to resolve the issue by pre-computing and storing two index files on disk. The first is the dictionary mentioned in the question, which maps each entity id, e.g., Q31, to the offset and length of its data in the latest-all.json.gz file. The second enables fast seeks, and I obtained it as described in the indexed_gzip documentation:

    wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz")
    wikidata.build_full_index()
    wikidata.export_index("path-to-wikidata/wikidata_seek_index.gzidx")
    

    Then, when I want to retrieve the data for a given Wikidata entity, I do:

    # First index file, mapping from Q31 --> offset and length of the chunk of data for that entity
    entity2index = json.load(open("path-to-wikidata/wikidata_index.json"))
    # Wikidata load + seeking index
    wikidata = igzip.IndexedGzipFile("path-to-wikidata/latest-all.json.gz", index_file="path-to-wikidata/wikidata_seek_index.gzidx")
    
    # Get the offset and length of the entity
    offset, length = entity2index["Q41421"]
    # Seek to the location
    wikidata.seek(offset)
    # Obtain the data chunk
    data_bytes = wikidata.read(length)
    # Load the data from the byte array
    data = json.loads(data_bytes)
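
    For completeness, the first index file (wikidata_index.json) can be built in a single pass over the decompressed dump. A minimal sketch, assuming the usual layout of the Wikidata JSON dump (a single JSON array with one entity per line, each line ending with a comma):

    import gzip
    import json
    from collections import OrderedDict

    entity2index = OrderedDict()
    # Offsets refer to the *uncompressed* stream, which is what
    # IndexedGzipFile.seek() expects
    offset = 0
    with gzip.open("path-to-wikidata/latest-all.json.gz", "rb") as dump:
        for line in dump:
            payload = line.rstrip(b"\r\n").rstrip(b",")
            if payload and payload not in (b"[", b"]"):
                entity_id = json.loads(payload)["id"]
                entity2index[entity_id] = (offset, len(payload))
            offset += len(line)

    with open("path-to-wikidata/wikidata_index.json", "w") as out:
        json.dump(entity2index, out)

    Note that json.dump stores each (offset, length) tuple as a two-element list, which matches the Q31 --> [offset, length] format mentioned above.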