Chunking API response cuts off required data


I am reading a gzip-compressed API response in chunks using the following code:

d = zlib.decompressobj(zlib.MAX_WBITS|16)  # for gzip
for i in range(0, len(data), 4096):
    chunk = data[i:i+4096]
    # print(chunk)
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    # print(str_chunk)
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        return url

An example of this working is: "@odata.nextLink":"someurl?$count=true

It works in most cases, but sometimes this key-value pair gets cut off and appears like this: "@odata.nextLink":"someurl?$coun

I can play around with the chunk size (the number of bytes) in the line for i in range(0, len(data), 4096), but no fixed size guarantees that the data is never cut off, because the data size can differ from page to page.

How can I ensure that this key-value pair is never cut off? Also, note that this key-value pair is the last line (the last key-value pair) of the API response.

P.S.: I can't play around with API request parameters.

I even tried reading it backwards, but this gives an incorrect header error:

for i in range(len(data), 0, -4096):
    chunk = data[i -4096: i]
    str_chunk = d.decompress(chunk)
    str_chunk = str_chunk.decode()
    if '"@odata.nextLink"' in str_chunk:
        ab = '{' + str_chunk[str_chunk.index('"@odata.nextLink"'):len(str_chunk)+1]
        ab = ast.literal_eval(ab)
        url = ab['@odata.nextLink']
        #print(url)
        return url

The above produces the following error which is really strange:

str_chunk = d.decompress(chunk)
zlib.error: Error -3 while decompressing data: incorrect header check

Solution

  • str_chunk is a contiguous sequence of bytes from the API response that can start anywhere in the response, and end anywhere in the response. Of course it will sometimes end in the middle of some semantic content.

    (New information from a comment, which the OP has still not added to the question: the OP requires that the entire uncompressed content not be held in memory.)

    If "@odata.nextLink" is a reliable marker for what you're looking for, then keep the last two decompressed chunks, concatenate those, then look for that marker. Once found, continue to read more chunks, concatenating them, until you have the full content you're looking for.
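A minimal sketch of that idea follows. It makes several assumptions that are not confirmed by the question: the marker occurs only once, the key-value pair is the last one in the top-level JSON object (as the OP states), and the body is UTF-8. The names find_next_link and iter_decompressed are made up for illustration. As a small variation on keeping two whole chunks, it keeps only the last len(MARKER) - 1 bytes of earlier output, which is enough to catch a marker split across a boundary. It also searches raw bytes and decodes only once at the end, so a multi-byte character split between chunks cannot break decoding, and it parses with json.loads rather than ast.literal_eval.

```python
import gzip
import json
import zlib

MARKER = b'"@odata.nextLink"'


def iter_decompressed(data, chunk_size):
    """Yield decompressed pieces of a gzip body, one compressed chunk at a time."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    for i in range(0, len(data), chunk_size):
        yield d.decompress(data[i:i + chunk_size])
    yield d.flush()  # anything still buffered inside the decompressor


def find_next_link(data, chunk_size=4096):
    """Return the @odata.nextLink URL, or None if the marker never appears.

    Assumes the pair is the last one in the top-level JSON object.
    """
    tail = b""    # end of earlier output, kept in case the marker straddles chunks
    found = None  # bytes from the marker onward, once the marker has been seen
    for out in iter_decompressed(data, chunk_size):
        if found is not None:
            found += out  # marker already seen: collect the rest of the response
            continue
        window = tail + out
        pos = window.find(MARKER)
        if pos == -1:
            # Keep just enough bytes to complete a marker that was split.
            tail = window[-(len(MARKER) - 1):]
        else:
            found = window[pos:]
    if found is None:
        return None
    # Rebuild a small JSON object around the final pair and parse it.
    return json.loads("{" + found.decode("utf-8"))["@odata.nextLink"]


# Demo with a made-up payload, standing in for the real API response.
payload = json.dumps({"value": list(range(10000)),
                      "@odata.nextLink": "someurl?$count=true"})
print(find_next_link(gzip.compress(payload.encode("utf-8"))))
```

Memory use stays bounded by one decompressed chunk plus the short tail until the marker is found; after that only the remainder of the response is kept. Reading the data backwards cannot work, because a gzip stream has to be decompressed from its header onward, which is why the reversed loop fails with "incorrect header check".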