Python Requests - Chunked Streaming


Due to a faulty server design, I have to stream down JSON and correct a null byte whenever I find one. I'm using Python Requests to do this. Each JSON event is delimited by a \n. What I'm trying to do is pull down a chunk (which will always be smaller than one log line) and search it for the end-of-event signifier ("\"status\":\d+\d+\d+}}\n").

If that signifier is there, I do something with the full JSON event; if not, I add that chunk to a buffer, b, then grab the next chunk and look for the signifier again. As soon as I get this working, I'll start searching for the null byte.

b = ""

for d in r.iter_content(chunk_size=25):

    s = re.search("\"status\":\d+\d+\d+}}\n", d)

    if s:
        d = d.split("\n", 1)
        fullLogLine = b + d[0]
        b = d[1]
    else:
        b = b + d

I'm completely losing the value of b in this case. It doesn't seem to carry over through the iter_content. Whenever I try to print the value of just b it's empty. I feel I'm missing something obvious here. Anything helps. Thanks.


Solution

  • First of all, that regex is messed up: \d+ means 'one or more digits', so why chain three of them together? You should also use a raw string for this sort of pattern, because \ is an escape character in ordinary string literals and the pattern may not be built the way you expect. You'd want to change it to re.search(r'"status":\d+}}', d).
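To make the regex point concrete, here is a small sketch. The sample string and the names EVENT_END and sample are made up for illustration; only the pattern comes from the discussion above, with the trailing \n from the question kept in.

```python
import re

# Raw string: every backslash reaches the regex engine untouched.
EVENT_END = re.compile(r'"status":\d+}}\n')

# Hypothetical data: one complete event followed by the start of another.
sample = '{"event":{"id":1,"status":200}}\n{"event":{"id":2,'

match = EVENT_END.search(sample)
first_event = sample[:match.end()]  # everything up to and including the "\n"

# Chaining \d+ three times simply demands at least three digits,
# so a one- or two-digit value would never match.
two_digits = re.search(r'\d+\d+\d+', '42')
three_digits = re.search(r'\d+\d+\d+', '200')
```

Note that '\n' in an ordinary literal is a single newline character, while r'\n' is a backslash followed by n, which the regex engine then reads as a newline itself. Both end up matching the same thing here, but the raw form keeps the pattern readable and avoids surprises with other escapes.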

    Secondly, your d.split() line can pick up a wrong \n if there are two newlines in your chunk.
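A quick sketch of that pitfall, using a made-up chunk that contains a stray newline before the real end of the event. The chunk contents and variable names are hypothetical; the idea is only that slicing at the end of the match is safer than splitting on the first \n.

```python
import re

EVENT_END = re.compile(r'"status":\d+}}\n')

# Hypothetical chunk: a newline shows up before the actual event delimiter.
chunk = 'tail\n{"e":{"status":200}}\nnext'

# split("\n", 1) cuts at the first newline, wherever the regex matched.
naive_head, naive_rest = chunk.split("\n", 1)

# Slicing at match.end() cuts at the delimiter the regex actually found.
match = EVENT_END.search(chunk)
head, rest = chunk[:match.end()], chunk[match.end():]
```

With the naive split, the finished event ends up in the leftover part and gets carried into the next iteration along with its delimiter.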

    You don't even need regex for this, good ol' Python string search/slicing is more than enough to ensure you get your delimiters right:

    logs = []  # store for our individual entries
    buffer = []  # buffer for our partial chunks
    for chunk in r.iter_content(chunk_size=25):  # read chunk-by-chunk...
        eoe = chunk.find("}}\n")  # seek the guaranteed event delimiter
        while eoe != -1:  # a potential delimiter found, let's dig deeper...
            value_index = chunk.rfind(":", 0, eoe)  # find the nearest colon before it
            if eoe-1 >= value_index >= eoe-4:  # woo hoo, there are 1-3 characters between
                try:  # lets see if it's a digit...
                    status_value = int(chunk[value_index+1:eoe])  # omg, we're getting there...
                    if chunk[value_index-8:value_index] == '"status"':  # ding, ding, a match!
                        buffer.append(chunk[:eoe+2])  # buffer everything up to the delimiter
                        logs.append("".join(buffer))  # flatten the buffer and write it to logs
                        chunk = chunk[eoe + 3:]  # remove everything before the delimiter
                        eoe = 0  # reset search position
                        buffer = []  # reset our buffer
                except (ValueError, TypeError):  # close but no cigar, ignore
                    pass  # let it slide...
            eoe = chunk.find("}}\n", eoe + 1)  # maybe there is another delimiter in the chunk...
        buffer.append(chunk)  # add the current chunk to buffer
    if buffer and buffer[0] != "":  # there is still some data in the buffer
        logs.append("".join(buffer))  # add it, even if not complete...
    
    # Do whatever you want with the `logs` list...
    

    It looks complicated, but it's quite easy to follow if you read it line by line. You'd have to deal with some of the same complications (overlapping matches and such) with a regex match too, to account for multiple event delimiters in the same chunk.
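If you would rather stay with a regex, one alternative is to search the accumulated buffer instead of each chunk on its own. With a chunk size of 25, the delimiter (or the "status" key in front of it) can easily straddle two chunks, and a per-chunk search never sees it. The sketch below is only an illustration under a few assumptions: split_events and EVENT_END are hypothetical names, the chunks are text (str) rather than bytes, and every event ends with "status":<digits>}} followed by \n as described in the question.

```python
import re

# Assumed end-of-event marker, taken from the question's description.
EVENT_END = re.compile(r'"status":\d+}}\n')


def split_events(chunks):
    """Yield complete events from an iterable of text chunks.

    Hypothetical helper: it searches the whole buffer, so a delimiter
    split across two chunks is still found once both halves arrive.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = EVENT_END.search(buffer)
            if match is None:
                break  # no complete event yet, wait for more data
            yield buffer[:match.end() - 1]  # the event without its "\n"
            buffer = buffer[match.end():]   # keep whatever follows
    if buffer:
        yield buffer  # trailing data, possibly an incomplete event


# Usage with made-up data, cut into 7-character chunks.
data = ('{"e":{"id":1,"status":200}}\n'
        '{"e":{"id":2,"status":404}}\n'
        '{"e":{"id":3,')
events = list(split_events(data[i:i + 7] for i in range(0, len(data), 7)))
```

With Requests this would be fed from r.iter_content(chunk_size=25). On Python 3 that yields bytes by default; as far as I know, passing decode_unicode=True gives you text only when the response has a known encoding, so check what you actually receive. Re-searching the whole buffer on every chunk is wasteful for very long events, but it keeps the logic short.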