Due to a faulty server design, I have to stream down JSON and correct a null byte if I find one. I'm using Python requests to do this. Each JSON event is delimited by a \n. What I am trying to do here is pull down a chunk (which will always be smaller than one log line) and search through that chunk for the end-of-event signifier ("\"status\":\d+\d+\d+}}\n").
If that signifier is there, I will do something with the full JSON event; if not, I add that chunk to a buffer, b, then grab the next chunk and look for the signifier. As soon as I get this down, I'll start searching for the null byte.
b = ""
for d in r.iter_content(chunk_size=25):
    s = re.search("\"status\":\d+\d+\d+}}\n", d)
    if s:
        d = d.split("\n", 1)
        fullLogLine = b + d[0]
        b = d[1]
    else:
        b = b + d
I'm completely losing the value of b in this case. It doesn't seem to carry over through the iter_content loop. Whenever I try to print the value of just b, it's empty. I feel I'm missing something obvious here. Anything helps. Thanks.
First of all, that regex is messed up: \d+ means 'one or more digits', so why chain three of them together? Also, you should use a raw string for this sort of pattern, since \ is treated as an escape character in ordinary string literals and the pattern may not be built the way you expect. You'd want to change it to re.search(r'"status":\d+}}', d).
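To make the regex point concrete, here is a small sketch. The sample strings are made up for illustration and are not taken from the real log format. It shows that the chained \d+\d+\d+ needs at least three digits, while a single \d+ in a raw string matches a status of any length:

```python
import re

# Hypothetical tail of a log event, made up for this illustration.
sample = '"bytes":512,"status":200}}\n'

chained = re.compile(r'"status":\d+\d+\d+}}')  # needs at least three digits
simple = re.compile(r'"status":\d+}}')         # one or more digits

chained_hit = chained.search(sample) is not None
simple_hit = simple.search(sample) is not None

# A two-digit status only matches the simpler pattern.
short = '"status":20}}\n'
chained_short_hit = chained.search(short) is not None
simple_short_hit = simple.search(short) is not None

print(chained_hit, simple_hit, chained_short_hit, simple_short_hit)
```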
Secondly, your d.split() line can pick up the wrong \n if there are two newlines in your chunk.
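A quick sketch of the split problem, using a made-up chunk that contains a stray newline before the real end-of-event marker. Splitting on the first newline cuts in the wrong place, whereas slicing at the end of the regex match cuts right after the marker:

```python
import re

# Made-up chunk with a stray newline before the real end-of-event marker.
chunk = 'a\nb,"status":200}}\n{"next'

# Splitting on the first newline cuts at the stray one.
naive_head, naive_tail = chunk.split("\n", 1)

# Slicing at the end of the regex match cuts right after the marker.
match = re.search(r'"status":\d+}}\n', chunk)
head, tail = chunk[:match.end()], chunk[match.end():]

print(repr(naive_head), repr(head), repr(tail))
```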
You don't even need regex for this; good ol' Python string search/slicing is more than enough to ensure you get your delimiters right:
logs = []  # store for our individual entries
buffer = []  # buffer for our partial chunks
for chunk in r.iter_content(chunk_size=25):  # read chunk-by-chunk...
    eoe = chunk.find("}}\n")  # seek the guaranteed event delimiter
    while eoe != -1:  # a potential delimiter found, let's dig deeper...
        value_index = chunk.rfind(":", 0, eoe)  # find the first colon before it
        if eoe-1 >= value_index >= eoe-4:  # woo hoo, there are 1-3 characters between
            try:  # let's see if it's a digit...
                status_value = int(chunk[value_index+1:eoe])  # omg, we're getting there...
                if chunk[value_index-8:value_index] == '"status"':  # ding, ding, a match!
                    buffer.append(chunk[:eoe+2])  # buffer everything up to the delimiter
                    logs.append("".join(buffer))  # flatten the buffer and write it to logs
                    chunk = chunk[eoe + 3:]  # remove everything before the delimiter
                    eoe = 0  # reset search position
                    buffer = []  # reset our buffer
            except (ValueError, TypeError):  # close but no cigar, ignore
                pass  # let it slide...
        eoe = chunk.find("}}\n", eoe + 1)  # maybe there is another delimiter in the chunk...
    buffer.append(chunk)  # add the current chunk to buffer
if buffer and buffer[0] != "":  # there is still some data in the buffer
    logs.append("".join(buffer))  # add it, even if not complete...
# Do whatever you want with the `logs` list...
It looks complicated but it's actually quite easy if you read it line by line, and you'll have to do some of these complexities (overlapping matches and such) with a regex match, too (to account for potential multiple event delimiters in the same chunk).
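For comparison, here is a minimal sketch of what the regex route might look like. It is not the code above but a swapped-in variant: it searches the accumulated buffer instead of the individual chunk, so a marker that is split across two chunks is still found. The helper name split_events and the sample stream are hypothetical, and the chunks are assumed to be text rather than bytes:

```python
import re

# End-of-event marker from the question: "status":<digits>}} plus a newline.
END_OF_EVENT = re.compile(r'"status":\d+}}\n')


def split_events(chunks):
    """Yield complete events from an iterable of text chunks.

    Hypothetical helper, not part of requests: it searches the whole
    buffer rather than a single chunk.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = END_OF_EVENT.search(buffer)
        while match:  # one chunk may complete more than one event
            yield buffer[:match.end() - 1]  # drop the trailing newline
            buffer = buffer[match.end():]
            match = END_OF_EVENT.search(buffer)
    if buffer:  # trailing, possibly incomplete, data
        yield buffer


# Simulated stream (made up), cut into 25-character pieces like chunk_size=25.
stream = '{"a":{"msg":"first","status":200}}\n{"a":{"msg":"second","status":404}}\n'
chunks = [stream[i:i + 25] for i in range(0, len(stream), 25)]
events = list(split_events(chunks))
print(events)
```

With requests, the chunks would come from r.iter_content(chunk_size=25). Under Python 3 those chunks are bytes by default, so they would need to be decoded first, or the pattern and buffer switched to bytes.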