I'm parsing very large (5 GB to 2 TB) compressed JSON files and storing some of the data in CSV files with the algorithm below. It works, but it is the opposite of efficient due to the three nested loops.
I'm also unsure of the cost of a few lines of code, due to my unfamiliarity with the json and yaml libraries provided by Python:
k = yaml.load(json.dumps(v))
If you didn't notice, I already call yaml.load() above that line, with:
header = yaml.load(json.dumps(header))
It seems that I had to call the function twice, because otherwise the inner leaves (values) of the keys from header were interpreted as strings.
When I simply print out the value of v in the line for k, v in header.iteritems():, the output generally looks like one of these lines:
[{'value': ['4-55251088-0 0NNN RT(1535855435726 0) q(0 -1 -1 -1) r(0 -1)'], 'key': 'x_iinfo'}]
[{'value': ['timeout=60'], 'key': 'keep_alive'}, {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'}]
[{'value': ['W/"12765-1490784752000"'], 'key': 'etag'}, {'value': ['Sun, 02 Sep 2018 02:27:16 GMT'], 'key': 'date'}]
[{'value': ['Sun, 02 Sep 2018 02:30:32 GMT'], 'key': 'date'}]
So basically, our file has a category called 'unknown', which is a JSON tree containing everything that doesn't belong to a specific category.
Is there a better way to get all of these values without slowing the algorithm down by adding two more loops?
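For reference, here is a small, purely illustrative check of what that json.dumps/yaml.load round-trip does to one of the sample values above (a hypothetical snippet; it uses yaml.safe_load because newer PyYAML versions warn when yaml.load is called without a Loader):

import json
import yaml

# one of the printed sample values, copied from the output above
v = [{'value': ['timeout=60'], 'key': 'keep_alive'},
     {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'}]

# the dumped JSON text is also valid YAML, so parsing it back
# simply yields an equivalent list of dicts
roundTripped = yaml.safe_load(json.dumps(v))
print(roundTripped == v)  # True

# the inner loop then just walks that list of dicts
for i in roundTripped:
    print(str(i['key']) + ": " + str(i['value']))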
Full method source:
def convertJsonHeadersToCSV(jsonFilePath, CSVFilePath, portNum, protocol):
    try:
        bodyPattern = re.compile('<(html|!DOCTYPE).*$', re.IGNORECASE | re.MULTILINE)
        csvFile = open(CSVFilePath, 'w')
        print("Converting " + protocol + " file to csv, please wait...")
        spinner.start()
        csvWriter = unicodecsv.writer(csvFile)
        csvWriter.writerow(['ip', 'date', 'protocol', 'port', 'data'])
        chunk_size = 128 * 1024 * 1024
        with lz4.frame.open(jsonFilePath, 'r') as f:
            for line in f:
                try:
                    text = ""
                    jsonData = json.loads(line)
                    ts = jsonData['timestamp'][:10]
                    ip = jsonData['ip']
                    data = jsonData['data']['http']
                    if 'response' in data:
                        if 'headers' in data['response']:
                            header = jsonData['data']['http']['response']['headers']
                            header = yaml.load(json.dumps(header))
                            for k, v in header.iteritems():
                                if 'unknown' in k:
                                    #print(v)
                                    k = yaml.load(json.dumps(v))
                                    for i in k:
                                        #print(str(i['key']) + ": " + str(i['value']) + "\r\n")
                                        text = text + str(str(i['key']) + ": " + str(i['value']) + "\r\n")
                                else:
                                    text = text + str(str(k) + ": " + str(v) + "\r\n")
                            #csvWriter.writerow([ip, ts, protocol, portNum, text])
                except:  # sometimes will run into a unicode error, still working on handling this exception.
                    pass
        csvFile.close()
        spinner.stop()
        print("Completed conversion of " + protocol + " file.")
    except Exception as ex:
        spinner.stop()
        traceback.print_exc()
        print("An error occurred while converting the file, moving on to the next task...")
What would speed this up hugely, for sure, would be to stop using text as a string, because these lines:

    text = text + str(str(i['key']) + ": " + str(i['value']) + "\r\n")
else:
    text = text + str(str(k) + ": " + str(v) + "\r\n")

are performing string concatenation. Since strings are immutable, a new copy must be made each time (even with text += instead of text = text +, so that is of no help), and the bigger the string to copy, the slower (quadratic complexity).
It would be better to define text as an empty list at the start, append the pieces to it, and "".join them in the end. So:
for line in f:
    try:
        text = []  # define an empty list at start
        jsonData = json.loads(line)
then (using str.format would also be an improvement here, but that's minor):
        text.append(str(str(i['key']) + ": " + str(i['value']) + "\r\n"))
    else:
        text.append(str(str(k) + ": " + str(v) + "\r\n"))
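For reference, those two appends written with str.format (hypothetical, using the same keys as above) could look like:

# inside the inner for loop
text.append("{0}: {1}\r\n".format(i['key'], i['value']))
# in the else branch
text.append("{0}: {1}\r\n".format(k, v))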
and in the end "mutate" text
into a string like this:
text = "".join(text)
or just
csvWriter.writerow([ip, ts, protocol, portNum, "".join(text)])
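Putting those pieces together, a minimal, untested sketch of the inner loop with the list/join change (keeping the original variable names and the surrounding csvWriter, protocol and portNum, and folding in the str.format suggestion) could look like:

for line in f:
    try:
        text = []                       # collect pieces in a list instead of a string
        jsonData = json.loads(line)
        ts = jsonData['timestamp'][:10]
        ip = jsonData['ip']
        data = jsonData['data']['http']
        if 'response' in data:
            if 'headers' in data['response']:
                header = yaml.load(json.dumps(data['response']['headers']))
                for k, v in header.iteritems():
                    if 'unknown' in k:
                        for i in yaml.load(json.dumps(v)):
                            text.append("{0}: {1}\r\n".format(i['key'], i['value']))
                    else:
                        text.append("{0}: {1}\r\n".format(k, v))
                # join once, only when the row is written
                csvWriter.writerow([ip, ts, protocol, portNum, "".join(text)])
    except Exception:  # the original swallows unicode errors here
        pass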