
How can I improve my method for parsing lz4 compressed json?


I'm parsing very large (5 GB to 2 TB) lz4-compressed JSON files and storing some of the data in CSV files with the method below. It works, but is far from efficient due to the three nested loops.

I'm also unsure of the cost of a few lines of code, due to my unfamiliarity with Python's json and yaml libraries:

k = yaml.load(json.dumps(v))

As you may have noticed, I already call yaml.load() above that line with:

header = yaml.load(json.dumps(header))

It seems I had to call the function twice, because otherwise the inner leaves (values) of the keys from header were interpreted as strings.
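(Editor's note: a minimal sketch, assuming those stray values really are JSON-encoded strings. In that case, decoding just the string with json.loads is much cheaper than the yaml.load(json.dumps(...)) round trip, which re-serialises and then re-parses the whole structure with the slower YAML parser. The variable v below is illustrative sample data, not taken from the original file.)

```python
import json

# A value that arrived as a JSON-encoded string rather than a parsed list.
v = '[{"value": ["timeout=60"], "key": "keep_alive"}]'

# Decode only that string; no YAML, no re-serialisation of the parent dict.
decoded = json.loads(v)

print(type(decoded).__name__)   # list
print(decoded[0]['key'])        # keep_alive
```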

When I simply print the value of v in the line for k, v in header.iteritems():, the output generally looks like one of these:

[{'value': ['4-55251088-0 0NNN RT(1535855435726 0) q(0 -1 -1 -1) r(0 -1)'], 'key': 'x_iinfo'}]
[{'value': ['timeout=60'], 'key': 'keep_alive'}, {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'}]
[{'value': ['W/"12765-1490784752000"'], 'key': 'etag'}, {'value': ['Sun, 02 Sep 2018 02:27:16 GMT'], 'key': 'date'}]
[{'value': ['Sun, 02 Sep 2018 02:30:32 GMT'], 'key': 'date'}]
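(Editor's note: once a value has that shape, flattening it into "key: value" lines needs only a single pass over the list of dicts — no extra yaml.load call. A sketch, using made-up sample data in the same shape as the lines above; the names entries and lines are illustrative.)

```python
# One of the list-of-dicts values shown above.
entries = [{'value': ['timeout=60'], 'key': 'keep_alive'},
           {'value': ['Sun, 02 Sep 2018 02:30:35 GMT'], 'key': 'date'}]

# Build each "key: value" line in a single loop over the parsed list.
lines = ["{}: {}".format(e['key'], e['value']) for e in entries]

print("\r\n".join(lines))
# keep_alive: ['timeout=60']
# date: ['Sun, 02 Sep 2018 02:30:35 GMT']
```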

So basically, our file contains a category called 'unknown', which is a JSON subtree that includes everything without a specific category; its value is a list of key/value dicts like the ones above.

Is there a better way to get all of these values without slowing the algorithm down by adding two more loops?

Full method source:

def convertJsonHeadersToCSV(jsonFilePath, CSVFilePath,portNum, protocol):
  try:
    bodyPattern = re.compile('<(html|!DOCTYPE).*$', re.IGNORECASE | re.MULTILINE)
    csvFile = open(CSVFilePath, 'w')
    print("Converting " + protocol + " file to csv, please wait...")
    spinner.start()
    csvWriter = unicodecsv.writer(csvFile)
    csvWriter.writerow(['ip', 'date', 'protocol', 'port', 'data'])
    chunk_size = 128 * 1024 * 1024
    with lz4.frame.open(jsonFilePath, 'r') as f:
      for line in f:
        try:
          text = ""
          jsonData = json.loads(line)
          ts = jsonData['timestamp'][:10]
          ip = jsonData['ip']
          data = jsonData['data']['http']
          if 'response' in data:
            if 'headers' in data['response']:
              header = jsonData['data']['http']['response']['headers']
              header = yaml.load(json.dumps(header))
              for k, v in header.iteritems():
                if 'unknown' in k:
                  #print(v)
                  k = yaml.load(json.dumps(v))
                  for i in k:
                    #print(str(i['key']) + ": "+str(i['value']) + "\r\n")
                    text = text + str(str(i['key']) + ": "+str(i['value']) + "\r\n")
                else:
                  text = text + str(str(k) + ": "+str(v) + "\r\n")
              #csvWriter.writerow([ip, ts, protocol, portNum, text])

        except:#sometimes will run into a unicode error, still working on handling this exception.
          pass
    csvFile.close()
    spinner.stop()
    print("Completed conversion of " + protocol + " file.")
  except Exception as ex:
    spinner.stop()
    traceback.print_exc()
    print("An error occurred while converting the file, moving on to the next task...")

Solution

  • What would speed this up hugely, for sure, is to stop using text as a string, because these lines:

        text = text + str(str(i['key']) + ": "+str(i['value']) + "\r\n")
    else:
      text = text + str(str(k) + ": "+str(v) + "\r\n")
    

    perform string concatenation. Since strings are immutable, a new copy must be made each time (even with text += instead of text = text +, so that isn't of any help), and the bigger the string to copy, the slower it gets (quadratic complexity overall).

    It would be better to:

    • define text as an empty list
    • append to the list
    • use "".join in the end

    so

     for line in f:
        try:
          text = []   # define an empty list at start
          jsonData = json.loads(line)
    

    then (using str.format would also be an improvement here, but that's minor)

           text.append(str(str(i['key']) + ": "+str(i['value']) + "\r\n"))
        else:
          text.append(str(str(k) + ": "+str(v) + "\r\n"))
    

    and at the end, join text into a single string like this:

    text = "".join(text)
    

    or just

    csvWriter.writerow([ip, ts, protocol, portNum, "".join(text)])
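
(Editor's note: putting the answer's pieces together, the header-flattening part of the loop might look like the sketch below. The helper name headers_to_text is made up for illustration; header is assumed to be the already-parsed dict from json.loads(line), and .items() replaces the Python 2 .iteritems().)

```python
def headers_to_text(header):
    """Flatten a parsed headers dict into one 'key: value' text blob.

    Sketch of the list-append/join approach: build the parts in a list,
    then do a single linear-time join instead of repeated string copies.
    """
    parts = []
    for k, v in header.items():   # .iteritems() on Python 2
        if 'unknown' in k:
            # v is a list of {'key': ..., 'value': ...} dicts.
            for item in v:
                parts.append("{}: {}\r\n".format(item['key'], item['value']))
        else:
            parts.append("{}: {}\r\n".format(k, v))
    return "".join(parts)
```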