python · list · nested-for-loop

Is there a faster way to parse a large JSON list of lists?


I am fetching responses from thousands of API calls, each of which returns a new JSON document, since the API is paginated. The result is a list of lists, with each inner list holding the parsed JSON of one page. The following code is how I am successfully parsing the data:

import csv
import...


def save(csv_data):
    with open(today_date + ".csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(csv_data)


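    # NOTE: the block below runs inside the main parsing routine, which is
    # truncated above; all_pet_data, pet_count, error_count, ids and csv_data
    # are assumed to be initialised there.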
    for p in range(pet_length):
        loop_length = 0
        try:
            pet_count += len(all_pet_data[p]['leaderboard'])
        except Exception as e:
            print(f'{e} in outer loop')
            print(f"Data is {all_pet_data[p]}")
            error_count += 1
        try:
            loop_length = len(all_pet_data[p]['leaderboard'])
        except Exception as e:
            print(f'{e} in inner loop length check')
            error_count += 1
        for i in range(loop_length):
            pet = all_pet_data[p]['leaderboard'][i]
            ids.append(pet['id'])
            csv_data.append([pet['id'], pet['level'], pet['name']])
    # several cleaning passes to remove any blanks or repeats from the final data:
    csv_data = [c for c in csv_data if c != []]
    for c in range(len(csv_data)):
        for s in range(len(csv_data)):
            if c != s and csv_data[c][0] == csv_data[s][0]:
                csv_data[s] = ['delete', 'delete', 'delete']
    csv_data = [c for c in csv_data if c != ['delete', 'delete', 'delete']]
    # debugging checks:
    print(f'error_count: {error_count}')
    print(f'Total pets grabbed: {pet_count}')
    print(f'Total ids grabbed: {len(ids)}')
    print(f'Unique pets grabbed: {len(csv_data)}')
    csv_data = [c for c in csv_data if c != [[]]]
    save(csv_data)  # saves the data as a CSV

The issue is that with thousands of API pages, and up to 1k "pets" per page, these nested loops get massive. In general, the final result is expected to be around 80k rows: roughly 4k outer loops with around 20-25 inner loops each on average. So it is running very slowly. I believe the bottleneck is the two nested for loops, but I've included the actual CSV save in case that is poorly written. The calls themselves take a long time and are prone to errors, so I haven't pinned down exactly where the slowdown is. Is there something I could be doing better here to speed all of this up?

I should note that I can't figure out any logic to remove duplicates before the nested cleanup loop, because the repeated ids can appear anywhere. Plus, I'm using multi-threading to fetch everything, so the pages arrive in random order.


Solution

  • You are doing a lot of appends to a list; the Python wiki's TimeComplexity page notes that

    Individual actions may take surprisingly long, depending on the history of the container

    therefore I suggest you try collections.deque as a replacement. Since you later access elements of csv_data by index, convert it back to a list before doing so, i.e. something like

    import collections

    csv_data = collections.deque()
    # ... loops that append to csv_data ...
    csv_data = list(csv_data)
    # ... loops that access csv_data by index ...
    

    If you are okay with using external modules, you might consider a faster JSON parser, for example ujson.
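    ujson is intended as a drop-in replacement for the standard json module, so the swap is usually just the import; a minimal sketch, assuming the raw page bodies are collected in a list called responses (a name made up for this example):

    import ujson  # pip install ujson

    # responses is assumed to hold the raw JSON text of each page
    all_pet_data = [ujson.loads(r) for r in responses]

    Separately, the quadratic duplicate-removal pass in the question can be collapsed into a single pass by tracking seen ids in a set, since set membership checks don't depend on the order the threads return pages in; a minimal sketch, assuming each row's first element is the id:

    def dedupe(rows):
        """Keep the first occurrence of each id and drop blank rows."""
        seen = set()
        unique = []
        for row in rows:
            if not row:            # skip empty rows
                continue
            if row[0] in seen:     # this id was already kept once
                continue
            seen.add(row[0])
            unique.append(row)
        return unique

    csv_data = dedupe(csv_data)

    That turns the cleanup from O(n²) into O(n), which matters at ~80k rows.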