Tags: json, python-3.x, performance, pandas-datareader

Reading a big JSON file with multiple objects in Python


I have a big gzip-compressed JSON file where each line is a JSON object (i.e., a Python dictionary).

Here is an example of the first two lines:

  {"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}

So, I'm using the following code to read each line into a Pandas DataFrame:

import json
import gzip
import pandas as pd
import random

with gzip.GzipFile('data/000000000000.json.gz', 'r') as fin:
    data_lan = pd.DataFrame()
    for line in fin:
        data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)

But it's taking forever. Any suggestions for reading the data more quickly?

EDIT: Here is what finally solved the problem:

import json
import gzip
import pandas as pd

with gzip.GzipFile('data/000000000000.json.gz', 'r') as fin:
    data_lan = []
    for line in fin:
        data_lan.append(json.loads(line.decode('utf-8')))

data = pd.DataFrame(data_lan)
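
As an aside, recent pandas versions can also do the whole job in a single call with pd.read_json (a minimal sketch, assuming your pandas version supports the lines and compression parameters):

import pandas as pd

# lines=True parses one JSON object per line; compression is inferred
# from the .gz extension, but can be passed explicitly as shown here.
data = pd.read_json('data/000000000000.json.gz', lines=True, compression='gzip')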

Solution

  • I've worked on a similar problem myself. The append() approach is quite slow, because it builds a brand-new DataFrame on every iteration. I generally load the JSON file into a list of dicts and then create the DataFrame all at once. This way, you keep the flexibility that lists give you, and you only convert to a DataFrame once you're sure about the data in the list. Below is an implementation of the concept (see also the note on line-delimited files after the code):

    import json
    import gzip

    import pandas as pd


    def get_contents_from_json(file_path) -> list:
        """
        Reads the contents of the JSON file into a list of records.
        :param file_path: Path to the gzip-compressed JSON file.
        :return: A list of all records in the file.
        """
        try:
            with gzip.open(file_path) as file:
                contents = file.read()
            # json.loads returns a list when the file holds a JSON Array
            return json.loads(contents.decode('UTF-8'))
        except json.JSONDecodeError:
            print('Error while reading json file')
        except FileNotFoundError:
            print(f'The JSON file was not found at the given path: \n{file_path}')


    def main(file_path: str):
        file_contents = get_contents_from_json(file_path)
        if not isinstance(file_contents, list):
            # I've assumed you have a JSON Array in your file;
            # if not, let me know in the comments
            raise TypeError("The file doesn't have a JSON Array!!!")
        all_columns = file_contents[0].keys()
        data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
        # size == rows * columns, so dividing by the column count gives the row count
        print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')


    if __name__ == '__main__':
        main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
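
    One caveat: the file in the question is line-delimited JSON (one object per line), not a single JSON Array, so json.loads() on the whole decompressed contents would raise JSONDecodeError. A minimal sketch for that case (the helper name is made up for illustration; it assumes UTF-8 encoding):

    import gzip
    import json

    import pandas as pd


    def get_records_from_ndjson(file_path) -> list:
        """
        Parses a gzip-compressed, line-delimited JSON file into a list of dicts.
        """
        records = []
        # 'rt' decompresses and decodes in one step, so each line is already a str
        with gzip.open(file_path, 'rt', encoding='utf-8') as file:
            for line in file:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
        return records


    data = pd.DataFrame(get_records_from_ndjson('data/000000000000.json.gz'))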