Tags: python, mysql, json, python-2.7, ijson

Load 1.4 GB of JSON data into MySQL using ijson (Python)


I came across several threads recommending ijson for loading huge JSON files in Python, since it streams the file instead of reading it all into memory.

My file is around 1.4 GB in size. It has several top-level nodes (see the image below), and I am interested in only one of them, c_driver_location, which holds most of the data.

[image: JSON_1.4GB]

My goal: extract the data under the c_driver_location node only and insert it into a MySQL table (DDL below, with columns id, drv_fb_id, latitude, longitude, timestamp).

Table DDL:

create table drv_locations_backup7May2017 (
    id bigint unsigned auto_increment primary key,
    drv_fb_id varchar(50),
    latitude DECIMAL(10, 8) NOT NULL,
    longitude DECIMAL(11, 8) NOT NULL,
    timestamp int
)

My problem: I ran the first part of the code below (everything before the MySQL connection), but it has been running for 20 hours and still has not finished parsing the JSON. (I tested it on a smaller file and it worked fine.)

Is there a way to make this faster and more efficient?

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import ijson
import pymysql.cursors
import pymysql


filename = "D:\json_file.json"
drv_col_list = ['drv_fb_id','latitude','longitude','timestamp']
drv_df = DataFrame(columns = drv_col_list)
drv_df.timestamp = drv_df.timestamp.astype(int)

counter = 0
with open(filename, 'r') as fd:
    parser = ijson.parse(fd)
    for prefix, event, value in parser:
        # each map_key directly under c_driver_location is a driver's id
        if prefix == 'c_driver_location' and str(event) == 'map_key':
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        # end of one driver's map: append the collected row to the DataFrame
        elif prefix.endswith(drv_fb_id) and str(event) == 'end_map':
            drv_df = drv_df.append(pd.DataFrame({'drv_fb_id': drv_fb_id, 'latitude': latitude, 'longitude': longitude, 'timestamp': timestamp}, index=[0]), ignore_index=True)
connection = pymysql.connect(host='53.000.00.00',
                             port = 3306,
                             user='user',
                             password='abcdefg',
                             db ='newdb',
                             # charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
# write to mysql 
drv_df.to_sql(con=connection, name='drv_locations_backup7May2017', if_exists='replace', flavor='mysql')                                               
connection.close()

Solution

  • You just need to modify your code slightly to produce a data dump. Appending to a DataFrame row by row builds a new frame on every append, which is what makes your version crawl; writing each row straight to a text file avoids that entirely.

    import ijson


    # raw strings so the backslashes are not treated as escape sequences
    outfile = r"D:\upload_data.txt"
    filename = r"D:\json_file.json"
    drv_col_list = ['drv_fb_id', 'latitude', 'longitude', 'timestamp']

    counter = 0
    drv_fb_id = latitude = longitude = timestamp = None

    with open(outfile, 'w') as ofile, open(filename, 'r') as fd:
        # header row, so the dump is self-describing
        ofile.write(",".join(drv_col_list) + "\n")
        parser = ijson.parse(fd)
        for prefix, event, value in parser:
            if prefix == 'c_driver_location' and str(event) == 'map_key':
                drv_fb_id = value
                counter = counter + 1
            elif prefix.endswith('.latitude'):
                latitude = value
            elif prefix.endswith('.longitude'):
                longitude = value
            elif prefix.endswith('.timestamp'):
                timestamp = value
            # end of one driver's map: write the collected row
            elif drv_fb_id and prefix.endswith(drv_fb_id) and str(event) == 'end_map':
                ofile.write(",".join(map(str, [drv_fb_id, latitude, longitude, timestamp])) + "\n")
    

    Now you have comma-delimited output (with a header row) in D:\upload_data.txt

    The code is untested.
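
    If you want to push that dump into MySQL from Python as well, one option is LOAD DATA LOCAL INFILE. Below is a minimal sketch with pymysql, equally untested; it assumes the server and client both permit LOCAL INFILE, reuses the placeholder host and credentials from your question, and skips the header row written above.

    import pymysql

    # assumes the table from the question's DDL already exists
    connection = pymysql.connect(host='53.000.00.00',
                                 port=3306,
                                 user='user',
                                 password='abcdefg',
                                 db='newdb',
                                 local_infile=True)  # required for LOCAL INFILE
    try:
        with connection.cursor() as cursor:
            # forward slashes in the path avoid backslash-escaping issues
            cursor.execute("""
                LOAD DATA LOCAL INFILE 'D:/upload_data.txt'
                INTO TABLE drv_locations_backup7May2017
                FIELDS TERMINATED BY ','
                LINES TERMINATED BY '\\n'
                IGNORE 1 LINES
                (drv_fb_id, latitude, longitude, timestamp)
            """)
        connection.commit()  # make the bulk load permanent
    finally:
        connection.close()

    The id column is left out of the column list so auto_increment fills it in.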

    I do not have a test MySQL database at the moment. I trust that the MySQL manual is easy to follow. Your table structure is not really complicated.
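
    As a side note: recent versions of ijson also offer kvitems, which yields the key/value pairs under a prefix directly, so you do not have to track parser events by hand. A sketch, under the assumption that each value under c_driver_location is a map with latitude, longitude and timestamp keys (which is what the prefixes in your code suggest):

    import ijson

    filename = r"D:\json_file.json"

    with open(filename, 'rb') as fd:
        # yields one (driver id, record dict) pair per entry under c_driver_location
        for drv_fb_id, record in ijson.kvitems(fd, 'c_driver_location'):
            print(drv_fb_id, record['latitude'], record['longitude'], record['timestamp'])

    Each record is materialised one at a time, so memory use stays flat regardless of file size.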