Tags: python, pandas, csv, glob

Reading csv files with glob to pass data to a database is very slow


I have many csv files and I am trying to pass all the data they contain into a database. For this, I found that I could use the glob library to iterate over all the csv files in my folder. Here is the code I used:

import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"
# headers is defined elsewhere in my code; something like:
headers = {"Content-Type": "application/json"}

for fname in glob.glob(path):
    print(fname)

    df = pd.read_csv(fname)

    for index, row in df.iterrows():
        #print(row['ID'], row['timestamp'], row['date'], row['time'],
        #     row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])

        print(row['timestamp'])

        testjson = {"data":
                        {"installationid": row['ID'],
                         "active": row['pwrA'],
                         "reactive": row['rpwrA'],
                         "current": row['curA'],
                         "voltage": row['vltA'],
                         "frq": row['frq'],
                         }, "timestamp": row['timestamp']}

        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)
        response = req.post(
            endpoint, data=json_data, headers=headers)

This code seems to work fine in the beginning. However, after some time it becomes really slow (I notice this because I print the timestamp as I upload the data) and eventually it stops completely. What is the reason for this? Am I doing something here that is really inefficient?


Solution

  • I can see 3 possible problems here:

    1. memory: read_csv is fast, but it loads the content of a whole file into memory. If the files are really large, you could exhaust the physical memory and start using swap, which has terrible performance.
    2. iterrows: you seem to build a dataframe - meaning a data structure optimized for column-wise access - only to then access it by rows. That is already a bad idea, and iterrows is known to have terrible performance because it builds a Series for every row.
    3. one POST request per row. An HTTP request has its own overhead, but furthermore, this means that you add rows to the database one at a time. If this is the only interface for your database you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole; this often provides a gain of more than one order of magnitude (see the batching sketch after this list).
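
    For illustration, here is a minimal sketch of point 3, under the assumption that the endpoint accepts several entries in its "payload" list in a single request (the payload shape in the question suggests it might, but that is an assumption about the API):

import glob
import json
import pandas as pd
import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}  # assumed; use whatever the API requires
BATCH_SIZE = 500  # illustrative; tune to what the endpoint accepts

def native(v):
    # numpy scalars are not JSON serializable; .item() turns them into
    # plain Python ints/floats, everything else passes through unchanged
    return v.item() if hasattr(v, "item") else v

for fname in glob.glob("test/*.csv"):
    df = pd.read_csv(fname)

    # build all the entries first (itertuples is much cheaper than iterrows),
    # then send them in batches instead of one request per row
    entries = [{"data": {"installationid": native(r.ID),
                         "active": native(r.pwrA),
                         "reactive": native(r.rpwrA),
                         "current": native(r.curA),
                         "voltage": native(r.vltA),
                         "frq": native(r.frq)},
                "timestamp": native(r.timestamp)}
               for r in df.itertuples(index=False)]

    for i in range(0, len(entries), BATCH_SIZE):
        batch = {"payload": entries[i:i + BATCH_SIZE]}
        req.post(endpoint, data=json.dumps(batch), headers=headers)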

    Without more info I can hardly say more, but IMHO the highest gain is to be found in how the database is fed, so in point 3. If nothing can be done on that point, or if further performance gains are required, I would try to replace pandas with the csv module, which is row-oriented and has a limited memory footprint because it only processes one line at a time whatever the file size; a sketch of that approach follows.
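
    A minimal sketch of that csv-module approach, with the same batching assumption as above (note that csv.DictReader yields strings, so convert values if the endpoint expects numbers):

import csv
import glob
import json
import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}  # assumed; use whatever the API requires
BATCH_SIZE = 500  # illustrative

def post_batch(batch):
    # send one request for a whole batch of rows instead of one per row
    req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)

for fname in glob.glob("test/*.csv"):
    with open(fname, newline="") as f:
        batch = []
        # DictReader yields one row at a time, so memory use stays flat
        # regardless of the file size; values are strings, convert them
        # (e.g. float(row["pwrA"])) if the endpoint expects numbers
        for row in csv.DictReader(f):
            batch.append({"data": {"installationid": row["ID"],
                                   "active": row["pwrA"],
                                   "reactive": row["rpwrA"],
                                   "current": row["curA"],
                                   "voltage": row["vltA"],
                                   "frq": row["frq"]},
                          "timestamp": row["timestamp"]})
            if len(batch) >= BATCH_SIZE:
                post_batch(batch)
                batch = []
        if batch:  # whatever is left over at the end of the file
            post_batch(batch)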

    Finally, and if it makes sense for your use case, I would try to use one thread that reads the csv files and feeds a queue, plus a pool of threads that send the requests to the database. That should let you absorb the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor. A rough sketch of that layout follows.
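
    A rough sketch of that producer/consumer layout, assuming the same columns as in the question; the queue size and worker count are illustrative:

import csv
import glob
import json
import queue
import threading
import requests as req

endpoint = "testEndpoint"
headers = {"Content-Type": "application/json"}  # assumed
N_WORKERS = 4                  # illustrative; tune to the endpoint
q = queue.Queue(maxsize=1000)  # bounded, so the reader cannot outrun the senders

def reader():
    # single producer: read every csv row and put it on the queue
    for fname in glob.glob("test/*.csv"):
        with open(fname, newline="") as f:
            for row in csv.DictReader(f):
                q.put(row)
    for _ in range(N_WORKERS):  # one sentinel per worker to signal shutdown
        q.put(None)

def sender():
    # consumer: pop rows off the queue and POST them; shown one row per
    # request for brevity, but the batching from the earlier sketches
    # applies just as well here
    while True:
        row = q.get()
        if row is None:
            break
        entry = {"data": {"installationid": row["ID"],
                          "active": row["pwrA"],
                          "reactive": row["rpwrA"],
                          "current": row["curA"],
                          "voltage": row["vltA"],
                          "frq": row["frq"]},
                 "timestamp": row["timestamp"]}
        req.post(endpoint, data=json.dumps({"payload": [entry]}), headers=headers)

threads = [threading.Thread(target=reader)]
threads += [threading.Thread(target=sender) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()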