python-3.x, pandas, chunks

Handling large CSV files in Python3 using concat function


I am trying to build a dataframe with two columns from 50 CSV files, each having about 5000 rows and around 15 columns. When I tried to build it without chunking, it used up a lot of memory and the process got killed. Now I am reading the files in chunks and concatenating them. The only problem is that when I concat the chunks, the header is kept for each chunk, and when I print head() for the df, it gives me only the head rows of the last chunk. Also, is there any other way to make my code run faster? I have read that using the concat function in a for loop makes it slower. My code goes like this:

import os
import csv
import urllib.request as urllib
import datetime as dt
import pandas as pd
import pandas_datareader.data as web
import nsepy as nse

def saveNiftySymbols():
    url = "https://www.nseindia.com/content/indices/ind_nifty50list.csv"
# pretend to be a chrome 47 browser on a windows 10 machine
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
    req = urllib.Request(url, headers = headers)
# open the url 
    x = urllib.urlopen(req)
    sourceCode = x.read().decode('utf-8') 

    cr = csv.DictReader(sourceCode.splitlines())
    l = [row['Symbol'] for row in cr]
    return l

def symbolToPath(symbol, path='/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/'):
    return os.path.join(path,"{}.csv".format(str(symbol)))

def combinedNifty(l):
    mainDf=pd.DataFrame()

    for symbol in l:
        chunks=pd.read_csv(symbolToPath(symbol),chunksize=10,usecols=['Date','Close'],index_col='Date',parse_dates=True)
        df=pd.DataFrame()
        for chunk in chunks:
            df=pd.concat([chunk])

            df.rename(columns={'Close':symbol}, inplace=True)


        if mainDf.empty:
            mainDf = df
        else:
            mainDf = mainDf.join(df, how='outer')

    print(mainDf.head())
    mainDf.to_csv('combinedNifty.csv')


combinedNifty(saveNiftySymbols())

Solution

  • The only problem is that when I concat the chunks, the header is kept for each chunk, and when I print head() for the df, it gives me only the head rows of the last chunk

    This is because only the last chunk actually ends up in your df. When you run this line:

    df=pd.concat([chunk])
    

    You are actually redefining df by concatenating only your current chunk with nothing else. It is effectively as if you were doing:

    for chunk in chunks:
        df = chunk
    

    That is why calling the head() method shows you only that last chunk. You don't need a for loop to concatenate the chunks: pd.concat takes a list (or any iterable) of dataframes and concatenates them in one call, so you only need to do:

    df = pd.concat(chunks)
    

    This should also improve performance, because it is much better to call concat once with many dataframes than to do something like df = pd.concat([df, chunk]) inside a for loop (which is probably what you intended to do in your original code), since repeated concatenation copies the accumulated data on every iteration.
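
    For reference, here is a sketch of how your combinedNifty function could look with that change applied; it reuses your symbolToPath helper and the same column names and paths as in the code above, and renames the 'Close' column once per file after concatenating instead of inside the chunk loop:

    def combinedNifty(l):
        mainDf = pd.DataFrame()

        for symbol in l:
            # read_csv with chunksize returns an iterator of dataframes
            chunks = pd.read_csv(symbolToPath(symbol), chunksize=10,
                                 usecols=['Date', 'Close'],
                                 index_col='Date', parse_dates=True)
            # pd.concat accepts that iterator and stitches all chunks
            # for this symbol back together in a single call
            df = pd.concat(chunks)
            df.rename(columns={'Close': symbol}, inplace=True)

            if mainDf.empty:
                mainDf = df
            else:
                mainDf = mainDf.join(df, how='outer')

        print(mainDf.head())
        mainDf.to_csv('combinedNifty.csv')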