Tags: python, python-3.x, pandas, pandas-groupby, chunking

Use pandas to handle a massive CSV file


Reading a bulk CSV file works fine when the file has around 5 million rows, but the same code fails on a massive file of around 300 million rows. Is there any way to improve the code, or a chunking approach, that would improve the processing time?

import pandas as pd
import timeit

# reads the entire file into memory at once
df = pd.read_csv(
    '/home/mahmoudod/Desktop/to_dict/text1.txt',
    dtype='unicode',
    index_col=False,
    error_bad_lines=False,
    sep=';',
    low_memory=False,
    names=['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION'],
)

#df.DATE = pd.to_datetime(df.DATE)

# aggregate per (IMSI, WEBSITE) pair
group = df.groupby(['IMSI', 'WEBSITE']).agg({
    'DATE': ['min', 'max'],
    'LINKUP': 'sum',
    'LINKDOWN': 'sum',
    'COUNT': 'max',
    'CONNECTION': 'sum',
})
group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')

Solution

  • One solution is offered by dask.dataframe, which chunks internally:

    import dask.dataframe as dd
    
    df = dd.read_csv(...)
    group = df.groupby(...).aggregate({...}).compute()
    group.to_csv('output.txt')
    

    This isn't tested. I suggest you read the documentation to familiarize yourself with the syntax. The important point to understand is that dd.read_csv does not read the whole file into memory, and no operations are executed until compute is called, at which point dask processes the data partition by partition in roughly constant memory.
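
    For illustration only, here is a fuller, untested sketch adapted to the columns and aggregation from the question. The path and column names are taken from your code; the dd.to_numeric casts and the optional blocksize setting are assumptions on my part (your original code read everything as strings), so adjust them to match your data.

    import dask.dataframe as dd

    cols = ['DATE', 'IMSI', 'WEBSITE', 'LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION']

    # Lazily define the read; nothing is loaded yet.
    df = dd.read_csv(
        '/home/mahmoudod/Desktop/to_dict/text1.txt',
        sep=';',
        names=cols,
        dtype=str,            # read everything as strings, as in the original code
        # blocksize='64MB',   # optional: size of each partition read into memory
    )

    # Assumption: these columns are numeric and should be summed/maxed as numbers,
    # so cast them before aggregating (invalid values become NaN).
    for col in ['LINKUP', 'LINKDOWN', 'COUNT', 'CONNECTION']:
        df[col] = dd.to_numeric(df[col], errors='coerce')

    # The aggregation mirrors the pandas version; compute() triggers the
    # chunked execution and returns an ordinary pandas DataFrame.
    group = df.groupby(['IMSI', 'WEBSITE']).agg({
        'DATE': ['min', 'max'],
        'LINKUP': 'sum',
        'LINKDOWN': 'sum',
        'COUNT': 'max',
        'CONNECTION': 'sum',
    }).compute()

    group.to_csv('/home/mahmoudod/Desktop/to_dict/output.txt')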