Tags: python, pandas, csv, readfile, dask

Python: how can I read and process an 18 GB CSV file?


I have an 18 GB CSV file of measurement data and want to run some calculations on it. I tried pandas, but just reading the file seems to take forever.

This is what I tried:

import math as mt
import numpy as np
import pandas as pd

df=pd.read_csv('/Users/gaoyingqiang/Desktop/D989_Leistung.csv',usecols=[1,2],sep=';',encoding='gbk',iterator=True,chunksize=1000000)
df=pd.concat(df,ignore_index=True)

U1=df['Kanal 1-1 [V]']
I1=df['Kanal 1-2 [V]']

c=[]
for num in range(0,16333660,333340):
    lu=sum(U1[num:num+333340]*U1[num:num+333340])/333340
    li=sum(I1[num:num+333340]*I1[num:num+333340])/333340
    lui=sum(I1[num:num+333340]*U1[num:num+333340])/333340
    c.append(180*mt.acos(2*lui/mt.sqrt(4*lu*li))/np.pi)
    lu=0
    li=0
    lui=0

phase=pd.DataFrame(c)
phase.to_excel('/Users/gaoyingqiang/Desktop/Phaseverschiebung_1.xlsx',sheet_name='Sheet1')

Is there any way to speed this up?


Solution

  • With chunksize set, read_csv returns a TextFileReader, not a DataFrame, so the chunks have to be concatenated first:

    df = pd.concat(df, ignore_index=True)
    

    Sample:

    import pandas as pd
    from io import StringIO
    
    temp=u"""id,col1,col2,col3
    1,13,15,14
    1,13,15,14
    1,12,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    2,18,15,13
    3,14,15,13
    3,14,15,13
    3,14,185,213"""
    df = pd.read_csv(StringIO(temp), chunksize=3)
    print(df)
    <pandas.io.parsers.TextFileReader object at 0x000000000D6E2EF0>
    
    df = pd.concat(df, ignore_index=True)
    print(df)
        id  col1  col2  col3
    0    1    13    15    14
    1    1    13    15    14
    2    1    12    15    13
    3    2    18    15    13
    4    2    18    15    13
    5    2    18    15    13
    6    2    18    15    13
    7    2    18    15    13
    8    2    18    15    13
    9    3    14    15    13
    10   3    14    15    13
    11   3    14   185   213
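
Concatenating every chunk still loads the whole file into memory. Since the calculation in the question works on fixed blocks of 333340 rows, one possible alternative is to set chunksize to the block length and reduce each chunk as it arrives. The following is only a minimal sketch under stated assumptions: the helper name phase_per_block is hypothetical, the synthetic two-column data stands in for the real file, a trailing partial block is skipped, and the expression 2*lui/sqrt(4*lu*li) from the question is simplified to the equivalent lui/sqrt(lu*li).

```python
import io

import numpy as np
import pandas as pd


def phase_per_block(source, block_rows, u_col, i_col, **read_kwargs):
    """Return one phase angle in degrees per full block of `block_rows` rows.

    Hypothetical helper, not from the original post. Only one block is
    held in memory at a time.
    """
    phases = []
    reader = pd.read_csv(
        source, usecols=[u_col, i_col], chunksize=block_rows, **read_kwargs
    )
    for chunk in reader:
        if len(chunk) < block_rows:
            break  # assumption: ignore a trailing partial block
        u = chunk[u_col].to_numpy(dtype=float)
        i = chunk[i_col].to_numpy(dtype=float)
        lu = np.mean(u * u)    # mean of U squared
        li = np.mean(i * i)    # mean of I squared
        lui = np.mean(u * i)   # mean of U times I
        # Clip to guard against rounding pushing the ratio outside [-1, 1].
        cos_phi = np.clip(lui / np.sqrt(lu * li), -1.0, 1.0)
        phases.append(float(np.degrees(np.arccos(cos_phi))))
    return phases


# Synthetic stand-in for the real file: 4 blocks of 1000 rows,
# two sine waves shifted by 60 degrees, 5 full periods per block.
block_rows = 1000
t = np.arange(4 * block_rows) / block_rows
demo = pd.DataFrame({
    "Kanal 1-1 [V]": np.sin(2 * np.pi * 5 * t),
    "Kanal 1-2 [V]": np.sin(2 * np.pi * 5 * t - np.pi / 3),
})
buffer = io.StringIO(demo.to_csv(sep=";", index=False))

phases = phase_per_block(
    buffer, block_rows, "Kanal 1-1 [V]", "Kanal 1-2 [V]", sep=";"
)
print(np.round(phases, 3))
```

For the real file, the same function would presumably be called with the file path, block_rows=333340, and encoding='gbk' passed through as an extra keyword argument. Whether this is fast enough depends on how quickly the file itself can be parsed, which this sketch does not change.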