Tags: python, pandas, parsing, multiline, fixed-width

Slow parsing of fixed-width, alternating-line file to pandas dataframe


I have written a function to parse this wind file (wind.txt, ~1 MB) into a pandas dataframe, but it's pretty slow (according to my colleague) because of the nastiness of the file format. The file linked above is just a subset of the larger file, which has hourly wind data from 1900 to 2016. Here's a snippet of the file:

2000  1  1 CCB Wdir   5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14
2000  1  1 CCB Wspd  10  8  6  8  7  7  8  8  6  8  9  7 16 16  7 10 12 14 15 17 18 22 22 20
2000  1  2 CCB Wdir  14 14 14 14 14 16 16 16 16 15 15 16 17 17 16 17 16 16 16 15 15 15 15 16
2000  1  2 CCB Wspd  17 16 15 17 15 15 16 14 14 15 17 16 15 13 14 15 15 21 20 20 18 25 23 21
2000  1  3 CCB Wdir  15 15 15 16 15 16 16 16 16 16 16 20 18 22 28 27 26 31 32 32 33 33 35 33
2000  1  3 CCB Wspd  20 22 22 18 20 21 21 22 18 16 14 13 15  6  3  7  8  8 13 13 15 10  6  7

The columns are year, month, day, site name, variable name, hour 00, hour 01, hour 02, ... , hour 23. Wind direction and wind speed appear on alternating lines for each day and the 24 hourly measurements for a single day are all on the same line.
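To make the layout concrete, here is a minimal sketch that splits one record from the snippet into its fields and pairs each hourly value with a timestamp. The variable names (`fields`, `hourly`, `record`) are illustrative only, and whitespace splitting is an assumption that holds for the lines shown here.

```python
from datetime import datetime

# One record copied from the snippet above
line = "2000  1  1 CCB Wdir   5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14"

fields = line.split()

# First five fields identify the record
year, month, day = (int(v) for v in fields[:3])
site, var = fields[3], fields[4]

# Remaining 24 fields are the values for hours 00..23
hourly = [int(v) for v in fields[5:]]

# Pair each value with the hour it belongs to
stamps = [datetime(year, month, day, h) for h in range(24)]
record = dict(zip(stamps, hourly))
```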

What I'm doing is reading the contents of this file into a single pandas dataframe with a datetime index (hourly frequency) and two columns (wdir and wspd). My parser is below:

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

fil = 'D:\\wind.txt'
with open(fil, 'r') as f:
    lines = f.readlines()
nl = len(lines)

# Direction and speed records alternate line by line
wdir = lines[0:nl:2]
wspd = lines[1:nl:2]

# Hourly index from hour 00 of the first day to hour 23 of the last day
first = wdir[0].split()
start = datetime(int(first[0]), int(first[1]), int(first[2]), 0)
last = wdir[-1].split()
end = datetime(int(last[0]), int(last[1]), int(last[2]), 23)
drange = pd.date_range(start, end, freq='h')

wind = pd.DataFrame(np.nan, index=drange, columns=['wdir','wspd'])

idate = start

for d in range(nl // 2):
    dirStr = wdir[d].split()
    spdStr = wspd[d].split()
    for h in range(24):
        # -9 marks a missing value; keep NaN unless both values are present
        if dirStr[h+5] != '-9' and spdStr[h+5] != '-9':
            wind.loc[idate, 'wdir'] = int(dirStr[h+5]) * 10
            wind.loc[idate, 'wspd'] = int(spdStr[h+5])
        idate += timedelta(hours=1)
        # Progress message once per year
        if idate.month == 1 and idate.day == 1 and idate.hour == 1:
            print(idate)

Right now it takes about 2.5 seconds to parse a single year, which I think is pretty good. However, my colleague thinks it should be possible to parse the full data file in a few seconds. Is he right? Am I wasting precious time writing slow, clunky parsers?

I work on a massive, legacy FORTRAN77 model, and I have a couple dozen similar parsers for various input/output files so that I can analyze, create, and modify them in Python. If I could save time in each of them, I would like to know how. Many thanks!


Solution

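  • I'd use pd.read_fwf(...) or pd.read_csv(..., sep=r'\s+'); both are designed to parse files like this. (Older pandas also accepts delim_whitespace=True for the same purpose, but that argument has been deprecated since pandas 2.2.)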

    Demo:

    import pandas as pd

    # Column names: five identifier fields followed by hours '00'..'23'
    cols = ['year', 'month', 'day', 'site', 'var'] + ['{:02d}'.format(i) for i in range(24)]
    
    fn = r'C:\Temp\.data\43763897.txt'
    
    # sep=r'\s+' splits on runs of whitespace; -9 is read as a missing value
    df = pd.read_csv(fn, names=cols, sep=r'\s+', na_values=['-9'])
    # Wide to long: one row per (day, variable, hour)
    x = pd.melt(df,
                id_vars=['year','month','day','site','var'],
                value_vars=df.columns[5:].tolist(),
                var_name='hour')
    x['date'] = pd.to_datetime(x[['year','month','day','hour']], errors='coerce')
    # Long to wide: one column per variable (Wdir, Wspd)
    x = (x.drop(columns=['year','month','day','hour'])
          .pivot_table(index=['date','site'], columns='var', values='value')
          .reset_index())
    

    Result:

    In [12]: x
    Out[12]:
    var                   date site  Wdir  Wspd
    0      2000-01-01 00:00:00  CCB   5.0  10.0
    1      2000-01-01 01:00:00  CCB  11.0   8.0
    2      2000-01-01 02:00:00  CCB  15.0   6.0
    3      2000-01-01 03:00:00  CCB  14.0   8.0
    4      2000-01-01 04:00:00  CCB  14.0   7.0
    5      2000-01-01 05:00:00  CCB  14.0   7.0
    6      2000-01-01 06:00:00  CCB  14.0   8.0
    7      2000-01-01 07:00:00  CCB  16.0   8.0
    8      2000-01-01 08:00:00  CCB  15.0   6.0
    9      2000-01-01 09:00:00  CCB  15.0   8.0
    ...                    ...  ...   ...   ...
    149030 2016-12-31 14:00:00  CCB   0.0   0.0
    149031 2016-12-31 15:00:00  CCB   1.0   5.0
    149032 2016-12-31 16:00:00  CCB  33.0   8.0
    149033 2016-12-31 17:00:00  CCB  34.0   9.0
    149034 2016-12-31 18:00:00  CCB  35.0   7.0
    149035 2016-12-31 19:00:00  CCB   0.0   0.0
    149036 2016-12-31 20:00:00  CCB  12.0   8.0
    149037 2016-12-31 21:00:00  CCB  13.0   7.0
    149038 2016-12-31 22:00:00  CCB  15.0   7.0
    149039 2016-12-31 23:00:00  CCB  17.0   7.0
    
    [149040 rows x 4 columns]
    

    Timing with your wind.txt file:

    In [10]: %%timeit
        ...: cols = ['year', 'month', 'day', 'site', 'var'] + ['{:02d}'.format(i) for i in range(24)]
        ...: fn = r'D:\download\wind.txt'
        ...: df = pd.read_csv(fn, names=cols, delim_whitespace=True, na_values=['-9'])
        ...: x = pd.melt(df,
        ...:             id_vars=['year','month','day','site','var'],
        ...:             value_vars=df.columns[5:].tolist(),
        ...:             var_name='hour')
        ...: x['date'] = pd.to_datetime(x[['year','month','day','hour']], errors='coerce')
        ...: x = (x.drop(['year','month','day','hour'], 1)
        ...:       .pivot_table(index=['date','site'], columns='var', values='value')
        ...:       .reset_index())
        ...:
    1 loop, best of 3: 812 ms per loop
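The demo above uses read_csv and leaves the result with date and site as ordinary columns. As a rough sketch of the other option mentioned at the top, the block below reads the six sample lines from the question with pd.read_fwf and reshapes them into the layout the question asked for: an hourly datetime index with wdir and wspd columns. Several things here are assumptions rather than facts about the real file: the field widths are inferred from the spacing of the snippet, the Wdir and Wspd lines are taken to alternate strictly with matching days (as the question describes), and the direction is multiplied by 10 and both values are blanked when either is missing only because the question's own parser does so. The reshape with NumPy is a swapped-in alternative to melt/pivot_table, not the method used in the demo.

```python
import io

import numpy as np
import pandas as pd

# Six sample lines copied from the question
sample = """\
2000  1  1 CCB Wdir   5 11 15 14 14 14 14 16 15 15 15 15 13 12 16 16 15 15 15 15 15 14 14 14
2000  1  1 CCB Wspd  10  8  6  8  7  7  8  8  6  8  9  7 16 16  7 10 12 14 15 17 18 22 22 20
2000  1  2 CCB Wdir  14 14 14 14 14 16 16 16 16 15 15 16 17 17 16 17 16 16 16 15 15 15 15 16
2000  1  2 CCB Wspd  17 16 15 17 15 15 16 14 14 15 17 16 15 13 14 15 15 21 20 20 18 25 23 21
2000  1  3 CCB Wdir  15 15 15 16 15 16 16 16 16 16 16 20 18 22 28 27 26 31 32 32 33 33 35 33
2000  1  3 CCB Wspd  20 22 22 18 20 21 21 22 18 16 14 13 15  6  3  7  8  8 13 13 15 10  6  7
"""

hours = ['{:02d}'.format(i) for i in range(24)]
cols = ['year', 'month', 'day', 'site', 'var'] + hours

# ASSUMPTION: field widths inferred from the snippet's spacing
widths = [4, 3, 3, 4, 5, 4] + [3] * 23

df = pd.read_fwf(io.StringIO(sample), widths=widths, names=cols,
                 header=None, na_values=['-9'])

# ASSUMPTION: one Wdir row and one Wspd row per day, in the same order
wdir_rows = df[df['var'] == 'Wdir']
wspd_rows = df[df['var'] == 'Wspd']

# One timestamp per (day, hour): broadcast each day against hours 0..23
days = pd.to_datetime(wdir_rows[['year', 'month', 'day']]).to_numpy()
stamps = (days[:, None] + np.arange(24).astype('timedelta64[h]')).ravel()

# Flatten the 24 hourly columns row by row so values line up with stamps
wind = pd.DataFrame(
    {'wdir': wdir_rows[hours].to_numpy(dtype=float).ravel() * 10,
     'wspd': wspd_rows[hours].to_numpy(dtype=float).ravel()},
    index=pd.DatetimeIndex(stamps))

# Mirror the question's parser: blank both values if either is missing
wind.loc[wind.isna().any(axis=1), :] = np.nan
```

For the real file, `io.StringIO(sample)` would be replaced by the file path. Because the reshape relies on row order rather than on grouping, it is worth checking that `wdir_rows` and `wspd_rows` have the same length and the same dates before trusting the result.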