Python Pandas: Data Downsampling


My data looks like this:

TEST
2012-05-01 00:00:00.203 OFF 0
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00:33.203 ON 1
2012-05-01 00:00:44.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
2012-05-02 00:00:11.203 OFF 0
2012-05-02 00:00:22.203 OFF 0
2012-05-02 00:00:33.203 ON 1
2012-05-02 00:00:44.203 ON 1
2012-05-02 00:00:55.203 OFF 0

Ultimately, I want to downsample data like this to individual days, using the mean, min, and max values, for example. I cannot get it to work for my data and get this error:

TypeError: unhashable type: 'list'

Perhaps it has something to do with the date format in the data frame since an index line looks like this:

[datetime.datetime(2012, 5, 1, 0, 0, 0, 203000)]   OFF  0

Can anyone help? My code so far is this:

import time
import dateutil.parser
from pandas import *
from pandas.core.datetools import *



t0 = time.clock()

filename = "testdata.dat"

index = []
data = []

with open(filename) as f:
    for line in f:
        if not line.startswith('TEST'):
            line_content =  line.split(' ')

            mydatetime =  dateutil.parser.parse(line_content[0] +  " " + line_content[1])

            del line_content[0] # delete the date
            del line_content[0] # delete the time so that only values remain

            index_row = [mydatetime]
            data_row = []
            for item in line_content:
                data_row.append(item)

            index.append(index_row)
            data.append(data_row)


df = DataFrame(data, index = index)
print df.head()
print df.tail()

print
date_from =  index[0] # first datetime entry in data frame
print date_from
date_to =  index[len(index)-1] # last datetime entry in data frame
print date_to

print date_to[0] - date_from[0]
dayly= DateRange(date_from[0], date_to[0], offset=datetools.DateOffset())
print dayly

grouped = df.groupby(dayly.asof)
#print grouped.mean()
#df2 = df.groupby(daily.asof).agg({'2':np_mean})


time2 = time.clock() - t0
print time2

Solution

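  • The TypeError comes from wrapping each timestamp in a list (index_row = [mydatetime]), so every index entry is an unhashable list rather than a datetime. It is better to leave all the date-time parsing to pandas and just feed it a clean input stream. You can then separate the fields using read_fwf (for fixed-width formatted lines). For example: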

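    import pandas
    import StringIO
    
    # Copy every line except the 'TEST' separators into an in-memory buffer.
    buf = StringIO.StringIO()
    buf.write(''.join(line
        for line in open('f.txt')
        if not line.startswith('TEST')))
    buf.seek(0)  # rewind so pandas reads from the start
    
    # Column boundaries: timestamp (0-24), switch state (24-27), value (27-30).
    df = pandas.read_fwf(buf, [(0, 24), (24, 27), (27, 30)],
            index_col=0, names=['switch', 'value'])
    print df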
    

    Output:

                            switch  value
    2012-05-01 00:00:00.203    OFF      0
    2012-05-01 00:00:11.203    OFF      0
    2012-05-01 00:00:22.203     ON      1
    2012-05-01 00:00:33.203     ON      1
    2012-05-01 00:00:44.203    OFF      0
    2012-05-02 00:00:00.203    OFF      0
    2012-05-02 00:00:11.203    OFF      0
    2012-05-02 00:00:22.203    OFF      0
    2012-05-02 00:00:33.203     ON      1
    2012-05-02 00:00:44.203     ON      1
    2012-05-02 00:00:55.203    OFF      0
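
The answer above stops at loading the data, so here is a minimal sketch of the daily downsampling step the question asks for. It assumes Python 3 and a recent pandas, which differ from the Python 2 code above. The names RAW, load_frame, and daily are made up for illustration, and the sample rows are embedded as a string instead of being read from a file. It swaps read_fwf for read_csv with a whitespace separator, then builds the index with to_datetime so that each index entry is a single timestamp rather than a list.

```python
import io

import pandas as pd

# Sample rows copied from the question, including the 'TEST' separator lines.
RAW = """\
TEST
2012-05-01 00:00:00.203 OFF 0
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00:33.203 ON 1
2012-05-01 00:00:44.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
2012-05-02 00:00:11.203 OFF 0
2012-05-02 00:00:22.203 OFF 0
2012-05-02 00:00:33.203 ON 1
2012-05-02 00:00:44.203 ON 1
2012-05-02 00:00:55.203 OFF 0
"""


def load_frame(text):
    """Parse the log text into a DataFrame indexed by timestamp (hypothetical helper)."""
    # Drop the 'TEST' separator lines so pandas only sees data rows.
    clean = "".join(
        line
        for line in text.splitlines(keepends=True)
        if not line.startswith("TEST")
    )
    # Split on whitespace: date, time, switch state, numeric value.
    df = pd.read_csv(
        io.StringIO(clean),
        sep=r"\s+",
        header=None,
        names=["date", "time", "switch", "value"],
    )
    # Combine the date and time strings and let pandas parse them.
    stamps = pd.to_datetime(df["date"] + " " + df["time"])
    df.index = pd.DatetimeIndex(stamps)
    return df[["switch", "value"]]


df = load_frame(RAW)

# Downsample the numeric column to one row per calendar day.
daily = df["value"].resample("D").agg(["mean", "min", "max"])
print(daily)
```

For the sample rows, the first day has two ON readings out of five, so its mean is 0.4, with a minimum of 0 and a maximum of 1. To read from a file instead, the contents of testdata.dat could be passed to load_frame in place of RAW.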