python, python-3.x, pandas, aggregate

Aggregate measurements in an efficient manner


I'm trying to add together measurements from a data set. I have a measurement for every minute, and I need to find the sum for every hour over a whole year.

This is what I have at the moment. It works, but it is slow. There might be other problems with it, but this is what made sense to me.

    def aggregate_measurements(tvec, data, period):
        tvec_a = []
        data_a = []

        if period == 'hour':
            # Step through the rows in blocks of 60 minutes
            for i in range(0, len(tvec), 60):
                timecomp = tvec.iloc[i:i+60]
                datacomp = data.iloc[i:i+60]
                # Keep the time components of the first row in the block
                tvec_a.append(timecomp.iloc[0]['year':'second'])
                data_summeret = datacomp.sum()
                data_a.append(data_summeret)
        return tvec_a, data_a
        

Is there a better way to do this?


Solution

  • Use vectorized operations whenever possible, for example groupby:

    import pandas as pd
    
    # Assumes df['tvec'] is a datetime column; if not, convert it first with pd.to_datetime
    df['hour'] = df['tvec'].dt.floor('H')  # Timestamp rounded down to the start of its hour
    
    hourly_data = df.groupby('hour')['data'].sum().reset_index()
    

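    dt.floor('H') rounds each timestamp down to the start of its hour, so all rows from the same hour share one group key. Recent pandas versions (2.2 and later) deprecate the 'H' alias in favor of lowercase 'h'.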
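
    The question's code slices tvec from 'year' to 'second', which suggests tvec is a DataFrame of time components, not a datetime column. Below is a minimal sketch of the same groupby idea under that assumption. The column names (year, month, day, hour, minute, second) and the function name aggregate_hourly are assumptions made for illustration, not taken from the original data.

```python
import pandas as pd


def aggregate_hourly(tvec, data):
    """Sum minute-level measurements into hourly totals.

    Assumes tvec is a DataFrame with the columns year, month, day,
    hour, minute and second, and that data has one row per row of tvec.
    """
    # pd.to_datetime can assemble timestamps from year/month/day/... columns
    timestamps = pd.to_datetime(
        tvec[["year", "month", "day", "hour", "minute", "second"]]
    )
    # Floor to the start of each hour; to_numpy() makes the grouping
    # positional, so the indexes of tvec and data do not have to match
    hour_keys = timestamps.dt.floor("h").to_numpy()
    hourly = data.groupby(hour_keys).sum()
    hourly.index.name = "hour"
    return hourly
```

    Unlike stepping through the rows in fixed blocks of 60, this groups by the actual timestamp, so an hour with missing minutes is still summed under the right hour.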