python for-loop vectorization

Vectorize for-loop which combines two different dataframes


I have a two-column dataframe that contains the input data. The first column is a start date and the second is a duration in minutes. You can think of a machine that runs from start to start+duration. I want to use this information to build a 1-D array of length 8760*60 that covers every minute of the year, with a 1 wherever the machine is running and a 0 otherwise. The following MWE does the job, but it is slow because of the for-loop, and I don't know how to vectorize it.

import pandas as pd
import numpy as np

# Start and end of time horizon
start = pd.Timestamp(year=2019, month=1, day=1, hour=0, tz='UTC')
end = pd.Timestamp(year=2019, month=12, day=31, hour=23, minute=59, tz='UTC')

# DataFrame of time horizon
dates = pd.DataFrame(pd.date_range(start, end, freq='min'))

# Starting points
t1 = pd.Timestamp(year=2019, month=1, day=2, hour=0, tz='UTC')
t2 = pd.Timestamp(year=2019, month=1, day=1, hour=0, minute=3, tz='UTC')

# Durations
d1 = 5
d2 = 30

# DataFrame from input data
data = pd.DataFrame(
    data=[
        [t1, d1],
        [t2, d2],
        ],
    columns=[
        'start',
        'duration',
        ]
    )

# Array to be filled
on = np.zeros(8760*60)

# Loop over data rows
for idx in data.index:
    # Position of this row's start time in dates
    # (named start_idx so the horizon's `start` timestamp is not overwritten)
    start_idx = dates[dates[0] == data.loc[idx, 'start']].index[0]

    # Duration from data
    duration = data.loc[idx, 'duration']

    # Put 1s in the on array from start_idx to start_idx+duration
    on[start_idx: start_idx+duration] = 1

Solution

  • Does this work for you:

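    # Minute-resolution index covering all of 2019 (8760*60 entries)
    idx = pd.date_range(pd.Timestamp('2019-01-01 00:00', tz='UTC'),
                        pd.Timestamp('2019-12-31 23:59', tz='UTC'),
                        freq='1min')
    df = pd.DataFrame({'on': 0}, index=idx)

    def to_mins(row):
        # All minutes during which the machine of this row is running
        return set(pd.date_range(row['start'], periods=row['duration'], freq='1min'))

    idx_on = set().union(*data[['start', 'duration']].apply(to_mins, axis='columns'))
    # Newer pandas versions reject a set as a .loc indexer, so pass a list
    df.loc[list(idx_on), 'on'] = 1
    on = df.on.values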
    

    If a duration can lead to timestamps outside of 2019, you could pass the following function to apply in place of to_mins:

    def to_min_range(row):
        return set(
            m 
            for m in pd.date_range(row['start'], periods=row['duration'], freq='1min')
            if m.year == 2019
        )
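
Both variants above still build a date_range per row inside apply, so the work per row stays in Python. A loop-free alternative, which is not part of the original answer, is a difference array: add +1 at each start minute, -1 at each end minute, and take a cumulative sum. The sketch below assumes the `start` column is tz-aware UTC and aligned to whole minutes, as in the question; the names `horizon_start`, `n_minutes`, `first`, `last` and `delta` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Time horizon: every minute of 2019 (365 days * 24 h * 60 min)
horizon_start = pd.Timestamp('2019-01-01', tz='UTC')
n_minutes = 8760 * 60

# Same input data as in the question
data = pd.DataFrame({
    'start': [pd.Timestamp('2019-01-02 00:00', tz='UTC'),
              pd.Timestamp('2019-01-01 00:03', tz='UTC')],
    'duration': [5, 30],
})

# Minute offset of each run's first minute and of the minute after its last
first = ((data['start'] - horizon_start) // pd.Timedelta(minutes=1)).to_numpy()
last = first + data['duration'].to_numpy()

# Clip so that runs reaching outside the horizon are truncated, not errors
first = np.clip(first, 0, n_minutes)
last = np.clip(last, 0, n_minutes)

# Difference array: +1 where a run begins, -1 where it ends.
# np.add.at accumulates correctly even if several runs share an offset.
delta = np.zeros(n_minutes + 1, dtype=np.int64)
np.add.at(delta, first, 1)
np.add.at(delta, last, -1)

# The running sum counts how many runs are active in each minute
on = (np.cumsum(delta[:-1]) > 0).astype(float)
```

Overlapping runs are handled as well, because the cumulative sum counts active runs and the comparison with zero collapses that count to 0/1.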