Search code examples
pythonpandasdataframevectorization

Vectoring the FOR loop while computing rows of Dataframe


Following is the sample dataframe:

df = pd.DataFrame({
    'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
    'binary': [0,0,0,0,1,0,0,0,0,1,1,1,1,1,0],
    'nonbinary': np.random.randint(1, 10, 15)
})

My task is:

To find the presence of 1s in column 'binary' (either a single occurrence or in a consecutive manner) and change the rows (that have 0s) that are two minutes back (previous) with 1s. For example, the resultant 'binary' column should be:

    [0,0,0,0,1,0,0,0,0,1,1,1,1,1,0] --> Original

  [0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0] --> New

My following snippet works fine and gives the expected result.

import pandas as pd
import numpy as np

# create the DataFrame
df = pd.DataFrame({
    'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
    'binary': [0,0,1,0,1,0,0,0,0,1,1,1,1,1,0],
    'nonbinary': np.random.randint(1, 10, 15)
})

# create a copy of the binary column to store the updated values
updated_binary = df['binary'].copy()

# iterate through the rows of the DataFrame
for i in range(len(df)):

    # check if the current row has a binary value of 1
    if df.loc[i, 'binary'] == 1:

        # update the corresponding rows in binary to 1
        updated_binary[max(0, i-2):i+1] = 1

# update the binary column with the updated values
df['binary'] = updated_binary

# print the resulting DataFrame
print(df)

The problem is, my actual dataframe consists of thousands of rows. I want to make my computation faster by using vectorization. Following is my attempt to solve it using vectorization:

import pandas as pd
import numpy as np

# create the DataFrame
df = pd.DataFrame({
    'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
    'binary': [0,0,0,0,1,0,0,0,0,1,1,1,1,1,0],
    'nonbinary': np.random.randint(1, 10, 15)
})

# create a copy of the binary column to store the updated values
updated_binary = df['binary'].copy()

# create a boolean array to identify the positions of 1s in binary column
Ones_positions = df['binary'].eq(1).values

# create a sliding window of size 3 to find the positions of 0s two minutes before the 1s
window = np.zeros(3, dtype=bool)
window[:2] = True

zero_positions = np.logical_and.reduce((
    np.roll(Ones_positions, 2), 
    np.logical_not(Ones_positions),
    np.convolve(window, np.ones(3, dtype=bool), mode='valid') == 2
), axis=0)

# update the corresponding positions in the updated_binary to 1
updated_binary[zero_positions] = 1

# update the binary column with the updated values
df['binary'] = updated_binary

# print the resulting DataFrame
print(df)

But I get the following error:

zero_positions = np.logical_and.reduce((
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Solution

  • Use a reversed rolling.max, with 2+1 minutes as period:

    N = 2
    df['new'] = (df[::-1].rolling(f'{N+1}min', on='time',
                                  min_periods=1)
                 ['binary'].max()
                )
    

    NB. we need to add +1 to account for the current row.

    Output:

                      time  binary  nonbinary  new
    0  2023-03-15 00:00:00       0          2  0.0
    1  2023-03-15 00:01:00       0          2  0.0
    2  2023-03-15 00:02:00       0          5  1.0
    3  2023-03-15 00:03:00       0          7  1.0
    4  2023-03-15 00:04:00       1          4  1.0
    5  2023-03-15 00:05:00       0          6  0.0
    6  2023-03-15 00:06:00       0          9  0.0
    7  2023-03-15 00:07:00       0          8  1.0
    8  2023-03-15 00:08:00       0          4  1.0
    9  2023-03-15 00:09:00       1          3  1.0
    10 2023-03-15 00:10:00       1          1  1.0
    11 2023-03-15 00:11:00       1          4  1.0
    12 2023-03-15 00:12:00       1          1  1.0
    13 2023-03-15 00:13:00       1          5  1.0
    14 2023-03-15 00:14:00       0          5  0.0