Following is the sample dataframe:
df = pd.DataFrame({
'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
'binary': [0,0,0,0,1,0,0,0,0,1,1,1,1,1,0],
'nonbinary': np.random.randint(1, 10, 15)
})
My task is:
To find the presence of 1s
in column 'binary' (either a single occurrence or in a consecutive manner) and change the rows (that have 0s
) that are two minutes back (previous) with 1s
. For example, the resultant 'binary' column should be:
[0,0,0,0,1,0,0,0,0,1,1,1,1,1,0] --> Original
[0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0] --> New
My following snippet works fine and gives the expected result.
import pandas as pd
import numpy as np
# create the DataFrame
df = pd.DataFrame({
'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
'binary': [0,0,1,0,1,0,0,0,0,1,1,1,1,1,0],
'nonbinary': np.random.randint(1, 10, 15)
})
# create a copy of the binary column to store the updated values
updated_binary = df['binary'].copy()
# iterate through the rows of the DataFrame
for i in range(len(df)):
# check if the current row has a binary value of 1
if df.loc[i, 'binary'] == 1:
# update the corresponding rows in binary to 1
updated_binary[max(0, i-2):i+1] = 1
# update the binary column with the updated values
df['binary'] = updated_binary
# print the resulting DataFrame
print(df)
The problem is, my actual dataframe
consists of thousands of rows. I want to make my computation faster by using vectorization. Following is my attempt to solve it using vectorization:
import pandas as pd
import numpy as np
# create the DataFrame
df = pd.DataFrame({
'time': pd.date_range(start='2023-03-15 00:00:00', periods=15, freq='1min'),
'binary': [0,0,0,0,1,0,0,0,0,1,1,1,1,1,0],
'nonbinary': np.random.randint(1, 10, 15)
})
# create a copy of the binary column to store the updated values
updated_binary = df['binary'].copy()
# create a boolean array to identify the positions of 1s in binary column
Ones_positions = df['binary'].eq(1).values
# create a sliding window of size 3 to find the positions of 0s two minutes before the 1s
window = np.zeros(3, dtype=bool)
window[:2] = True
zero_positions = np.logical_and.reduce((
np.roll(Ones_positions, 2),
np.logical_not(Ones_positions),
np.convolve(window, np.ones(3, dtype=bool), mode='valid') == 2
), axis=0)
# update the corresponding positions in the updated_binary to 1
updated_binary[zero_positions] = 1
# update the binary column with the updated values
df['binary'] = updated_binary
# print the resulting DataFrame
print(df)
But I get the following error:
zero_positions = np.logical_and.reduce((
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Use a reversed rolling.max
, with 2+1 minutes as period:
N = 2
df['new'] = (df[::-1].rolling(f'{N+1}min', on='time',
min_periods=1)
['binary'].max()
)
NB. we need to add +1 to account for the current row.
Output:
time binary nonbinary new
0 2023-03-15 00:00:00 0 2 0.0
1 2023-03-15 00:01:00 0 2 0.0
2 2023-03-15 00:02:00 0 5 1.0
3 2023-03-15 00:03:00 0 7 1.0
4 2023-03-15 00:04:00 1 4 1.0
5 2023-03-15 00:05:00 0 6 0.0
6 2023-03-15 00:06:00 0 9 0.0
7 2023-03-15 00:07:00 0 8 1.0
8 2023-03-15 00:08:00 0 4 1.0
9 2023-03-15 00:09:00 1 3 1.0
10 2023-03-15 00:10:00 1 1 1.0
11 2023-03-15 00:11:00 1 4 1.0
12 2023-03-15 00:12:00 1 1 1.0
13 2023-03-15 00:13:00 1 5 1.0
14 2023-03-15 00:14:00 0 5 0.0