I'm looking for a better way of doing the following. The code below works, but it's very slow because I'm working with a large dataset. I also tried to use itertools, but somehow I couldn't make it work. So here is my very unpythonic starting point.
Helper function:
def signalbin(x, y):
    if x > y:
        return 1
    else:
        return -1
Test Data:
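import numpy as np
import pandas as pd

n = 10  # assumed: matches the 10 rows shown in the output below
np.random.seed(0)
df = pd.DataFrame(
    {
        'a': np.random.normal(0, 2.5, n),
        'b': np.random.normal(0, 2.5, n),
    }
)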
My current code:
df["signal"] = [signalbin(x, y) for x, y in zip(df["a"], df["b"])]
df["signal2"] = df["signal"]
for i, row in df.iterrows():
    if i == 0:
        continue
    if (row['signal2'] != df.at[i-1, "signal"]):
        df.at[i, "signal2"] = df.at[i-1, "signal2"]
In this case, the column signal2 is the desired result.
So I'm looking for a more efficient iteration logic that allows putting conditions on multiple columns and rows.
The first part will depend on your real function; it might not be easy to improve it.
The second part can be vectorized with shift, mask, and ffill.
# vectorization of the dummy example
# this might not be possible with a more complex function
df['signal'] = np.where(df['a'] > df['b'], 1, -1)
# get previous row
prev = df['signal'].shift(fill_value=df['signal'].iloc[0])
# identify changing values, mask, ffill
df['signal2'] = (df['signal'].mask(df['signal'].ne(prev)).ffill()
                 .astype(df['signal'].dtype)  # optional
                )
Output:
          a         b  signal  signal2
0  4.410131  0.360109       1        1
1  1.000393  3.635684      -1        1
2  2.446845  1.902594       1        1
3  5.602233  0.304188       1        1
4  4.668895  1.109658       1        1
5 -2.443195  0.834186      -1        1
6  2.375221  3.735198      -1       -1
7 -0.378393 -0.512896       1       -1
8 -0.258047  0.782669      -1       -1
9  1.026496 -2.135239       1       -1
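If you want to convince yourself that the shift / mask / ffill version gives the same result as the row-by-row loop before running it on the large dataset, a quick check along these lines may help. This is only a sketch: the helper names signal2_loop and signal2_vectorized, the 1,000-row sample size, and the use of np.random.default_rng are my own choices for illustration, and the loop is rewritten over a plain NumPy array following my reading of the rule in the question (when the signal differs from the previous row, keep the previous signal2).

```python
import numpy as np
import pandas as pd


def signal2_loop(signal: pd.Series) -> pd.Series:
    """Reference version (hypothetical helper): same rule as the loop in the question."""
    values = signal.to_numpy()
    out = values.copy()
    for i in range(1, len(values)):
        # when the signal differs from the previous row, carry the previous result forward
        if values[i] != values[i - 1]:
            out[i] = out[i - 1]
    return pd.Series(out, index=signal.index)


def signal2_vectorized(signal: pd.Series) -> pd.Series:
    """shift / mask / ffill version (hypothetical helper wrapping the code above)."""
    prev = signal.shift(fill_value=signal.iloc[0])
    return signal.mask(signal.ne(prev)).ffill().astype(signal.dtype)


# assumed test data: 1,000 random rows, not the 10-row example from the question
rng = np.random.default_rng(0)
a = rng.normal(0, 2.5, 1_000)
b = rng.normal(0, 2.5, 1_000)
signal = pd.Series(np.where(a > b, 1, -1))

# compare values and dtype of both results
same = signal2_loop(signal).equals(signal2_vectorized(signal))
print(same)
```

If the two versions agree, this prints True. The same comparison can be repeated with your real function in place of the np.where line to see whether the vectorized second step still matches.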