More efficient iteration method

I'm looking for a better way of doing the following. The code below works, but it's very slow because I'm working with a large dataset. I also tried to use itertools, but I couldn't make it work. So here is my very unpythonic starting point.

Helper function:

def signalbin(x, y):
    if x > y:
        return 1
    else:
        return -1

Test Data:

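import numpy as np
import pandas as pd

n = 10  # not defined in the original snippet; 10 matches the 10-row output shown in the answer
np.random.seed(0)
df = pd.DataFrame(
    {
        'a': np.random.normal(0, 2.5, n),
        'b': np.random.normal(0, 2.5, n),
    }
)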

My Current code:

df["signal"] = [signalbin(x, y) for x, y in zip(df["a"], df["b"])]
df["signal2"] = df["signal"]
for i, row in df.iterrows():
    if i == 0:
        continue

    if (row['signal2'] != df.at[i-1, "signal"]):
        df.at[i, "signal2"] = df.at[i-1, "signal2"]

In this case, the column signal2 is the desired result.

So I'm looking for a more efficient iteration logic that allows putting conditions on multiple columns and rows.

Solution

The first part will depend on your real function; it might not be easy to improve it.
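As a rough illustration of that first part: if the real function turns out to be a set of elementwise conditions on columns, np.select can sometimes express it without a Python-level loop. The three-way rule and the 0.5 threshold below are hypothetical, made up only for this sketch, and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Small hand-made frame, only for illustration
df = pd.DataFrame({"a": [3.0, 0.2, -1.5, 0.9], "b": [1.0, 0.1, 0.5, 1.0]})

# Hypothetical rule (an assumption, not the asker's real function):
#   1 if a exceeds b by more than 0.5,
#  -1 if a is below b by more than 0.5,
#   0 otherwise
conditions = [
    df["a"] > df["b"] + 0.5,
    df["a"] < df["b"] - 0.5,
]
choices = [1, -1]

# np.select picks, per row, the choice of the first condition that is True
df["signal"] = np.select(conditions, choices, default=0)
```

If the real function cannot be written as such conditions, a list comprehension like the one in the question may remain the practical option for this step.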

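The second part can be vectorized with shift, mask, and ffill: shift gives each row the previous row's signal, mask blanks out the rows where the signal differs from the previous one, and ffill carries the last kept value forward into those blanks.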

# vectorization of the dummy example
# this might not be possible with a more complex function
df['signal'] = np.where(df['a']>df['b'], 1, -1)

# get previous row
prev = df['signal'].shift(fill_value=df['signal'].iloc[0])

# identify changing values, mask them, then forward-fill
df['signal2'] = (df['signal'].mask(df['signal'].ne(prev)).ffill()
                 .astype(df['signal'].dtype) # optional
                )

Output:

          a         b  signal  signal2
0  4.410131  0.360109       1        1
1  1.000393  3.635684      -1        1
2  2.446845  1.902594       1        1
3  5.602233  0.304188       1        1
4  4.668895  1.109658       1        1
5 -2.443195  0.834186      -1        1
6  2.375221  3.735198      -1       -1
7 -0.378393 -0.512896       1       -1
8 -0.258047  0.782669      -1       -1
9  1.026496 -2.135239       1       -1
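
To gain some confidence that the vectorized version reproduces the loop, the two can be compared on a larger random sample. This is only a sketch: the function names signal2_loop and signal2_vectorized are made up here, the loop is a simplified restatement of the rule from the question (keep the value if it equals the previous signal, otherwise take the previous signal2), and the sample size of 1000 is arbitrary.

```python
import numpy as np
import pandas as pd


def signal2_loop(signal: pd.Series) -> pd.Series:
    """Row-by-row reference: same rule as the iterrows loop in the question."""
    values = signal.to_numpy()
    out = values.copy()
    for i in range(1, len(values)):
        # signal changed relative to the previous row: carry previous result
        if values[i] != values[i - 1]:
            out[i] = out[i - 1]
    return pd.Series(out, index=signal.index)


def signal2_vectorized(signal: pd.Series) -> pd.Series:
    """shift / mask / ffill version from the answer (assumes a non-empty Series)."""
    prev = signal.shift(fill_value=signal.iloc[0])
    return signal.mask(signal.ne(prev)).ffill().astype(signal.dtype)


# Arbitrary random test data, not the data from the question
rng = np.random.default_rng(0)
a = rng.normal(0, 2.5, 1000)
b = rng.normal(0, 2.5, 1000)
signal = pd.Series(np.where(a > b, 1, -1))

loop_result = signal2_loop(signal)
vec_result = signal2_vectorized(signal)

# Both approaches should agree element by element
assert loop_result.equals(vec_result)
```

The same pair of functions can be wrapped in a timing call (for example with timeit) on the real dataset to see how much the vectorized version helps in practice.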