Search code examples
pythonpandasapplypandas-loc

Pandas Apply and Loc - efficiency and indexing


I want to find the first value after each row that meets a certain criteria. So for example I want to find the first rate/value (not necessarily the first row after) after the current row that increased 5%. The added column would be the last 'first5percentIncrease' and would be the index (and/or value) of the first row (after current row) that had a 5% increase. Notice how each could not be lower than the current row's index.

          amount    date    rate    total   type first5percentIncreaseValue first5percentIncreaseIndex
9248    0.05745868  2018-01-22 06:11:36 10  0.00099984  buy 10.5 9341
9249    1.14869147  2018-01-22 06:08:38 20  0.01998989  buy 21 9421
9250    0.16498080  2018-01-22 06:02:59 15  0.00286241  sell 15.75 9266
9251    0.02881844  2018-01-22 06:01:54 2   0.00049999  sell 2.1 10911

I tried using loc to apply() this to each row. The output takes at least 10 seconds for only about 9k rows. This does the job (I get a list of all values 5% higher than the given row) but is there a more efficient way to do this? Also I'd like to get only the first value but when I take do this I think it's starting from the first row. Is there a way to start .locs search from the current row so then I could just take the first value?

coin_trade_history_df['rate'].apply(
    lambda y: coin_trade_history_df['rate'].loc[coin_trade_history_df['rate'].apply(
        lambda x: y  >= x + (x*.005))])

0    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
1    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
2    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
3    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
4    [0.01387146, 0.01387146, 0.01387148, 0.0138714...
Name: rate, dtype: object

Further clarification Peter Leimbigler said it better than me:

Oh, I think I get it now! "For each row, scan downward and get the first row you encounter that shows an increase of at least 5%," right? I'll edit my answer :) – Peter Leimbigler


Solution

  • Here's an approach to the specific example of labeling each row with the index of the next available row that shows an increase of at least 5%.

    # Example data
    df = pd.DataFrame({'rate': [100, 105, 99, 110, 130, 120, 98]})
    
    # Series.shift(n) moves elements n places forward = down. We use
    # it here in the denominator in order to compare each change with 
    # the initial value, rather than the final value.
    
    mask = df.rate.diff()/df.rate.shift() >= 0.05
    
    df.loc[mask, 'next_big_change_idx'] = df[mask].index
    df.next_big_change_idx = df.next_big_change_idx.bfill().shift(-1)
    
    # output
    df
       rate  next_big_change_idx
    0   100                  1.0
    1   105                  3.0
    2    99                  3.0
    3   110                  4.0
    4   130                  NaN
    5   120                  NaN
    6    98                  NaN