Search code examples
pythonpandasdataframealgorithmiterator

How to get average value of a sequence using iterator in Pandas?


I am currently using pandas to manage a CSV file that contains data on vessel locations on a map. I have a Pandas data frame that looks like this (which is simplified):

Index Group C_p
1 1 27
2 1 85
3 1 83
4 1 78
5 1 66
6 1 47
7 3 82
8 3 80
9 3 66

C_p is a confidence coefficient to judge if a vessel is parked.

The way to judge a parked vessel is like this:

If the average value of a sequence of data's C_p is above 80, this sequence of data is a parked vessel, and the minimum number of a sequence is above 3.

So in this example, index 2 to 4 is a sequence of parked vessels because the average value of C_p is 82.67 which is above 80. Even though index 7 to 8, the average value of C_p is above 80, the sequence just has two elements so this is not a parked vessel.

So I want to get this:

Index Group C_p is_parked
1 1 27 0
2 1 85 1
3 1 83 1
4 1 78 1
5 1 66 0
6 1 47 0
7 3 82 0
8 3 80 0
9 3 66 0

I attempt to change this method into a Leetcode question like:

for i in range(n - 2): # n is the element number of a group
    avg = (C_p[i] + C_p[i+1] + C_p[i+2]) / 3
    if avg >= 80:
        is_parked[i] = is_parked[i+1] = is_parked[i+2] = 1 # default value of is_parked is 0

But I have no idea how to Implement it with Pandas or Python.

So my question is:

  1. Is there any way to implement this in Pandas?

  2. If not, should I turn to Numpy or something else to get the results?


Edit: 1 Nick's comment allows me to double-check my example and my method.

If C_p[5] = 78, it's still a parked vessel and it's computed on a per Group.

I hadn't considered that so my first attempt was wrong. Here's my new attempt

for i in range(n):
    thisSUM = C_p[i]
    for j in range(i+1, n):
        thisSUM += C_p[j]
        avg = thisSUM / (j - i + 1)
        if avg >= 80 and j - i + 1 >= 3:
            for k in range(i, j+1):
                is_parked_check[k] = 1

Time complexity is O(n^3), which is not good.


Solution

  • You can use a double rolling in a groupby.transform. The first one is to compute the rolling average, we then check if the value is above the threshold. Then we reverse the Series and compute a rolling max to propagate the True/1 to the previous N-1 rows:

    N = 3
    threshold = 80
    
    df['is_parked'] = (df.groupby('Group')['C_p']
                         .transform(
                            lambda s: s.rolling(N).mean().gt(threshold)[::-1]
                                       .rolling(N, min_periods=1).max()
                                       .astype(int)
                         )
                       )
    

    Output:

       Index  Group  C_p  is_parked
    0      1      1   27          0
    1      2      1   85          1
    2      3      1   83          1
    3      4      1   78          1
    4      5      1   66          0
    5      6      1   47          0
    6      7      3   82          0
    7      8      3   80          0
    8      9      3   66          0
    

    Intermediates:

       Index  Group  C_p       formula        avg    >80  rev_roll
    0      1      1   27           NaN        NaN  False         0
    1      2      1   85           NaN        NaN  False         1
    2      3      1   83  (83+85+27)/3  65.000000  False         1
    3      4      1   78  (78+83+85)/3  82.000000   True         1
    4      5      1   66  (66+78+83)/3  75.666667  False         0
    5      6      1   47  (47+66+78)/3  63.666667  False         0
    6      7      3   82           NaN        NaN  False         0
    7      8      3   80           NaN        NaN  False         0
    8      9      3   66  (66+80+82)/3  76.000000  False         0