I am currently using pandas to manage a CSV file that contains data on vessel locations on a map. I have a Pandas data frame that looks like this (which is simplified):
Index | Group | C_p |
---|---|---|
1 | 1 | 27 |
2 | 1 | 85 |
3 | 1 | 83 |
4 | 1 | 78 |
5 | 1 | 66 |
6 | 1 | 47 |
7 | 3 | 82 |
8 | 3 | 80 |
9 | 3 | 66 |
C_p is a confidence coefficient to judge if a vessel is parked.
The way to judge a parked vessel is like this:
If the average value of a sequence of data's C_p is above 80
, this sequence of data is a parked vessel, and the minimum number of a sequence is above 3
.
So in this example, index 2 to 4 is a sequence of parked vessels because the average value of C_p is 82.67 which is above 80. Even though index 7 to 8, the average value of C_p is above 80, the sequence just has two elements so this is not a parked vessel.
So I want to get this:
Index | Group | C_p | is_parked |
---|---|---|---|
1 | 1 | 27 | 0 |
2 | 1 | 85 | 1 |
3 | 1 | 83 | 1 |
4 | 1 | 78 | 1 |
5 | 1 | 66 | 0 |
6 | 1 | 47 | 0 |
7 | 3 | 82 | 0 |
8 | 3 | 80 | 0 |
9 | 3 | 66 | 0 |
I attempt to change this method into a Leetcode question like:
for i in range(n - 2): # n is the element number of a group
avg = (C_p[i] + C_p[i+1] + C_p[i+2]) / 3
if avg >= 80:
is_parked[i] = is_parked[i+1] = is_parked[i+2] = 1 # default value of is_parked is 0
But I have no idea how to Implement it with Pandas or Python.
So my question is:
Is there any way to implement this in Pandas?
If not, should I turn to Numpy or something else to get the results?
Edit: 1 Nick's comment allows me to double-check my example and my method.
If C_p[5] = 78, it's still a parked vessel and it's computed on a per Group.
I hadn't considered that so my first attempt was wrong. Here's my new attempt
for i in range(n):
thisSUM = C_p[i]
for j in range(i+1, n):
thisSUM += C_p[j]
avg = thisSUM / (j - i + 1)
if avg >= 80 and j - i + 1 >= 3:
for k in range(i, j+1):
is_parked_check[k] = 1
Time complexity is O(n^3), which is not good.
You can use a double rolling
in a groupby.transform
. The first one is to compute the rolling average, we then check if the value is above the threshold. Then we reverse the Series and compute a rolling max to propagate the True/1 to the previous N-1 rows:
N = 3
threshold = 80
df['is_parked'] = (df.groupby('Group')['C_p']
.transform(
lambda s: s.rolling(N).mean().gt(threshold)[::-1]
.rolling(N, min_periods=1).max()
.astype(int)
)
)
Output:
Index Group C_p is_parked
0 1 1 27 0
1 2 1 85 1
2 3 1 83 1
3 4 1 78 1
4 5 1 66 0
5 6 1 47 0
6 7 3 82 0
7 8 3 80 0
8 9 3 66 0
Intermediates:
Index Group C_p formula avg >80 rev_roll
0 1 1 27 NaN NaN False 0
1 2 1 85 NaN NaN False 1
2 3 1 83 (83+85+27)/3 65.000000 False 1
3 4 1 78 (78+83+85)/3 82.000000 True 1
4 5 1 66 (66+78+83)/3 75.666667 False 0
5 6 1 47 (47+66+78)/3 63.666667 False 0
6 7 3 82 NaN NaN False 0
7 8 3 80 NaN NaN False 0
8 9 3 66 (66+80+82)/3 76.000000 False 0