I have a set of data where I want to do, If the latest st_1 or st_2 are greater than earlier st_1 or st_2 put True or False respectively in another column. How can I do that on the basis of date and id?
id date st_1 st_2
1 2022-02-28 00:00:00+00:00 60.0 6.0
2 2021-10-31 00:00:00+00:00 70.0 0.0
2 2021-12-31 00:00:00+00:00 70.0 4.0
3 2021-10-31 00:00:00+00:00 60.0 0.0
4 2021-06-30 00:00:00+00:00 63.3 2.66
4 2021-08-31 00:00:00+00:00 60.0 3.0
4 2022-02-28 00:00:00+00:00 70.0 2.0
5 2021-06-30 00:00:00+00:00 70.0 3.0
4 2022-02-28 00:00:00+00:00 70.0 2.0
5 2021-06-30 00:00:00+00:00 70.0 3.0
5 2021-08-31 00:00:00+00:00 80.0 2.0
5 2021-10-31 00:00:00+00:00 70.0 3.5
My expected outcome:
id date st_1 st_2 outcome
1 2022-02-28 00:00:00+00:00 60.0 6.0 false
2 2021-10-31 00:00:00+00:00 70.0 0.0 false
2 2021-12-31 00:00:00+00:00 70.0 4.0 true
3 2021-10-31 00:00:00+00:00 60.0 0.0 false
4 2021-06-30 00:00:00+00:00 63.3 2.66 false
4 2021-08-31 00:00:00+00:00 60.0 3.0 true
4 2022-02-28 00:00:00+00:00 70.0 2.0 true
5 2021-06-30 00:00:00+00:00 70.0 3.0 false
5 2021-08-31 00:00:00+00:00 80.0 2.0 true
5 2021-10-31 00:00:00+00:00 70.0 3.5 true
Update #2: I fixed the sorting to sort first by id and then by date and added column lag_id which is now used to ensure that comparisons are made only within the same id
Update: I just noticed the the spec is "If the latest st_1 or st_2 are greater than earlier st_1 or st_2" which means that the correct answer is to use "|" instead of the original answer's "&". Corrected.
Code:
import io
import pandas as pd
string = """id date st_1 st_2
1 "2022-02-28 00:00:00+00:00" 60.0 6.0
2 "2021-10-31 00:00:00+00:00" 70.0 0.0
2 "2021-12-31 00:00:00+00:00" 70.0 4.0
3 "2021-10-31 00:00:00+00:00" 60.0 0.0
4 "2021-06-30 00:00:00+00:00" 63.3 2.66
4 "2021-08-31 00:00:00+00:00" 60.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
4 "2022-02-28 00:00:00+00:00" 70.0 2.0
5 "2021-06-30 00:00:00+00:00" 70.0 3.0
5 "2021-08-31 00:00:00+00:00" 80.0 2.0
5 "2021-10-31 00:00:00+00:00" 70.0 3.5
"""
data = io.StringIO(string)
df = pd.read_csv(data, sep="\s+") # Load df0 from the data string
df.sort_values(['id', 'date'], inplace=True) # Sort according to the spec
print(df)
df['lag_id'] = df['id'].shift(1) # Lag the id column
df['lag_st_1'] = df['st_1'].shift(1) # Create column lag_st_1 with the st_1 data lagged by 1 row
df['lag_st_2'] = df['st_2'].shift(1) # Ditto for st_2
print(df)
# Create result column with True values where the right conditions are met
df.loc[(df['id'] == df['lag_id'])
& (
(df['st_1'] > df['lag_st_1'])
| (df['st_2'] > df['lag_st_2'])
), 'result'] = True
# The previous operation fills the rest of the rows with NAs.
# Here we change the NAs to "False"
df['result'] = df['result'].fillna(False)
print(df)
Updated output:
id date st_1 st_2
0 1 2022-02-28 00:00:00+00:00 60.0 6.00
1 2 2021-10-31 00:00:00+00:00 70.0 0.00
2 2 2021-12-31 00:00:00+00:00 70.0 4.00
3 3 2021-10-31 00:00:00+00:00 60.0 0.00
4 4 2021-06-30 00:00:00+00:00 63.3 2.66
5 4 2021-08-31 00:00:00+00:00 60.0 3.00
6 4 2022-02-28 00:00:00+00:00 70.0 2.00
8 4 2022-02-28 00:00:00+00:00 70.0 2.00
7 5 2021-06-30 00:00:00+00:00 70.0 3.00
9 5 2021-06-30 00:00:00+00:00 70.0 3.00
10 5 2021-08-31 00:00:00+00:00 80.0 2.00
11 5 2021-10-31 00:00:00+00:00 70.0 3.50
id date st_1 st_2 lag_id lag_st_1 lag_st_2
0 1 2022-02-28 00:00:00+00:00 60.0 6.00 NaN NaN NaN
1 2 2021-10-31 00:00:00+00:00 70.0 0.00 1.0 60.0 6.00
2 2 2021-12-31 00:00:00+00:00 70.0 4.00 2.0 70.0 0.00
3 3 2021-10-31 00:00:00+00:00 60.0 0.00 2.0 70.0 4.00
4 4 2021-06-30 00:00:00+00:00 63.3 2.66 3.0 60.0 0.00
5 4 2021-08-31 00:00:00+00:00 60.0 3.00 4.0 63.3 2.66
6 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 60.0 3.00
8 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 70.0 2.00
7 5 2021-06-30 00:00:00+00:00 70.0 3.00 4.0 70.0 2.00
9 5 2021-06-30 00:00:00+00:00 70.0 3.00 5.0 70.0 3.00
10 5 2021-08-31 00:00:00+00:00 80.0 2.00 5.0 70.0 3.00
11 5 2021-10-31 00:00:00+00:00 70.0 3.50 5.0 80.0 2.00
id date st_1 st_2 lag_id lag_st_1 lag_st_2 result
0 1 2022-02-28 00:00:00+00:00 60.0 6.00 NaN NaN NaN False
1 2 2021-10-31 00:00:00+00:00 70.0 0.00 1.0 60.0 6.00 False
2 2 2021-12-31 00:00:00+00:00 70.0 4.00 2.0 70.0 0.00 True
3 3 2021-10-31 00:00:00+00:00 60.0 0.00 2.0 70.0 4.00 False
4 4 2021-06-30 00:00:00+00:00 63.3 2.66 3.0 60.0 0.00 False
5 4 2021-08-31 00:00:00+00:00 60.0 3.00 4.0 63.3 2.66 True
6 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 60.0 3.00 True
8 4 2022-02-28 00:00:00+00:00 70.0 2.00 4.0 70.0 2.00 False
7 5 2021-06-30 00:00:00+00:00 70.0 3.00 4.0 70.0 2.00 False
9 5 2021-06-30 00:00:00+00:00 70.0 3.00 5.0 70.0 3.00 False
10 5 2021-08-31 00:00:00+00:00 80.0 2.00 5.0 70.0 3.00 True
11 5 2021-10-31 00:00:00+00:00 70.0 3.50 5.0 80.0 2.00 True