Reposted with clarification.
I am working on a dataframe that looks like the following:
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
| 1 | 5 | 2012 | 111 |
| 1 | 5 | 2012 | 112 |
| 0 | 12 | 2017 | 113 |
| 0 | 12 | 2022 | 114 |
| 1 | 27 | 2005 | 115 |
| 1 | 27 | 2011 | 116 |
+-------+----+------+-----+
Only using rows with "Value" == "1" ("value is boolean), I would like to group the dataframe by ID and input the string "latest" to new (blank) column, giving the following output:
+-------+----+------+------+-------+
| Value | ID | Date | ID 2 |Latest |
+-------+----+------+------+-------+
| 1 | 5 | 2012 | 111 | |
| 1 | 5 | 2012 | 112 | Latest |
| 0 | 12 | 2017 | 113 | |
| 0 | 12 | 2022 | 114 | |
| 1 | 27 | 2005 | 115 | |
| 1 | 27 | 2011 | 116 | Latest |
+-------+----+------+-----+--------+
I am using the following code to find the maximum:
latest = df.query('Value==1').groupby("ID").max("Year").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
But I have since realized some of the max years are the same, i.e. there could be 4 rows, all with max year 2017. For the tiebreaker, I need to use the max ID 2 within groups.
latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
but it is giving me a dataframe completely different than the one desired.
Here's one way a bit similar to your own approach. Basically, groupby
+ last
to get the latest + assign
a variable + merge
:
df = df.merge(df.groupby(['ID', 'Value'])['ID 2'].last().reset_index().assign(Latest=lambda x: np.where(x['Value'], 'Latest', '')), how='outer').fillna('')
or even this works:
df = df.query('Value==1').groupby('ID').last('ID 2').assign(Latest='Latest').merge(df, how='outer').fillna('')
Output:
Value ID Date ID 2 Latest
0 1 5 2012 111
1 1 5 2012 112 Latest
2 0 12 2017 113
3 0 12 2022 114
4 1 27 2005 115
5 1 27 2011 116 Latest