I have a dataframe like the one below:
import pandas as pd

data = [
[123456, "2017", 150.235],
[123456, "2017", 160],
[123456, "2017", 135],
[123456, "2017", 135],
[123456, "2017", 135],
[123456, "2018", 202.5],
[123456, "2019", 168.526],
[123456, "2020", 175.559],
[123456, "2020", 176],
[123456, "2021", 206.667],
[789101, "2017", 228.9],
[789101, "2018", 208],
[789101, "2018", 208],
[789101, "2018", 208],
]
df = pd.DataFrame(data, columns=["ID", "year", "value"])
df
In this dataframe I have an ID column, and each ID spans two or more years. Each year can appear in one or more rows, i.e. a year can have multiple value entries.
I would like to filter this dataframe so that, for each ID, it keeps all rows for the earliest year (even if there are duplicate values) and all rows for the latest year (again, keeping duplicates).
My desired output is:
        ID  year    value
0   123456  2017  150.235
1   123456  2017  160.000
2   123456  2017  135.000
3   123456  2017  135.000
4   123456  2017  135.000
9   123456  2021  206.667
10  789101  2017  228.900
11  789101  2018  208.000
12  789101  2018  208.000
13  789101  2018  208.000
I found another SO question that was similar:
g = df.groupby("ID")
(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))
but it only keeps the first row of the earliest year (and the last row of the latest year), whereas I want all of the rows for those years.
Can anyone please advise? Thank you!
Try:
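# For each ID, keep only the rows whose year equals that ID's min or max year.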
out = df.groupby("ID", group_keys=False).apply(
lambda x: x[(x.year == x.year.min()) | (x.year == x.year.max())]
)
print(out)
Prints:
ID year value
0 123456 2017 150.235
1 123456 2017 160.000
2 123456 2017 135.000
3 123456 2017 135.000
4 123456 2017 135.000
9 123456 2021 206.667
10 789101 2017 228.900
11 789101 2018 208.000
12 789101 2018 208.000
13 789101 2018 208.000
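If you'd rather avoid groupby.apply (which warns about operating on the grouping columns in recent pandas releases), the same filter can be written with transform. A minimal sketch, assuming year stays as four-digit strings so min/max compare correctly:

# Broadcast each ID's earliest and latest year back onto its rows, then mask.
yr_min = df.groupby("ID")["year"].transform("min")
yr_max = df.groupby("ID")["year"].transform("max")
out = df[df["year"].eq(yr_min) | df["year"].eq(yr_max)]
print(out)

This produces the same rows and keeps the original index; add .reset_index(drop=True) if you want it renumbered.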