
How to filter pandas dataframe so that the first and last rows within a group are retained?


I have a dataframe like below:

import pandas as pd

data = [
    [123456, "2017", 150.235],
    [123456, "2017", 160],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2017", 135],
    [123456, "2018", 202.5],
    [123456, "2019", 168.526],
    [123456, "2020", 175.559],
    [123456, "2020", 176],
    [123456, "2021", 206.667],
    [789101, "2017", 228.9],
    [789101, "2018", 208],
    [789101, "2018", 208],
    [789101, "2018", 208],
]

df = pd.DataFrame(
    data,
    columns=[
        "ID",
        "year",
        "value",
    ],
)
df

In this dataframe I have an ID column, and each ID spans 2+ years. Each year can in turn have 1 or more value rows.

I would like to filter this dataframe so that, for each ID, all of the earliest-year rows are kept (even if there are duplicate values) and all of the latest-year rows are kept (again, even if there are duplicate values I want them).

My desired output is:

        ID  year    value
0   123456  2017  150.235
1   123456  2017  160.000
2   123456  2017  135.000
3   123456  2017  135.000
4   123456  2017  135.000
9   123456  2021  206.667
10  789101  2017  228.900
11  789101  2018  208.000
12  789101  2018  208.000
13  789101  2018  208.000

I found another SO question that was similar:

g = df.groupby("ID")

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

but it only keeps the first row of the first year and the last row of the last year, and I want all of the rows from those years.
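For reference, on the sample frame above that snippet evaluates to just one row at each end of every group, which is why it is not enough here:

       ID  year    value
0  123456  2017  150.235
1  123456  2021  206.667
2  789101  2017  228.900
3  789101  2018  208.000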

Can anyone please advise?!

Thank you!!


Solution

  • Try:

    # within each ID group, keep every row from the earliest and the latest year
    out = df.groupby("ID", group_keys=False).apply(
        lambda x: x[(x.year == x.year.min()) | (x.year == x.year.max())]
    )
    print(out)
    

    Prints:

            ID  year    value
    0   123456  2017  150.235
    1   123456  2017  160.000
    2   123456  2017  135.000
    3   123456  2017  135.000
    4   123456  2017  135.000
    9   123456  2021  206.667
    10  789101  2017  228.900
    11  789101  2018  208.000
    12  789101  2018  208.000
    13  789101  2018  208.000
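
    As a side note, the same filter can be written without apply by using groupby.transform to broadcast each ID's minimum and maximum year back onto the individual rows. This is a minimal alternative sketch, assuming the same df as above; it produces the same ten rows:

    g = df.groupby("ID")["year"]
    # mark rows whose year equals their group's earliest or latest year
    out = df[df["year"].eq(g.transform("min")) | df["year"].eq(g.transform("max"))]
    print(out)

    This version also sidesteps the deprecation warning that newer pandas releases may emit when groupby.apply operates on the grouping column.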