python pandas dataframe group-by

Pandas/python min value in a sequence/group, reset by column


I'm coming from the SQL world and I'm stuck. I'm trying to use pandas and resist the urge to go back to the database, where I'd be more comfortable writing this.

For my purposes, just assume there are four columns [FIRST_NAME][LAST_NAME][CAR_ID][DATE]

I want to add a column, let’s call it [MIN_DATE]

BOB ONE 1234 1/1/2020
BOB ONE 1234 2/1/2020
BOB ONE 2345 3/1/2020
BOB ONE 2345 4/1/2020
BOB ONE 1234 1/1/2023
BOB ONE 1234 2/1/2023
BOB TWO 1234 1/1/2022
BOB TWO 1234 2/1/2022
BOB TWO 2345 3/1/2022
BOB TWO 2345 4/1/2022
BOB TWO 1234 3/1/2023
BOB TWO 1234 4/1/2023

I want to add a column holding the first DATE value of each run in the sorted list, resetting whenever CAR_ID changes. I don't know enough about pandas to know whether there is a function that can do this, or whether I should just do it in plain Python.

BOB ONE 1234 1/1/2020 1/1/2020
BOB ONE 1234 2/1/2020 1/1/2020
BOB ONE 2345 3/1/2020 3/1/2020
BOB ONE 2345 4/1/2020 3/1/2020
BOB ONE 1234 1/1/2023 1/1/2023
BOB ONE 1234 2/1/2023 1/1/2023
BOB TWO 1234 1/1/2022 1/1/2022
BOB TWO 1234 2/1/2022 1/1/2022
BOB TWO 2345 3/1/2022 3/1/2022
BOB TWO 2345 4/1/2022 3/1/2022
BOB TWO 1234 3/1/2023 3/1/2023
BOB TWO 1234 4/1/2023 3/1/2023

I can't simply group by name and ID and take the min, because the later run of the same ID would then pick up the earliest date from the earlier run.

BOB ONE 1234 1/1/2020 1/1/2020 <<
BOB ONE 1234 2/1/2020 1/1/2020
BOB ONE 2345 3/1/2020 3/1/2020
BOB ONE 2345 4/1/2020 3/1/2020
BOB ONE 1234 1/1/2023 1/1/2023 << group by would give 1/1/2020 but I need 1/1/2023
BOB ONE 1234 2/1/2023 1/1/2023 <<

I'm struggling to figure out how to approach this with pandas. Ideally I want a function where I can just say: minimum date over name and car ID, resetting the context whenever the name or car ID changes in sort order. I've seen this in other data tools, but I can't figure out whether pandas has anything like it.

Addressing the comment from sundace: a simple group by does not work in your example, because the sequence matters and I need to reset whenever the value changes in the sequence.

df['min_value'] = df.groupby('group').value.transform(np.min)

Extending your example shows that the problem is not quite the same, so your solution does not work here:

A 1 1
A 2 1
A 3 1
B 4 4
B 5 4
B 5 4
A 6 1 < order of the list matters, this should reflect 6 (the first value of the new run), not 1
A 7 1
A 8 1
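
For reference, a minimal reproduction of the table above. The column names `group` and `value` are taken from the comment's snippet, and the string `'min'` is used in place of `np.min`; treat it as a sketch of the problem, not a fix.

```python
import pandas as pd

# Extended example: "A" appears in two separate runs
df = pd.DataFrame({
    "group": list("AAABBBAAA"),
    "value": [1, 2, 3, 4, 5, 5, 6, 7, 8],
})

# A plain groupby pools both runs of "A", so the second run
# inherits the minimum of the first one
df["min_value"] = df.groupby("group")["value"].transform("min")
print(df)
```

The last three rows come out as 1 here, because `groupby` ignores row order and treats every `A` as one group.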

Solution

  • Here's one approach:

    • First turn your DATE strings into datetime using pd.to_datetime.
    • For CAR_ID we want each run of consecutive equal values in its own group. Check for inequality (Series.ne) with the column's shift (Series.shift). The result is True wherever a new run starts, so applying Series.cumsum gives every run its own group number.
    • Now pass the result alongside the other grouping columns inside df.groupby, select DATE and apply transform with min.
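    import pandas as pd
    
    # sample rows from the question
    data = {
        'FIRST_NAME': ['BOB'] * 12,
        'LAST_NAME': ['ONE'] * 6 + ['TWO'] * 6,
        'CAR_ID': [1234, 1234, 2345, 2345, 1234, 1234] * 2,
        'DATE': ['1/1/2020', '2/1/2020', '3/1/2020', '4/1/2020', '1/1/2023', '2/1/2023',
                 '1/1/2022', '2/1/2022', '3/1/2022', '4/1/2022', '3/1/2023', '4/1/2023'],
    }
    df = pd.DataFrame(data)
    
    # turn date strings into datetime 
    # (N.B. your date format is ambiguous! Adjust if it is `%m/%d/%Y`!)
    
    df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
    
    # consecutive values to groups: True marks the start of each run,
    # cumsum turns those marks into a run number
    car_g = df['CAR_ID'].ne(df['CAR_ID'].shift()).cumsum()
    
    # add `car_g` inside groupby, get `min` transformed and assign
    df['MIN_DATE'] = df.groupby(['FIRST_NAME','LAST_NAME',car_g])['DATE'].transform('min')
    
    df
    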
    
       FIRST_NAME LAST_NAME  CAR_ID       DATE   MIN_DATE
    0         BOB       ONE    1234 2020-01-01 2020-01-01
    1         BOB       ONE    1234 2020-01-02 2020-01-01
    2         BOB       ONE    2345 2020-01-03 2020-01-03
    3         BOB       ONE    2345 2020-01-04 2020-01-03
    4         BOB       ONE    1234 2023-01-01 2023-01-01
    5         BOB       ONE    1234 2023-01-02 2023-01-01
    6         BOB       TWO    1234 2022-01-01 2022-01-01
    7         BOB       TWO    1234 2022-01-02 2022-01-01
    8         BOB       TWO    2345 2022-01-03 2022-01-03
    9         BOB       TWO    2345 2022-01-04 2022-01-03
    10        BOB       TWO    1234 2023-01-03 2023-01-03
    11        BOB       TWO    1234 2023-01-04 2023-01-03
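
The same idea can be checked against the smaller A/B example from the question. The helper `min_per_run` and its parameter names are hypothetical, introduced here only for illustration; this is a sketch of the shift/cumsum technique above, not a pandas built-in.

```python
import pandas as pd

def min_per_run(df, run_col, value_col, by=()):
    """Minimum of value_col within each consecutive run of run_col.

    Hypothetical helper: `by` lists extra columns to group on as well.
    """
    # True wherever the value differs from the previous row
    # (the first row compares against NaN, so it is always True)
    starts = df[run_col].ne(df[run_col].shift())
    # Running count of run starts: every consecutive run gets its own number
    run_id = starts.cumsum()
    return df.groupby([*by, run_id])[value_col].transform("min")

small = pd.DataFrame({
    "group": list("AAABBBAAA"),
    "value": [1, 2, 3, 4, 5, 5, 6, 7, 8],
})
small["min_value"] = min_per_run(small, "group", "value")
print(small["min_value"].tolist())  # [1, 1, 1, 4, 4, 4, 6, 6, 6]
```

For the car data, the equivalent call would be `min_per_run(df, 'CAR_ID', 'DATE', by=['FIRST_NAME', 'LAST_NAME'])`. Note that the run detection depends on row order, so the frame needs to be sorted the way you want before calling it.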