I'm coming from the SQL world and trying to use pandas,
resisting the urge to go back to the database
and write this where I am more comfortable.
For my purposes, just assume there are four columns: [FIRST_NAME][LAST_NAME][CAR_ID][DATE].
I want to add a column, let’s call it [MIN_DATE]. The data looks like this:
BOB ONE 1234 1/1/2020
BOB ONE 1234 2/1/2020
BOB ONE 2345 3/1/2020
BOB ONE 2345 4/1/2020
BOB ONE 1234 1/1/2023
BOB ONE 1234 2/1/2023
BOB TWO 1234 1/1/2022
BOB TWO 1234 2/1/2022
BOB TWO 2345 3/1/2022
BOB TWO 2345 4/1/2022
BOB TWO 1234 3/1/2023
BOB TWO 1234 4/1/2023
I want to add a column holding the first DATE value of each consecutive run of rows in the sorted list, resetting whenever CAR_ID changes. I don’t know enough about pandas
to know whether there is a function that can do this, or whether I should just do it in plain Python. The desired result:
BOB ONE 1234 1/1/2020 1/1/2020
BOB ONE 1234 2/1/2020 1/1/2020
BOB ONE 2345 3/1/2020 3/1/2020
BOB ONE 2345 4/1/2020 3/1/2020
BOB ONE 1234 1/1/2023 1/1/2023
BOB ONE 1234 2/1/2023 1/1/2023
BOB TWO 1234 1/1/2022 1/1/2022
BOB TWO 1234 2/1/2022 1/1/2022
BOB TWO 2345 3/1/2022 3/1/2022
BOB TWO 2345 4/1/2022 3/1/2022
BOB TWO 1234 3/1/2023 3/1/2023
BOB TWO 1234 4/1/2023 3/1/2023
I can't simply group by name and ID and take the min, because the later block of rows for the same car would then get the earliest date from the first block:
BOB ONE 1234 1/1/2020 1/1/2020 <<
BOB ONE 1234 2/1/2020 1/1/2020
BOB ONE 2345 3/1/2020 3/1/2020
BOB ONE 2345 4/1/2020 3/1/2020
BOB ONE 1234 1/1/2023 1/1/2023 << group by would give 1/1/2020 but I need 1/1/2023
BOB ONE 1234 2/1/2023 1/1/2023 <<
I'm struggling to figure out how to approach this with pandas. Ideally I want a function where I can just say: minimum date over name and car ID, resetting the context on a change in name or car ID based on the sort order. That is something I've seen in other data tools; I just can't figure out whether there's anything like this in pandas.
Addressing the comment from sundace: a simple group by, as in your example, does not work here, since the sequence matters and I need to reset on a change in value in the sequence.
df['min_value'] = df.groupby('group').value.transform(np.min)
Extending your example, you will see the problem is not quite the same, so your solution does not work for this:
A 1 1
A 2 1
A 3 1
B 4 4
B 5 4
B 5 4
A 6 1 << order of the list matters, this should reflect 6 not 1
A 7 1
A 8 1
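To make the toy example above reproducible, here is a minimal sketch. The column names `group` and `value` are taken from the one-liner in the comment; the DataFrame itself is typed in by hand from the rows listed above.

```python
import pandas as pd

# Toy data from the comment thread, in the order that matters.
df = pd.DataFrame({
    "group": list("AAABBBAAA"),
    "value": [1, 2, 3, 4, 5, 5, 6, 7, 8],
})

# A plain groupby pools every "A" row together, regardless of position,
# so the second run of "A" (6, 7, 8) also gets the minimum of the first run.
df["min_value"] = df.groupby("group")["value"].transform("min")

print(df)
# min_value comes out as 1, 1, 1, 4, 4, 4, 1, 1, 1
```

The last three rows show the issue: they get 1 because the grouping ignores that a run of B rows sits in between.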
Here's one approach:

- Turn the DATE strings into datetime using pd.to_datetime.
- For CAR_ID, we want to put consecutive values into separate groups. Check for inequality (Series.ne) with the column's shift (Series.shift). A new group starts wherever the result is True, so we can apply Series.cumsum to get the groups.
- Pass the name columns and these groups to df.groupby, select DATE and apply transform with min.

df = pd.DataFrame(data)
# turn date strings into datetime
# (N.B. your date format is ambiguous! Adjust if it is `%m/%d/%Y`!)
df['DATE'] = pd.to_datetime(df['DATE'], format='%d/%m/%Y')
# consecutive values to groups
car_g = df['CAR_ID'].ne(df['CAR_ID'].shift()).cumsum()
# add `car_g` inside groupby, get `min` transformed and assign
df['MIN_DATE'] = df.groupby(['FIRST_NAME','LAST_NAME',car_g])['DATE'].transform('min')
df
FIRST_NAME LAST_NAME CAR_ID DATE MIN_DATE
0 BOB ONE 1234 2020-01-01 2020-01-01
1 BOB ONE 1234 2020-01-02 2020-01-01
2 BOB ONE 2345 2020-01-03 2020-01-03
3 BOB ONE 2345 2020-01-04 2020-01-03
4 BOB ONE 1234 2023-01-01 2023-01-01
5 BOB ONE 1234 2023-01-02 2023-01-01
6 BOB TWO 1234 2022-01-01 2022-01-01
7 BOB TWO 1234 2022-01-02 2022-01-01
8 BOB TWO 2345 2022-01-03 2022-01-03
9 BOB TWO 2345 2022-01-04 2022-01-03
10 BOB TWO 1234 2023-01-03 2023-01-03
11 BOB TWO 1234 2023-01-04 2023-01-03
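Since `data` is not shown above, here is a self-contained sketch of the same idea that can be pasted and run. The `rows` list is transcribed from the question, and the `%d/%m/%Y` format is the same assumption as in the snippet above. It also tries a variant, not part of the approach above, that builds the run ids from all three key columns at once, so that a change in name resets the run as well; `keys`, `changed` and `run_id` are names made up for this illustration.

```python
import pandas as pd

# Sample rows transcribed from the question.
rows = [
    ("BOB", "ONE", 1234, "1/1/2020"), ("BOB", "ONE", 1234, "2/1/2020"),
    ("BOB", "ONE", 2345, "3/1/2020"), ("BOB", "ONE", 2345, "4/1/2020"),
    ("BOB", "ONE", 1234, "1/1/2023"), ("BOB", "ONE", 1234, "2/1/2023"),
    ("BOB", "TWO", 1234, "1/1/2022"), ("BOB", "TWO", 1234, "2/1/2022"),
    ("BOB", "TWO", 2345, "3/1/2022"), ("BOB", "TWO", 2345, "4/1/2022"),
    ("BOB", "TWO", 1234, "3/1/2023"), ("BOB", "TWO", 1234, "4/1/2023"),
]
df = pd.DataFrame(rows, columns=["FIRST_NAME", "LAST_NAME", "CAR_ID", "DATE"])

# Assumption: day-first dates, as in the snippet above.
df["DATE"] = pd.to_datetime(df["DATE"], format="%d/%m/%Y")

# Approach from above: run ids from CAR_ID only, names added in the groupby.
car_g = df["CAR_ID"].ne(df["CAR_ID"].shift()).cumsum()
answer_min = df.groupby(["FIRST_NAME", "LAST_NAME", car_g])["DATE"].transform("min")

# Variant: a new run starts wherever ANY key column differs from the previous row.
keys = ["FIRST_NAME", "LAST_NAME", "CAR_ID"]
changed = df[keys].ne(df[keys].shift()).any(axis=1)
run_id = changed.cumsum()
df["MIN_DATE"] = df.groupby(run_id)["DATE"].transform("min")

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
print(df)
```

On this sample both versions give the same MIN_DATE column. The variant just keeps all of the reset logic in one place, which may read closer to the "reset on change in name and car ID" wording in the question.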