I got two dataframes, one (baseDataframe
) keeps some base data for every individual n
n date1 age
0 1135 2021-05-08 <NA>
1 1339 2021-08-02 <NA>
2 1456 2021-08-07 <NA>
3 5765 2021-01-02 <NA>
4 6566 2021-05-12 <NA>
5 7756 2021-06-22 <NA>
and a second dataframe (actionDataframe
) that keeps data for every action of those individuals n
n actionID action1date
0 1135 1 2022-04-01
1 1135 2 2022-06-11
2 1135 3 2022-08-04
3 1339 1 2022-05-03
4 1339 2 2022-07-21
5 1456 1 2022-08-07
6 1456 2 2022-09-30
7 1456 3 2022-11-11
8 1456 4 2023-01-12
9 5765 1 2022-04-22
10 5765 2 2022-06-11
11 6566 1 2022-04-02
12 7756 1 2022-12-12
13 7756 2 2023-02-01
now I want to fill the age
column of the baseDataframe
by getting the action1date
of the lowest actionID
for each n
from the actionDataframe
and subtract the date1
from baseDataframe
from it.
I got all of it working with a for-loop before (gonna skip the code for that one).
for n in baseDataframe.index
# calculations here
but now I read that iterating over dataframes is bad practice and slow. Now I'm wondering if there are other options to do something a bit more complex like the task I'm trying to do with built-in functions. I just can't find the right keywords to google and look up.
The end result I'm looking for:
n date1 age
0 1135 2021-05-08 328
1 1339 2021-08-02 274
2 1456 2021-08-07 365
3 5765 2021-01-02 475
4 6566 2021-05-12 325
5 7756 2021-06-22 538
How do I do it without an explicit loop?
Here's one way where for each n
, the indices of the lowest actionID
(which is found using groupby.idxmin
) is used to filter the corresponding action1date
values, which are in turn mapped to their baseDataFrame n
values and the time difference is computed. The desired column age
is the time difference in days. Note that if the datetime columns are already datetime64[ns]
dtype, then there's no need to convert to datetime again.
lowest_actionID_per_n = actionDataframe.groupby('n')['actionID'].idxmin()
action_dates = actionDataframe.loc[lowest_actionID_per_n, ['n', 'action1date']].set_index('n').squeeze()
baseDataframe['age'] = (pd.to_datetime(baseDataframe['n'].map(action_dates)) - pd.to_datetime(baseDataframe['date1'])).dt.days