Search code examples
pythonpandasdataframeloopsgroup-by

How to fill a pandas column by calculation involving values from two dataframes


I got two dataframes, one (baseDataframe) keeps some base data for every individual n

      n       date1   age
0  1135  2021-05-08  <NA>
1  1339  2021-08-02  <NA>
2  1456  2021-08-07  <NA>
3  5765  2021-01-02  <NA>
4  6566  2021-05-12  <NA>
5  7756  2021-06-22  <NA>

and a second dataframe (actionDataframe) that keeps data for every action of those individuals n

       n  actionID action1date
0   1135         1  2022-04-01
1   1135         2  2022-06-11
2   1135         3  2022-08-04
3   1339         1  2022-05-03
4   1339         2  2022-07-21
5   1456         1  2022-08-07
6   1456         2  2022-09-30
7   1456         3  2022-11-11
8   1456         4  2023-01-12
9   5765         1  2022-04-22
10  5765         2  2022-06-11
11  6566         1  2022-04-02
12  7756         1  2022-12-12
13  7756         2  2023-02-01

now I want to fill the age column of the baseDataframe by getting the action1date of the lowest actionID for each n from the actionDataframe and subtract the date1 from baseDataframe from it.

I got all of it working with a for-loop before (gonna skip the code for that one).

for n in baseDataframe.index
# calculations here

but now I read that iterating over dataframes is bad practice and slow. Now I'm wondering if there are other options to do something a bit more complex like the task I'm trying to do with built-in functions. I just can't find the right keywords to google and look up.

The end result I'm looking for:

      n       date1  age
0  1135  2021-05-08  328
1  1339  2021-08-02  274
2  1456  2021-08-07  365
3  5765  2021-01-02  475
4  6566  2021-05-12  325
5  7756  2021-06-22  538

How do I do it without an explicit loop?


Solution

  • Here's one way where for each n, the indices of the lowest actionID (which is found using groupby.idxmin) is used to filter the corresponding action1date values, which are in turn mapped to their baseDataFrame n values and the time difference is computed. The desired column age is the time difference in days. Note that if the datetime columns are already datetime64[ns] dtype, then there's no need to convert to datetime again.

    lowest_actionID_per_n = actionDataframe.groupby('n')['actionID'].idxmin()
    action_dates = actionDataframe.loc[lowest_actionID_per_n, ['n', 'action1date']].set_index('n').squeeze()
    baseDataframe['age'] = (pd.to_datetime(baseDataframe['n'].map(action_dates)) - pd.to_datetime(baseDataframe['date1'])).dt.days
    

    result