I'm wondering how to optimize a part of code to remove a loop which takes forever since I have around 350 000 IDs. Here is the current code, which is not optimal and takes quite a while. I'm trying to get it working better and if possible removing a loop.
The dataset is made of 4 columns with IDs, start_dates, end_dates and amount. We can have multi rows with same IDs but not the same amount. The main thing is in some rows the dates are not saved in the dataset. In that case we have to find the earlier start_date of the ID and the later end_date and add them to the row where it's not put in the dataframe
ID start_date end_date value
ABC 12/10/2010 12/12/2020 8
ABC 01/01/2020 01/04/2021 9
ABC 43
BCD 14/02/2020 14/03/2020 8
So we should have on the third row the start_date as 12/10/2010 and end date 01/04/2021. In the picture you cant see it but don't forget that BCD start_date could be earlier than ABC but you still use the 12/10/2010 because it is linked to the ID
for x in df['ID'].unique():
tmp = df.loc[df['ID'] == x].reset_index()
df.loc[(df['ID'] == x) & (df['start_date'].isna()), 'start_date'] = tmp['start_date'].min()
df.loc[(df['ID'] == x) & (df['end_date'].isna()), 'end_date'] = tmp['end_date'].max()
I suppose the code is quite clear about what I am trying to do. But if you have any questions don't hesitate do post them I'll do my best to answer.
set up the job
import pandas as pd
data = { 'ID': ['ABC','ABC','ABC','BCD'], 'start_date' : ['12/10/2010', '01/01/2020',None ,'14/02/2020'], 'end_date': ['12/12/2020', '01/01/2021',None ,'14/03/2020'], 'value': [8,9,43,8]}
df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
we get this result
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC NaT NaT 43
3 BCD 2020-02-14 2020-03-14 8
do the work
df.start_date = df.groupby('ID')['start_date'].apply(lambda x: x.fillna(x.min()))
df.end_date = df.groupby('ID')['end_date'].apply(lambda x: x.fillna(x.max()))
we get this result
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC 2010-12-10 2021-01-01 43
3 BCD 2020-02-14 2020-03-14 8