I have this data:
I need to create a new column in python that returns the number of days between the End Date of the task Technical Analysis and the End Date of the task Approval for the different Project Id's.
I've tried this approach:
approval_date = df.loc[df['Task Name'] == 'Approval', 'End Date'].values[0]
technical_analysis_date = df.loc[df['Task Name'] == 'Technical Analysis', 'End Date'].values[0]
nr_days_negotiation = (pd.to_datetime(approval_date) - pd.to_datetime(technical_analysis_date)).days
df['Nr_days_negotiation'] = nr_days_negotiation
But it's returning the same value of number of days for all the different project id's...
What I want to achieve is this:
How can I do this?
Also, what about if I had more tasks in the column Task Name, and I needed to create the column Nr_days_Negotiation that calculates the number of days between Technical Analysis and Approval tasks only for each Project Id?
(This is the example data in this case): Example data
Code
g = pd.to_datetime(df['End Date'], format='%d/%m/%Y').groupby(df['Project Id'])
df['Nr_days_negotiation'] = g.transform('max').sub(g.transform('min')).dt.days
df
Project Id Task Name End Date Nr_days_negotiation
0 Q6730168 Approval 13/11/2023 10
1 Q6730168 Technical Analysis 03/11/2023 10
2 Q6737180 Technical Analysis 05/02/2024 22
3 Q6737180 Approval 27/02/2024 22
4 Q6747678 Approval 30/11/2023 14
5 Q6747678 Technical Analysis 16/11/2023 14
6 Q6732796 Technical Analysis 20/11/2023 21
7 Q6732796 Approval 11/12/2023 21
Example Code
import pandas as pd
data = {
"Project Id": ["Q6730168", "Q6730168", "Q6737180", "Q6737180", "Q6747678", "Q6747678", "Q6732796", "Q6732796"],
"Task Name": ["Approval", "Technical Analysis", "Technical Analysis", "Approval", "Approval", "Technical Analysis", "Technical Analysis", "Approval"],
"End Date": ["13/11/2023", "03/11/2023", "05/02/2024", "27/02/2024", "30/11/2023", "16/11/2023", "20/11/2023", "11/12/2023"]
}
df = pd.DataFrame(data)