I have a sample data that looks like this.
Column DateDuration was calculated in Excel, following below logic:
I would like to calculate DateDuration in Python but do not know how to do about this.
Types of data in Python:
I am new to Python. Any help would be greatly appreciated!!
import pandas as pd
import numpy as np
df['FirstDate'] = pd.to_datetime(df['FirstDate'])
df['SecondDate'] = pd.to_datetime(df['SecondDate'])
df['DayDifference2'] = (df['SecondDate']) -(df['FirstDate'])
df['DayDifference3'] = (df['ThirdDate']) -(df['FirstDate'])
df['DayDifference4'] = (df['FourthDate']) -(df['FirstDate'])
x = df['DayDifference2'].dt.days
y = df['DayDifference3'].dt.days
z = df['DayDifference4'].dt.days
condlist = [x<28, x>=28]
choicelist = [(df['ThirdDate']) -(df['FirstDate']), (df['SecondDate']) -(df['FirstDate'])]
np.select(condlist, choicelist)
My data:
ID | FirstDate | SecondDate | ThirdDate | FourthDate | DateDuration |
---|---|---|---|---|---|
2914300 | 2021-09-23 | 2021-10-07 | 2021-11-29 | 2021-12-20 | 67 |
3893461 | 2021-09-08 | 2021-10-06 | 2022-04-07 | 211 | |
4343075 | 2021-06-23 | 2021-09-27 | 96 | ||
4347772 | 2021-06-23 | 2021-09-27 | 96 | ||
4551963 | 2021-08-02 | 2021-10-14 | 2022-03-11 | 73 | |
4893324 | 2021-09-30 | 2021-10-01 | 2022-03-03 | 2022-03-10 | 154 |
5239991 | 2021-06-24 | 2021-08-26 | 2021-09-25 | 2022-02-03 | 63 |
8454947 | 2021-09-28 | 2021-10-05 | 7 | ||
8581390 | 2021-09-27 | 2022-03-21 | 2022-03-25 | 175 | |
8763766 | 2021-09-20 | 2021-10-04 | 2021-12-09 | 80 | |
9144185 | 2021-06-18 | 2021-06-23 | 5 | ||
9967685 | 2021-09-13 | 2021-10-29 | 2022-02-07 | 2022-03-23 | 46 |
11367560 | 2021-08-31 | 2021-09-28 | 2021-10-21 | 2022-02-11 | 51 |
import pandas as pd
import numpy as np
df = pd.read_csv('date_example.csv')
df.loc[:,'FirstDate':'FourthDate'] = df.loc[:,'FirstDate':'FourthDate'].astype('datetime64[ns]')
df
NaT
is a missing value of datetime64[ns]
type
Conditions & Choices
conditions = [
(df['SecondDate'] - df['FirstDate']).dt.days >= 28,
((df['SecondDate'] - df['FirstDate']).dt.days < 28) & df['ThirdDate'].isna(),
((df['SecondDate'] - df['FirstDate']).dt.days < 28) & df['ThirdDate'].notna() & ((df['ThirdDate'] - df['FirstDate']).dt.days >= 28),
((df['SecondDate'] - df['FirstDate']).dt.days < 28) & df['ThirdDate'].notna() & ((df['ThirdDate'] - df['FirstDate']).dt.days < 28) & df['FourthDate'].isna(),
((df['SecondDate'] - df['FirstDate']).dt.days < 28) & df['ThirdDate'].notna() & ((df['ThirdDate'] - df['FirstDate']).dt.days < 28) & df['FourthDate'].notna()
]
choices = [
(df['SecondDate'] - df['FirstDate']).dt.days,
(df['SecondDate'] - df['FirstDate']).dt.days,
(df['ThirdDate'] - df['FirstDate']).dt.days,
(df['ThirdDate'] - df['FirstDate']).dt.days,
(df['FourthDate'] - df['FirstDate']).dt.days
]
df['Duration'] = np.select(conditions, choices)
df
Discussion:
There are some differences, e.g., second row, ID = 3893461
, according to your conditions(DateDuration between SecondDate and FirstDate >= 28, then DateDuration = SecondDate - FirstDate.), SecondDate
- FirstDate
of ID = 3893461
is 28
, same thing happened on last row, ID = 11367560