I have pandas DataFrame:
start_date finish_date progress_id
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00 a387ab916f402cb3fbfffd29f68fd0ce
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00 3b9dce04f32da32763124602557f92a3
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00 73e17a05355852fe65b785c82c37d1ad
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00 cc3eb34ae49c719648352c4175daee88
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00 04ace4fe130d90c801e24eea13ee808e
I converted columns to datetime.date because I don't need time in df:
df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date
So, I need a new column which will be contain year-month if start_date
and finish_date
have same month. And if different put range between them. For example start_date = 06-2020, finish_date = 08-2020 the result is [06-2020, 07-2020, 08-2020]. Then I need to explode
it by column.
I tried:
df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')
but as a result I had many NaT's in the column.
Any solutions will be great.
One alternative is the following. Assume you have the following dataframe, df:
start_date finish_date \
0 2018-06-23 08:28:50.681065+00 2018-06-23 08:28:52.439542+00
1 2019-03-18 14:23:17.328374+00 2019-03-18 14:54:50.979612+00
2 2019-07-09 09:18:46.19862+00 2019-07-11 08:03:09.222385+00
3 2018-07-27 15:39:17.666629+00 2018-07-27 16:13:55.086871+00
4 2019-04-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
5 2019-05-24 18:42:40.272854+00 2019-04-24 18:44:57.507857+00
progress_id
0 a387ab916f402cb3fbfffd29f68fd0ce
1 3b9dce04f32da32763124602557f92a3
2 73e17a05355852fe65b785c82c37d1ad
3 cc3eb34ae49c719648352c4175daee88
4 04ace4fe130d90c801e24eea13ee808e
5 04ace4fe130d90c801e24eea13ee808e
It is the same you shared pllus one row where the dates (year and month) differ.
Then applying this:
df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
def range(row):
if row['Start_M_Y'] == row['finish_M_Y']:
val = row['Start_M_Y']
elif row['Start_M_Y'] != row['finish_M_Y']:
val = pd.date_range(row['Start_M_Y'] , row['finish_M_Y'], freq='M')
else:
val = -1
return val
df['Range'] = df.apply(range, axis=1)
df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
gives you
start_date finish_date \
0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00
1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00
2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00
3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00
4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00
progress_id Range
0 a387ab916f402cb3fbfffd29f68fd0ce 2018-06
1 3b9dce04f32da32763124602557f92a3 2019-03
2 73e17a05355852fe65b785c82c37d1ad 2019-07
3 cc3eb34ae49c719648352c4175daee88 2018-07
4 04ace4fe130d90c801e24eea13ee808e 2019-04
5 04ace4fe130d90c801e24eea13ee808e 2019-05-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-06-30 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-07-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-08-31 00:00:00
5 04ace4fe130d90c801e24eea13ee808e 2019-09-30 00:00:00