Search code examples
pandaspython-datetime

pandas: get range between date columns


I have pandas DataFrame:

    start_date                      finish_date                     progress_id
0   2018-06-23 08:28:50.681065+00   2018-06-23 08:28:52.439542+00   a387ab916f402cb3fbfffd29f68fd0ce
1   2019-03-18 14:23:17.328374+00   2019-03-18 14:54:50.979612+00   3b9dce04f32da32763124602557f92a3
2   2019-07-09 09:18:46.19862+00    2019-07-11 08:03:09.222385+00   73e17a05355852fe65b785c82c37d1ad
3   2018-07-27 15:39:17.666629+00   2018-07-27 16:13:55.086871+00   cc3eb34ae49c719648352c4175daee88
4   2019-04-24 18:42:40.272854+00   2019-04-24 18:44:57.507857+00   04ace4fe130d90c801e24eea13ee808e

I converted columns to datetime.date because I don't need time in df:

df['start_date'] = pd.to_datetime(df['start_date']).dt.date
df['finish_date'] = pd.to_datetime(df['finish_date']).dt.date

So, I need a new column which will be contain year-month if start_date and finish_date have same month. And if different put range between them. For example start_date = 06-2020, finish_date = 08-2020 the result is [06-2020, 07-2020, 08-2020]. Then I need to explode it by column. I tried:

df['range'] = df.apply(lambda x: pd.date_range(x['start_date'], x['finish_date'], freq="M"), axis=1)
df = df.explode('range')

but as a result I had many NaT's in the column.

Any solutions will be great.


Solution

  • One alternative is the following. Assume you have the following dataframe, df:

                         start_date                    finish_date  \
    0  2018-06-23 08:28:50.681065+00  2018-06-23 08:28:52.439542+00   
    1  2019-03-18 14:23:17.328374+00  2019-03-18 14:54:50.979612+00   
    2   2019-07-09 09:18:46.19862+00  2019-07-11 08:03:09.222385+00   
    3  2018-07-27 15:39:17.666629+00  2018-07-27 16:13:55.086871+00   
    4  2019-04-24 18:42:40.272854+00  2019-04-24 18:44:57.507857+00   
    5  2019-05-24 18:42:40.272854+00  2019-04-24 18:44:57.507857+00   
    
                            progress_id  
    0  a387ab916f402cb3fbfffd29f68fd0ce  
    1  3b9dce04f32da32763124602557f92a3  
    2  73e17a05355852fe65b785c82c37d1ad  
    3  cc3eb34ae49c719648352c4175daee88  
    4  04ace4fe130d90c801e24eea13ee808e  
    5  04ace4fe130d90c801e24eea13ee808e  
    

    It is the same you shared pllus one row where the dates (year and month) differ.

    Then applying this:

    df['start_date'] = pd.to_datetime(df['start_date'],format='%Y-%m-%d')
    df['finish_date'] = pd.to_datetime(df['finish_date'],format='%Y-%m-%d')
    
    df['finish_M_Y'] = df['finish_date'].dt.strftime('%Y-%m')
    df['Start_M_Y'] = df['start_date'].dt.strftime('%Y-%m')
    
    def range(row):
        if row['Start_M_Y'] == row['finish_M_Y']:
            val = row['Start_M_Y'] 
        elif  row['Start_M_Y'] != row['finish_M_Y']:
            val = pd.date_range(row['Start_M_Y'] , row['finish_M_Y'], freq='M')
        else:
            val = -1
        return val
    
    df['Range'] = df.apply(range, axis=1)
    df.explode('Range').drop(['Start_M_Y', 'finish_M_Y'], axis=1)
    
    

    gives you

                            start_date                      finish_date  \
    0 2018-06-23 08:28:50.681065+00:00 2018-06-23 08:28:52.439542+00:00   
    1 2019-03-18 14:23:17.328374+00:00 2019-03-18 14:54:50.979612+00:00   
    2 2019-07-09 09:18:46.198620+00:00 2019-07-11 08:03:09.222385+00:00   
    3 2018-07-27 15:39:17.666629+00:00 2018-07-27 16:13:55.086871+00:00   
    4 2019-04-24 18:42:40.272854+00:00 2019-04-24 18:44:57.507857+00:00   
    5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00   
    5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00   
    5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00   
    5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00   
    5 2019-05-24 18:42:40.272854+00:00 2019-10-24 18:44:57.507857+00:00   
    
                            progress_id                Range  
    0  a387ab916f402cb3fbfffd29f68fd0ce              2018-06  
    1  3b9dce04f32da32763124602557f92a3              2019-03  
    2  73e17a05355852fe65b785c82c37d1ad              2019-07  
    3  cc3eb34ae49c719648352c4175daee88              2018-07  
    4  04ace4fe130d90c801e24eea13ee808e              2019-04  
    5  04ace4fe130d90c801e24eea13ee808e  2019-05-31 00:00:00  
    5  04ace4fe130d90c801e24eea13ee808e  2019-06-30 00:00:00  
    5  04ace4fe130d90c801e24eea13ee808e  2019-07-31 00:00:00  
    5  04ace4fe130d90c801e24eea13ee808e  2019-08-31 00:00:00  
    5  04ace4fe130d90c801e24eea13ee808e  2019-09-30 00:00:00