Search code examples
pandasdatetime-series

Convert dataframe with start and end date to daily data


I have one record per ID with start date and end date

id  age state   start_date  end_date
123 18  CA     2/17/2019    5/4/2019
223 24  AZ     1/17/2019    3/4/2019

I want to create a record for each day between the start and end day, so I can join daily activity data to it. The target output would look something like this

id  age state   start_date
123 18  CA      2/17/2019
123 18  CA      2/18/2019
123 18  CA      2/19/2019
123 18  CA      2/20/2019
123 18  CA      2/21/2019
            …
123 18  CA      5/2/2019
123 18  CA      5/3/2019
123 18  CA      5/4/2019

And of course do this for all ids and their respective start dates in the dataset.


Solution

  • Edit: I had to revisit this problem in a project, and looks like using DataFrame.apply with pd.date_range and DataFrame.explode is almost 3x faster:

    df["date"] = df.apply(
        lambda row: pd.date_range(row["start_date"], row["end_date"]),
        axis=1
    )
    df = (
        df.explode("date", ignore_index=True)
        .drop(columns=["start_date", "end_date"])
    )
    

    Output

          id  age state       date
    0    123   18    CA 2019-02-17
    1    123   18    CA 2019-02-18
    2    123   18    CA 2019-02-19
    3    123   18    CA 2019-02-20
    4    123   18    CA 2019-02-21
    ..   ...  ...   ...        ...
    119  223   24    AZ 2019-02-28
    120  223   24    AZ 2019-03-01
    121  223   24    AZ 2019-03-02
    122  223   24    AZ 2019-03-03
    123  223   24    AZ 2019-03-04
    
    [124 rows x 4 columns]
    

    Original answer:

    melt, GroupBy, resample & ffill

    First we melt (unpivot) your two date columns to one. Then we resample on day basis:

    melt = df.melt(id_vars=['id', 'age', 'state'], value_name='date').drop('variable', axis=1)
    melt['date'] = pd.to_datetime(melt['date'])
    
    melt = melt.groupby('id').apply(lambda x: x.set_index('date').resample('d').first())\
               .ffill()\
               .reset_index(level=1)\
               .reset_index(drop=True)
    

    Output

              date     id   age state
    0   2019-02-17  123.0  18.0    CA
    1   2019-02-18  123.0  18.0    CA
    2   2019-02-19  123.0  18.0    CA
    3   2019-02-20  123.0  18.0    CA
    4   2019-02-21  123.0  18.0    CA
    ..         ...    ...   ...   ...
    119 2019-02-28  223.0  24.0    AZ
    120 2019-03-01  223.0  24.0    AZ
    121 2019-03-02  223.0  24.0    AZ
    122 2019-03-03  223.0  24.0    AZ
    123 2019-03-04  223.0  24.0    AZ
    
    [124 rows x 4 columns]