Search code examples
pythonpandasduplicatesdate-rangemedical

Insert Duplicate Row Data between 2 Dates Pandas


I'm trying to find a method of duplicating all of the data in a row for every day between dates. Start date and end date.

This is the dataset:

Name            Medication  Start       End         Dose
James Jameson   Depakote    2013-07-01  2015-04-13  1500
James Jameson   Depakote    2015-04-14  2015-04-22  1750
James Jameson   Depakote    2015-04-23  2015-06-30  2000
James Jameson   Naltrexone  2013-07-01  2015-06-30  100
James Jameson   Trazodone   2013-07-01  2015-06-30  300
James Jameson   Xanex       2013-07-01  2014-02-01  0
James Jameson   Xanex       2014-02-02  2015-06-30  1
James Jameson   Zoloft      2014-03-21  2015-06-30  100
James Jameson   Zoloft      2014-07-01  2014-03-20  50

What I would like to do is repeat the row, incrementing the date, every day between the start and end values. I feel I can do this with some combination of date range and append but I can't find a solution for the entire row.

Example outcome:

Name            Medication  Start       End         Dose
James Jameson   Depakote    2013-07-01  2015-04-13  1500
James Jameson   Depakote    2013-07-02  2015-04-13  1500
James Jameson   Depakote    2013-07-03  2015-04-13  1500
James Jameson   Depakote    2013-07-x   2015-04-13  1500
.
.
.
James Jameson   Depakote    2015-04-13  2015-04-13  1500
James Jameson   Depakote    2015-04-14  2015-04-22  1750
James Jameson   Depakote    2015-04-15  2015-04-22  1750
James Jameson   Depakote    2015-04-16  2015-04-22  1750
James Jameson   Depakote    2015-04-x   2015-04-22  1750
.
.
.
James Jameson   Depakote    2015-04-22  2015-04-22  1750
James Jameson   Depakote    2015-04-23  2015-06-30  2000
James Jameson   Naltrexone  2013-07-01  2015-06-30  100
James Jameson   Trazodone   2013-07-01  2015-06-30  300
James Jameson   Xanex       2013-07-01  2014-02-01  0
James Jameson   Xanex       2014-02-02  2015-06-30  1
James Jameson   Zoloft      2014-03-21  2015-06-30  100
James Jameson   Zoloft      2014-07-01  2014-03-20  50

So ultimately it will create a new duplicate row for every day between the start date and end date in the original dataframe.


Solution

  • Use reindex with groupby

    df = pd.DataFrame({'Name': {0: 'James Jameson', 1: 'James Jameson', 2: 'James Jameson', 3: 'James Jameson', 4: 'James Jameson'}, 'Medication': {0: 'Depakote', 1: 'Depakote', 2: 'Depakote', 3: 'Naltrexone', 4: 'Trazodone'}, 'Start': {0: '2013-07-01', 1: '2015-04-14', 2: '2015-04-23', 3: '2013-07-01', 4: '2013-07-01'}, 'End': {0: '2015-04-13', 1: '2015-04-22', 2: '2015-06-30', 3: '2015-06-30', 4: '2015-06-30'}, 'Dose': {0: 1500, 1: 1750, 2: 2000, 3: 100, 4: 300}})
    df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)
    
    
    df2 = pd.concat([g.set_index('Start').reindex(pd.date_range(g['Start'].min(), g['End'].max(), freq='d'), method='ffill').reset_index().rename({'index':'Start'}, axis=1)
                     for _, g in df.groupby(['Name','Medication','Dose'])],
                    axis=0)
    

    A bit more detail in response to the question below -- the list comprehension is effectively just this for loop put onto one line. There is a lot of text, but it only looks as complex as it does because it isn't storing values in variables -- though doing so may make what's happening a bit clearer. Viz.:

    # undoing the list comprehension...
    sub_dfs = []
    for group_value, group_df in df.groupby(['Name','Medication','Dose']):
        print(group_value)
        # if not using the group_value for anything, you can just put '_' 
        # to unpack the .groupby() tuple but not bother naming it
        
        # construct date range
        group_start_date = group_df['Start'].min()
        group_end_date = group_df['End'].max()
        group_date_range = pd.date_range(group_start_date, group_end_date, freq='d')
        
        new_group_df = group_df.set_index('Start').reindex(group_date_range, method='ffill').reset_index().rename({'index':'Start'}, axis=1)
        sub_dfs.append(new_group_df)
        
    df2b = pd.concat(sub_dfs, axis=0)