I'm trying to find a method of duplicating all of the data in a row for every day between dates. Start date and end date.
This is the dataset:
Name Medication Start End Dose
James Jameson Depakote 2013-07-01 2015-04-13 1500
James Jameson Depakote 2015-04-14 2015-04-22 1750
James Jameson Depakote 2015-04-23 2015-06-30 2000
James Jameson Naltrexone 2013-07-01 2015-06-30 100
James Jameson Trazodone 2013-07-01 2015-06-30 300
James Jameson Xanex 2013-07-01 2014-02-01 0
James Jameson Xanex 2014-02-02 2015-06-30 1
James Jameson Zoloft 2014-03-21 2015-06-30 100
James Jameson Zoloft 2014-07-01 2014-03-20 50
What I would like to do is repeat the row, incrementing the date, every day between the start and end values. I feel I can do this with some combination of date range and append but I can't find a solution for the entire row.
Example outcome:
Name Medication Start End Dose
James Jameson Depakote 2013-07-01 2015-04-13 1500
James Jameson Depakote 2013-07-02 2015-04-13 1500
James Jameson Depakote 2013-07-03 2015-04-13 1500
James Jameson Depakote 2013-07-x 2015-04-13 1500
.
.
.
James Jameson Depakote 2015-04-13 2015-04-13 1500
James Jameson Depakote 2015-04-14 2015-04-22 1750
James Jameson Depakote 2015-04-15 2015-04-22 1750
James Jameson Depakote 2015-04-16 2015-04-22 1750
James Jameson Depakote 2015-04-x 2015-04-22 1750
.
.
.
James Jameson Depakote 2015-04-22 2015-04-22 1750
James Jameson Depakote 2015-04-23 2015-06-30 2000
James Jameson Naltrexone 2013-07-01 2015-06-30 100
James Jameson Trazodone 2013-07-01 2015-06-30 300
James Jameson Xanex 2013-07-01 2014-02-01 0
James Jameson Xanex 2014-02-02 2015-06-30 1
James Jameson Zoloft 2014-03-21 2015-06-30 100
James Jameson Zoloft 2014-07-01 2014-03-20 50
So ultimately it will create a new duplicate row for every day between the start date and end date in the original dataframe.
Use reindex
with groupby
df = pd.DataFrame({'Name': {0: 'James Jameson', 1: 'James Jameson', 2: 'James Jameson', 3: 'James Jameson', 4: 'James Jameson'}, 'Medication': {0: 'Depakote', 1: 'Depakote', 2: 'Depakote', 3: 'Naltrexone', 4: 'Trazodone'}, 'Start': {0: '2013-07-01', 1: '2015-04-14', 2: '2015-04-23', 3: '2013-07-01', 4: '2013-07-01'}, 'End': {0: '2015-04-13', 1: '2015-04-22', 2: '2015-06-30', 3: '2015-06-30', 4: '2015-06-30'}, 'Dose': {0: 1500, 1: 1750, 2: 2000, 3: 100, 4: 300}})
df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)
df2 = pd.concat([g.set_index('Start').reindex(pd.date_range(g['Start'].min(), g['End'].max(), freq='d'), method='ffill').reset_index().rename({'index':'Start'}, axis=1)
for _, g in df.groupby(['Name','Medication','Dose'])],
axis=0)
A bit more detail in response to the question below -- the list comprehension is effectively just this for
loop put onto one line. There is a lot of text, but it only looks as complex as it does because it isn't storing values in variables -- though doing so may make what's happening a bit clearer. Viz.:
# undoing the list comprehension...
sub_dfs = []
for group_value, group_df in df.groupby(['Name','Medication','Dose']):
print(group_value)
# if not using the group_value for anything, you can just put '_'
# to unpack the .groupby() tuple but not bother naming it
# construct date range
group_start_date = group_df['Start'].min()
group_end_date = group_df['End'].max()
group_date_range = pd.date_range(group_start_date, group_end_date, freq='d')
new_group_df = group_df.set_index('Start').reindex(group_date_range, method='ffill').reset_index().rename({'index':'Start'}, axis=1)
sub_dfs.append(new_group_df)
df2b = pd.concat(sub_dfs, axis=0)