Search code examples
pandasgroup-by

Inserting Rows with Consecutive Dates for Different Projects with Different Start and End Dates in pandas


I have a df called data of project records that looks somewhat like this:

project = ['Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5']
date = ['2010-10-12','2010-10-15','2010-10-20','2010-10-22','2012-05-05','2012-05-07','2012-05-10','2018-01-01','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-04','2010-02-07']
date = pd.to_datetime(date)
hours = [0,1,0,2,4,0,2,1,0,2,4,2,4,3]
taskcount = [0,1,2,0,1,2,0,1,2,0,1,0,0,2]
data = pd.DataFrame({'Project':project, 'Date':date, 'Hours':hours,'TaskCount':taskcount})

The column Hours shows the number of hours worked on the particular project on the given date, while the column TaskCount gives a count of the number of tasks completed for that project on the given date.

I have a second df called project_duration containing info about the duration of each of the projects in the df data:

column_DateFirstRecord = data.groupby('Project').apply(lambda df: df.Date.min())
column_DateLastRecord =  data.groupby('Project').apply(lambda df: df.Date.max())
project_duration = pd.concat([column_DateFirstRecord, column_DateLastRecord], axis=1)
project_duration.columns = ['DateFirstRecord', 'DateLastRecord']
project_duration = project_duration.assign(ProjectLength = (project_duration.DateLastRecord - project_duration.DateFirstRecord))

For each project in the df data, I need to append rows to the df data with the missing dates from the date of the first record to the date of the last record for that particular project. For instance, for Project 1, I need to add rows to the df data for Jan 13-16, 17-19, and 21, 2010. Note that these new rows should have the value 0 in the columns Hours and TaskCount.

The output that I'm looking for should look like the df data_output that I've created below:

date_output = ['2010-10-12', '2010-10-13','2010-10-14','2010-10-15','2010-10-16','2010-10-17','2010-10-18','2010-10-19','2010-10-20','2010-10-21','2010-10-22','2012-05-05','2012-05-06','2012-05-07','2012-05-08','2012-05-09','2012-05-10','2018-01-01', '2018-01-02','2018-01-03','2018-01-04','2018-01-05','2018-01-06','2019-10-02','2010-02-02','2010-02-03','2010-02-04','2010-02-05','2010-02-06','2010-02-07']
date_output = pd.to_datetime(date_output)
project_output = ['Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 1','Project 2','Project 2','Project 2','Project 2','Project 2','Project 2','Project 3','Project 3','Project 3','Project 3','Project 3','Project 3','Project 4','Project 5','Project 5','Project 5','Project 5','Project 5','Project 5']
hours_output = [0,0,0,1,0,0,0,0,0,0,2,4,0,0,0,0,2,1,0,0,0,0,2,4,2,0,4,0,0,3]
taskcount_output = [0,0,0,1,0,0,0,0,2,0,0,1,0,2,0,0,0,1,0,0,0,2,0,1,0,0,0,0,0,2]
data_output = pd.DataFrame({'Project':project_output, 'Date':date_output, 'Hours':hours_output,'TaskCount':taskcount_output})

I should also note that the real dfs that I'm working with are very large - they comprise about 278,000 rows - so I'm hoping to find an efficient solution. I tried the method detailed in this StackOverflow post: Pandas filling missing dates and values within group , but it didn't allow for the different start and end dates for each project.


Solution

  • One option:

    (data.groupby('Project')
         .apply(lambda g: g.set_index('Date')
                           .reindex(pd.date_range(project_duration.loc[g.name, 'DateFirstRecord'],
                                                  project_duration.loc[g.name, 'DateLastRecord']
                                                 ).rename('Date'),
                                    fill_value=0
                                   )
                )[['Hours', 'TaskCount']]
          .reset_index()
    )
    

    Output:

          Project       Date  Hours  TaskCount
    0   Project 1 2010-10-12      0          0
    1   Project 1 2010-10-13      0          0
    2   Project 1 2010-10-14      0          0
    3   Project 1 2010-10-15      1          1
    4   Project 1 2010-10-16      0          0
    5   Project 1 2010-10-17      0          0
    6   Project 1 2010-10-18      0          0
    7   Project 1 2010-10-19      0          0
    8   Project 1 2010-10-20      0          2
    9   Project 1 2010-10-21      0          0
    10  Project 1 2010-10-22      2          0
    11  Project 2 2012-05-05      4          1
    12  Project 2 2012-05-06      0          0
    13  Project 2 2012-05-07      0          2
    14  Project 2 2012-05-08      0          0
    15  Project 2 2012-05-09      0          0
    16  Project 2 2012-05-10      2          0
    17  Project 3 2018-01-01      1          1
    18  Project 3 2018-01-02      0          0
    19  Project 3 2018-01-03      0          0
    20  Project 3 2018-01-04      0          0
    21  Project 3 2018-01-05      0          2
    22  Project 3 2018-01-06      2          0
    23  Project 4 2019-10-02      4          1
    24  Project 5 2010-02-02      2          0
    25  Project 5 2010-02-03      0          0
    26  Project 5 2010-02-04      4          0
    27  Project 5 2010-02-05      0          0
    28  Project 5 2010-02-06      0          0
    29  Project 5 2010-02-07      3          2