Search code examples
pythonpandasdatetimedelta

How do I count the number of repeated dates, using pandas & datetime, within an imported excel column?


I am looking for a way I can apply a count() function only to a specific range of dates. I have attached the code I am using right now. As you can see, I have already generated the specific range of dates I want but I am unsure of how to apply the count function to this range and then have it still generate a plot.

The sample of data below is only a few of the columns - there are 16 but I only need to use date_report.

Data from excel sample:

'''

   sex---------country-------date_report

    M           Canada       03-01-2020

    F           Canada       03-01-2020

    M           Canada       03-02-2020

    F           Canada       03-02-2020

    M           Canada       03-02-2020

    M           Canada       03-03-2020

    F           Canada       03-03-2020

    M           Canada       03-04-2020

    F           Canada       03-04-2020

    M           Canada       03-04-2020

'''

I need to calculate the number of cases from March 1 - July 10/2020 from the date_report column. There are duplicate dates in the rows, those need to be summed up so there is one total value per date (2 for 03-01-2020, 4 for 3-02-2020 and so on)

After reading the excel file and importing pandas and datetime:

'''

    day_first=datetime.date(2020, 3, 1)
    day_last=datetime.date(2020, 7, 10)
    delta = (day_last - day_first)
    print(delta.days)

    for i in range(delta.days + 1):
        all_dates = day_first + datetime.timedelta(+i)
        print(all_dates)    # This gives me the range of dates I am looking for. 

    date_count=df.groupby('date_report').date_report.count()

    print(date_count)

    date_count.plot(kind='line') # This plot goes from the first date in January until the end of the list, I just want March 1-July 10. 

'''

Like I said, the goal is to have a line plot from dates March 1-July 10 on the x axis, showing the total number of cases per day on the y axis. Any help would be greatly appreciated!


Solution

  • You just have to reduce the initial dataset to the dates you're interested in before performing the groupby:

    import matplotlib.pyplot as plt
    
    # select data between dates of interest
    sub_df = df.loc[df['date_report'].between(first_day, last_day), :]
    
    date_count = sub_df.groupby('date_report').date_report.count()
    
    date_count.plot(kind='line')
    plt.show()
    

    If you need to change the way the dates are stored in the dataframe for compatibility issues between various types (datetime.datetime, datetime64[ns]), you can convert them to the versatile pd.Timestamp objects with

    df['date_report'] = [pd.Timestamp(d) for d in df['date_report']]
    

    or, in this case, the bounding datetimes:

    first_day = pd.Timestamp(first_day)
    last_day = ...