I am looking for a way I can apply a count() function only to a specific range of dates. I have attached the code I am using right now. As you can see, I have already generated the specific range of dates I want but I am unsure of how to apply the count function to this range and then have it still generate a plot.
The sample of data below is only a few of the columns - there are 16 but I only need to use date_report.
Data from excel sample:
'''
sex---------country-------date_report
M Canada 03-01-2020
F Canada 03-01-2020
M Canada 03-02-2020
F Canada 03-02-2020
M Canada 03-02-2020
M Canada 03-03-2020
F Canada 03-03-2020
M Canada 03-04-2020
F Canada 03-04-2020
M Canada 03-04-2020
'''
I need to calculate the number of cases from March 1 - July 10/2020 from the date_report column. There are duplicate dates in the rows, those need to be summed up so there is one total value per date (2 for 03-01-2020, 4 for 3-02-2020 and so on)
After reading the excel file and importing pandas and datetime:
'''
day_first=datetime.date(2020, 3, 1)
day_last=datetime.date(2020, 7, 10)
delta = (day_last - day_first)
print(delta.days)
for i in range(delta.days + 1):
all_dates = day_first + datetime.timedelta(+i)
print(all_dates) # This gives me the range of dates I am looking for.
date_count=df.groupby('date_report').date_report.count()
print(date_count)
date_count.plot(kind='line') # This plot goes from the first date in January until the end of the list, I just want March 1-July 10.
'''
Like I said, the goal is to have a line plot from dates March 1-July 10 on the x axis, showing the total number of cases per day on the y axis. Any help would be greatly appreciated!
You just have to reduce the initial dataset to the dates you're interested in before performing the groupby:
import matplotlib.pyplot as plt
# select data between dates of interest
sub_df = df.loc[df['date_report'].between(first_day, last_day), :]
date_count = sub_df.groupby('date_report').date_report.count()
date_count.plot(kind='line')
plt.show()
If you need to change the way the dates are stored in the dataframe for compatibility issues between various types (datetime.datetime
, datetime64[ns]
), you can convert them to the versatile pd.Timestamp
objects with
df['date_report'] = [pd.Timestamp(d) for d in df['date_report']]
or, in this case, the bounding datetimes:
first_day = pd.Timestamp(first_day)
last_day = ...