I've got an issue with pandas resample function when trying resample a time series. My program fetches daily traffic data two years back from today and populates it in a .csv file. Resampling the data initially functioned well but recently it has started acting up. When I try to resample the daily data into weekly, monthly or quarterly frequency, pandas seems to randomly give out-of sample (non-existent) data from both sides of the actual range.
I first create a Pandas data frame from the csv file:
data = pd.read_csv('Trucks.csv')
data['Date'] = pd.to_datetime(data['Date'], infer_datetime_format=True)
data.set_index('Date',inplace=True)
data['Modified Total Trucks'] = data['Modified Total Trucks'].astype(int)
Here's a sample of the data:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2020-07-04 3898 2535 805 2281 812
2020-06-04 4125 2740 927 2378 820
2020-05-04 730 569 234 431 65
2020-04-04 465 354 145 270 50
2020-03-04 3501 2377 812 2051 638
2020-02-04 3594 2334 754 2081 759
...
2018-04-13 3243 2333 819 1978 446
2018-12-04 3402 2394 767 2144 491
2018-11-04 3559 2543 859 2209 491
2018-10-04 3492 2473 813 2182 497
2018-09-04 3733 2672 902 2321 510
I then try to resample the data:
DataWeekly = data.resample('1W').sum()
DataMonthly = data.resample('1M').sum()
DataQuarterly = data.resample('1Q').sum()
However, the resampled data frames have the wrong range and sometimes incorrect values. Here's an example of the monthly set:
Date Total Trucks Modified Total Trucks Solo Trucks Semi Trucks Full Trucks
2018-01-31 15553 11119 3842 9531 2180
2018-02-28 18488 13113 4497 11291 2700
2018-03-31 21355 15177 5134 13176 3045
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
2020-05-31 7983 5840 2053 4951 979
2020-06-30 11200 7918 2785 6710 1705
2020-07-31 10998 7673 2576 6691 1731
2020-08-31 4602 3323 1155 2838 609
2020-09-30 7980 5794 1991 4981 1008
2020-10-31 9759 7060 2464 6012 1283
2020-11-30 7762 5595 1906 4836 1020
2020-12-31 7642 5412 1790 4760 1092
I would expect the resample to be:
2018-04-30 67785 48478 16524 41893 9368
2018-05-31 72390 51690 17666 44594 10130
2018-06-30 63877 45356 14938 40000 8939
2018-07-31 64846 46437 16108 39703 9035
2018-08-31 68352 49036 16905 42081 9366
2018-09-30 64629 46379 15963 39842 8824
2018-10-31 68093 48609 16806 41643 9644
2018-11-30 74643 53052 18581 45073 10989
2018-12-31 60270 43042 15030 36649 8591
2019-01-31 76866 55463 18994 47789 10083
2019-02-28 74705 53744 18170 46674 9861
2019-03-31 78664 56562 19108 49144 10412
2019-04-30 77760 56175 19356 48224 10180
2019-05-31 88033 63219 22049 53859 12125
2019-06-30 70370 50626 17448 43454 9468
2019-07-31 76014 54531 18698 46947 10369
2019-08-31 83509 60418 21600 50653 11256
2019-09-30 77289 55375 19097 47517 10675
2019-10-31 83514 60021 20761 51397 11356
2019-11-30 81383 58460 20550 49551 11282
2019-12-31 68307 49172 17092 41990 9225
2020-01-31 59448 42384 14547 36472 8429
2020-02-29 53862 38544 13687 32457 7718
2020-03-31 62950 43478 14930 37403 10617
2020-04-30 7796 5645 1968 4811 1017
What am I missing? Many thanks in advance!
I think this is a problem with US vs ISO (European) time format, i.e. YYYY-DD-MM
vs YYYY-MM-DD
, it looks like it reads 2018-01-04
as 4th of January and puts it into the 2018-01-31
block (i.e. January 2018).
you want to set the option dayfirst=True
in your pd.to_datetime
call, see the Pandas doc for more details.