Search code examples
pythonpandastime-seriesresampling

Why is Pandas resample sampling out of sample?


I've got an issue with pandas resample function when trying resample a time series. My program fetches daily traffic data two years back from today and populates it in a .csv file. Resampling the data initially functioned well but recently it has started acting up. When I try to resample the daily data into weekly, monthly or quarterly frequency, pandas seems to randomly give out-of sample (non-existent) data from both sides of the actual range.

I first create a Pandas data frame from the csv file:

    data = pd.read_csv('Trucks.csv')
data['Date'] =  pd.to_datetime(data['Date'], infer_datetime_format=True)
data.set_index('Date',inplace=True)
data['Modified Total Trucks'] = data['Modified Total Trucks'].astype(int)

Here's a sample of the data:

Date    Total Trucks     Modified Total Trucks  Solo Trucks Semi Trucks Full Trucks
2020-07-04  3898    2535    805 2281    812
2020-06-04  4125    2740    927 2378    820
2020-05-04  730 569 234 431 65
2020-04-04  465 354 145 270 50
2020-03-04  3501    2377    812 2051    638
2020-02-04  3594    2334    754 2081    759
...
2018-04-13  3243    2333    819 1978    446
2018-12-04  3402    2394    767 2144    491
2018-11-04  3559    2543    859 2209    491
2018-10-04  3492    2473    813 2182    497
2018-09-04  3733    2672    902 2321    510

I then try to resample the data:

DataWeekly = data.resample('1W').sum()
DataMonthly = data.resample('1M').sum()
DataQuarterly = data.resample('1Q').sum()

However, the resampled data frames have the wrong range and sometimes incorrect values. Here's an example of the monthly set:

Date    Total Trucks     Modified Total Trucks  Solo Trucks Semi Trucks Full Trucks
2018-01-31  15553   11119   3842    9531    2180
2018-02-28  18488   13113   4497    11291   2700
2018-03-31  21355   15177   5134    13176   3045
2018-04-30  67785   48478   16524   41893   9368
2018-05-31  72390   51690   17666   44594   10130
2018-06-30  63877   45356   14938   40000   8939
2018-07-31  64846   46437   16108   39703   9035
2018-08-31  68352   49036   16905   42081   9366
2018-09-30  64629   46379   15963   39842   8824
2018-10-31  68093   48609   16806   41643   9644
2018-11-30  74643   53052   18581   45073   10989
2018-12-31  60270   43042   15030   36649   8591
2019-01-31  76866   55463   18994   47789   10083
2019-02-28  74705   53744   18170   46674   9861
2019-03-31  78664   56562   19108   49144   10412
2019-04-30  77760   56175   19356   48224   10180
2019-05-31  88033   63219   22049   53859   12125
2019-06-30  70370   50626   17448   43454   9468
2019-07-31  76014   54531   18698   46947   10369
2019-08-31  83509   60418   21600   50653   11256
2019-09-30  77289   55375   19097   47517   10675
2019-10-31  83514   60021   20761   51397   11356
2019-11-30  81383   58460   20550   49551   11282
2019-12-31  68307   49172   17092   41990   9225
2020-01-31  59448   42384   14547   36472   8429
2020-02-29  53862   38544   13687   32457   7718
2020-03-31  62950   43478   14930   37403   10617
2020-04-30  7796    5645    1968    4811    1017
2020-05-31  7983    5840    2053    4951    979
2020-06-30  11200   7918    2785    6710    1705
2020-07-31  10998   7673    2576    6691    1731
2020-08-31  4602    3323    1155    2838    609
2020-09-30  7980    5794    1991    4981    1008
2020-10-31  9759    7060    2464    6012    1283
2020-11-30  7762    5595    1906    4836    1020
2020-12-31  7642    5412    1790    4760    1092

I would expect the resample to be:

2018-04-30  67785   48478   16524   41893   9368
    2018-05-31  72390   51690   17666   44594   10130
    2018-06-30  63877   45356   14938   40000   8939
    2018-07-31  64846   46437   16108   39703   9035
    2018-08-31  68352   49036   16905   42081   9366
    2018-09-30  64629   46379   15963   39842   8824
    2018-10-31  68093   48609   16806   41643   9644
    2018-11-30  74643   53052   18581   45073   10989
    2018-12-31  60270   43042   15030   36649   8591
    2019-01-31  76866   55463   18994   47789   10083
    2019-02-28  74705   53744   18170   46674   9861
    2019-03-31  78664   56562   19108   49144   10412
    2019-04-30  77760   56175   19356   48224   10180
    2019-05-31  88033   63219   22049   53859   12125
    2019-06-30  70370   50626   17448   43454   9468
    2019-07-31  76014   54531   18698   46947   10369
    2019-08-31  83509   60418   21600   50653   11256
    2019-09-30  77289   55375   19097   47517   10675
    2019-10-31  83514   60021   20761   51397   11356
    2019-11-30  81383   58460   20550   49551   11282
    2019-12-31  68307   49172   17092   41990   9225
    2020-01-31  59448   42384   14547   36472   8429
    2020-02-29  53862   38544   13687   32457   7718
    2020-03-31  62950   43478   14930   37403   10617
    2020-04-30  7796    5645    1968    4811    1017

What am I missing? Many thanks in advance!


Solution

  • I think this is a problem with US vs ISO (European) time format, i.e. YYYY-DD-MM vs YYYY-MM-DD, it looks like it reads 2018-01-04 as 4th of January and puts it into the 2018-01-31 block (i.e. January 2018).

    you want to set the option dayfirst=True in your pd.to_datetime call, see the Pandas doc for more details.