Search code examples
pythonpython-3.xpandasdataframepython-datetime

Adding months to a date which is bigger than the limit of Timestamp type


I have a df where one of the column is a date with a datetime64[ns] type.

Over this column I want to add months using another column of the dataframe as a base:

df['date_shifted']=df['date'].values.astype('datetime64[M]')+(df['months']).values.astype('timedelta64[M]')

My problem comes when I exceed the maximum of the datetime64 type and I retrieve the following error:

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2846-04-30 00:00:00

Are there any way to work around this error and add the months that I need to my dataframe?

Some example of possible data which would be an error after the calculation:

date months
28-01-2017 9999
13-05-2018 9999
22-03-2016 9999
05-12-2007 9999

Note: I know that I can coerce the errors to NaT but I need the dates for following calculations.


Solution

  • You can use Periods, as per the Representing out-of-bounds spans section of the guide on timestamps posted by @HenryEcker in comments. To convert the column simply use .dt.to_period():

    >>> df['date'].dt.to_period(freq='M')
    0    2017-01
    1    2018-05
    2    2016-03
    3    2007-05
    Name: date, dtype: period[M]
    

    The rest is easy, adding the int64 months can even be done without conversion:

    >>> df['shifted_date'] = df['date'].dt.to_period(freq='M') + df['months']
    >>> df
            date  months shifted_date
    0 2017-01-28    9999      2850-04
    1 2018-05-13    9999      2851-08
    2 2016-03-22    9999      2849-06
    3 2007-05-12    9999      2840-08
    >>> df['shifted_date']
    0    2850-04
    1    2851-08
    2    2849-06
    3    2840-08
    Name: shifted_date, dtype: period[M]
    

    Based on the dates you have you could use a smaller granularity period:

    >>> df['shifted_date'].astype('Period[D]')
    0    2850-04-30
    1    2851-08-31
    2    2849-06-30
    3    2840-08-31
    Name: shifted_date, dtype: period[D]
    

    Going back to datetimes would trigger the overflow you’re trying to avoid:

    >>> df['shifted_date'].dt.start_time
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python3.8/site-packages/pandas/core/accessor.py", line 78, in _getter
        return self._delegate_property_get(name)
      File "/usr/lib64/python3.8/site-packages/pandas/core/indexes/accessors.py", line 70, in _delegate_property_get
        result = getattr(values, name)
      File "/usr/lib64/python3.8/site-packages/pandas/core/arrays/period.py", line 420, in start_time
        return self.to_timestamp(how="start")
      File "/usr/lib64/python3.8/site-packages/pandas/core/arrays/period.py", line 465, in to_timestamp
        new_data = libperiod.periodarr_to_dt64arr(new_data.asi8, base)
      File "pandas/_libs/tslibs/period.pyx", line 977, in pandas._libs.tslibs.period.periodarr_to_dt64arr
      File "pandas/_libs/tslibs/conversion.pyx", line 246, in pandas._libs.tslibs.conversion.ensure_datetime64ns
      File "pandas/_libs/tslibs/np_datetime.pyx", line 113, in pandas._libs.tslibs.np_datetime.check_dts_bounds
    pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2850-04-01 00:00:00