Search code examples
pythonpython-3.xpandaspython-datetime

Why Pandas refuse to read a date 9 centuries into the future?


Consider this example df.


import pandas as pd
from io import StringIO
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(mycsv)
df
    id  date
0   1   11/07/2018
1   2   11/07/2918
2   3   02/01/2019

Clearly there was a typo there (2918 instead of 2018), but I'd like to parse it as a date nonetheless.

So let's check df.dtypes

id       int64
date    object
dtype: object

Ok, by default it was read as a string. So I'll explicitly tell read_csv to parse that column as a date.

df = pd.read_csv(mycsv, parse_dates=["date"])

But df.dtypes still shows date was read as a string (object dtype).

If I correct the typo ...


mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2018\n3,02/01/2019")

it works

df = pd.read_csv(mycsv, parse_dates=["date"])
df

    id  date
0   1   2018-11-07
1   2   2018-11-07
2   3   2019-02-01

df.dtypes

id               int64
date    datetime64[ns]
dtype: object

So clearly it is failing to parse such an unrealistic date (11/07/2918) and then the whole column gets handled as string.

But why it cannot properly handle the 11/07/2918 date? and How can I make it correctly parse such date?

read_csv documentation says that by default it uses dateutil.parser.parse. And when you try by hand:

import dateutil
dateutil.parser.parse("13/07/2918") 

It just works. No exception, no error and produces a valid datetime object: datetime.datetime(2918, 7, 13, 0, 0)

Also converting that to numpy.datetime64 works

import dateutil
toy = dateutil.parser.parse("13/07/2918")
np.datetime64(toy)

It produces a valid and correctly parsed object.

numpy.datetime64('2918-07-13T00:00:00.000000')

Similarly, using pandas' strptime works all right and produces a valid datetime object.

pd.datetime.strptime("11/07/2918", "%d/%m/%Y")

Now, trying that with a custom date parser, just to make sure date-format is right

mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(
    mycsv,
    parse_dates=["date"],
    date_parser=lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
)

Again df["date"].dtype is dtype('O')

Ok, so I was giving up trying to convince read_csv to properly parse the date. So I said, let's just convert it to date.

Either this

df["date"].astype("datetime64")

or this

pd.to_datetime(df["date"])

Throws and exception

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2918-07-11 00:00:00

Nothing seems to work.

Any ideas why this happens and how to make it work?


Solution

  • From the docs:

    Since pandas represents timestamps in nanosecond resolution, the time span that can 
    be represented using a 64-bit integer is limited to approximately 584 years:
    
    In [92]: pd.Timestamp.min
    Out[92]: Timestamp('1677-09-21 00:12:43.145225')
    
    In [93]: pd.Timestamp.max
    Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
    

    How to represent out of bounds times: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob