Consider this example df.
import pandas as pd
from io import StringIO
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(mycsv)
df
id date
0 1 11/07/2018
1 2 11/07/2918
2 3 02/01/2019
Clearly there was a typo there (2918 instead of 2018), but I'd like to parse it as a date nonetheless.
So let's check df.dtypes
id int64
date object
dtype: object
Ok, by default it was read as a string. So I'll explicitly tell read_csv
to parse that column as a date.
df = pd.read_csv(mycsv, parse_dates=["date"])
But df.dtypes
still shows date was read as a string (object dtype).
If I correct the typo ...
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2018\n3,02/01/2019")
it works
df = pd.read_csv(mycsv, parse_dates=["date"])
df
id date
0 1 2018-11-07
1 2 2018-11-07
2 3 2019-02-01
df.dtypes
id int64
date datetime64[ns]
dtype: object
So clearly it is failing to parse such an unrealistic date (11/07/2918) and then the whole column gets handled as string.
But why it cannot properly handle the 11/07/2918 date? and How can I make it correctly parse such date?
read_csv
documentation says that by default it uses dateutil.parser.parse
. And when you try by hand:
import dateutil
dateutil.parser.parse("13/07/2918")
It just works. No exception, no error and produces a valid datetime
object: datetime.datetime(2918, 7, 13, 0, 0)
Also converting that to numpy.datetime64
works
import dateutil
toy = dateutil.parser.parse("13/07/2918")
np.datetime64(toy)
It produces a valid and correctly parsed object.
numpy.datetime64('2918-07-13T00:00:00.000000')
Similarly, using pandas
' strptime
works all right and produces a valid datetime object.
pd.datetime.strptime("11/07/2918", "%d/%m/%Y")
Now, trying that with a custom date parser, just to make sure date-format is right
mycsv = StringIO("id,date\n1,11/07/2018\n2,11/07/2918\n3,02/01/2019")
df = pd.read_csv(
mycsv,
parse_dates=["date"],
date_parser=lambda x: pd.datetime.strptime(x, "%d/%m/%Y")
)
Again df["date"].dtype
is dtype('O')
Ok, so I was giving up trying to convince read_csv
to properly parse the date. So I said, let's just convert it to date.
Either this
df["date"].astype("datetime64")
or this
pd.to_datetime(df["date"])
Throws and exception
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2918-07-11 00:00:00
Nothing seems to work.
Any ideas why this happens and how to make it work?
From the docs:
Since pandas represents timestamps in nanosecond resolution, the time span that can
be represented using a 64-bit integer is limited to approximately 584 years:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
How to represent out of bounds times: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-oob