Tags: python, pandas, datetime, timezone, python-dateutil

Why does read_csv give me a timezone warning?


I am reading a CSV file with pandas and get a warning I do not understand:

Lib\site-packages\dateutil\parser\_parser.py:1207: UnknownTimezoneWarning: tzname B identified but not understood.  Pass `tzinfos` argument in order to correctly return a timezone-aware datetime.  In a future version, this will raise an exception.
  warnings.warn("tzname {tzname} identified but not understood.  "

I do nothing special, just pd.read_csv with parse_dates=True. I see no B that looks like a timezone anywhere in my data. What does the warning mean?

A minimal reproducible example is the following:

import io
import pandas as pd
pd.read_csv(io.StringIO('x\n1A2B'), index_col=0, parse_dates=True)

Why does pandas think 1A2B is a datetime?!

To work around this, I tried adding dtype={'x': str} to force the column to be read as a string, but I still get the warning.


Solution

  • It turns out 1A2B is being interpreted as "1 AM on day 2 of the current month, timezone B". By default (when date_parser= is not set), read_csv uses dateutil to detect datetime values:

    import dateutil.parser
    dateutil.parser.parse('1A2B')
    

    Apart from emitting the warning, this returns (as of today; the year and month are filled in from the current date):

    datetime.datetime(2023, 1, 2, 1, 0)
    

    And indeed, B is not a timezone specifier that dateutil understands.
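To make the defaulting behaviour easier to see, here is a minimal sketch that passes a fixed `default` to `dateutil.parser.parse` and records the warning instead of printing it. The date 2000-01-01 is an arbitrary choice for illustration, and the sketch assumes a dateutil version that still warns rather than raises for unknown timezone names.

```python
import datetime
import warnings

import dateutil.parser

# Fixed default so the result does not depend on today's date
# (2000-01-01 is an arbitrary, illustrative choice).
default = datetime.datetime(2000, 1, 1)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    result = dateutil.parser.parse("1A2B", default=default)

# The hour ("1A") and the day ("2") are taken from the string;
# every other field is copied from `default`.
print(result)
print([w.category.__name__ for w in caught])
```

The recorded warning category is the same `UnknownTimezoneWarning` that shows up when `read_csv` triggers the parse.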

    Why adding dtype doesn't help remains to be investigated.

    I did find a simple hack that works:

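    import dateutil.parser

    def dateparse(self, timestr, default=None, ignoretz=False, tzinfos=None, **kwargs):
        # Return the raw result of the private _parse method, skipping the step
        # that fills in missing fields from today's date and resolves the timezone.
        return self._parse(timestr, **kwargs)

    dateutil.parser.parser.parse = dateparse  # Monkey patch; hack!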
    

    This prevents the current day, month, and year from being used as defaults, so the value is no longer accepted as a datetime, as expected.
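A less invasive alternative, sketched here as an assumption rather than as part of the original answer, is to leave out parse_dates entirely and convert the index explicitly with `pd.to_datetime` and a fixed format. The format string `"%Y-%m-%d"` is hypothetical; substitute whatever format the real date values use. With an explicit format, values that do not match are turned into NaT instead of being guessed at.

```python
import io

import pandas as pd

# Read without parse_dates so the index is left as plain strings.
df = pd.read_csv(io.StringIO("x\n1A2B"), index_col=0)

# Convert explicitly. "%Y-%m-%d" is an assumed format for illustration only;
# errors="coerce" maps non-matching values to NaT instead of raising.
parsed = pd.to_datetime(df.index, format="%Y-%m-%d", errors="coerce")

print(list(df.index))
print(parsed)
```

This keeps dateutil's global behaviour untouched, at the cost of having to know the date format up front.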