Search code examples
pythonpandasdatetimedate-parsingxlsb

Pandas read_excel doesn't parse dates correctly - returns a constant date instead


I've read a .xlsb file and parsed date columns using a code below:

dateparser = lambda x: pd.to_datetime(x)

data = pd.read_excel(r"test.xlsb", engine="pyxlsb",
                 parse_dates=["start_date","end_date"],
                 date_parser=dateparser
                 )

My input columns in the .xlsb file have format DD/MM/YYYY (e.g. 26/01/2008). As an output of the above-mentioned code I get, for example: 1970-01-01 00:00:00.000038840. Only the last 5 digits changes.

If I read the same file without parsing dates, the same columns are of float64 type and containing only the last 5 digits of output before (e.g. 38840.0).

I assume this is a problem associated with date encoding itself. Does anyone know how to fix this issue?


Solution

  • I am not sure if you were able to figure out the answer to this problem. But, below is how I resolved it:

    from pyxlsb import convert_date
    
    self.data: pd.DataFrame = pd.read_excel(self.file, sheet_name=self.sheet, engine='pyxlsb', header=0)
    self.data["test"] = self.data.apply(lambda x: convert_date(x.SomeStupidDate), axis=1)
    

    More details can be found here: https://pypi.org/project/pyxlsb/ by doing ctrl+F for "convert_date".