How to fix date format issues while reading xlsx files using polars?

Update: Polars now uses fastexcel as the default read_excel engine.

The given example now runs as expected.

I have an excel file with an extension .xlsx. and it has also a field called date_of_birth, this filed is having years from 1860,1870 to till 2000 as below

Here is a command that I used for loading an excel:

import tempfile
import polars as pl
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:

    filename = Path(tmp_dir) / "1.xlsx"

    (pl.from_repr("""
    ┌────────────────┐
    │ date_of_birth  │
    │ ---            │
    │ str            │
    ╞════════════════╡
    │ 1/11/1887      │
    │ 1/11/1887      │
    │ 1/11/1889      │
    │ 1/12/1886      │
    │ 1/13/1889      │
    │ 1/15/1886      │
    │ 1/17/1888      │
    │ 1/18/1889      │
    │ 1/18/1885      │
    │ 1/2/1880       │    
    └────────────────┘
    """)
    .with_columns(pl.col.date_of_birth.str.to_date("%m/%d/%Y"))
    .write_excel(filename))

    df = pl.read_excel(filename)

    print(df)

On running this it gives an error:

XlsxValueError: Error: potential invalid date format.

How to ignore/Fix this error while reading the file so that I would get the data as it is in data frame. Is there any work around for this?

Solution

Well, the error is due to strftime function which does not support pre-1900 years.

Probably polars is using that and it causes the problem.

You may try not parsing the dates on polar function; so that you can read the CSV file (and dates stay as String). And when you need to parse the dates; just use strptime like:

datetime.datetime.strptime("1800/04/10", "%Y/%m/%d")

Also, you may try to use with_columns method of polars framework (I couldn't test it yet; will update after trying it):

df_pl = pl.read_excel("Data_Set_14_Data.xlsx").with_columns(
    pl.col("<last_col_name>").str.to_date("%m/%d/%Y")
)