Search code examples
pythonpython-polars

Polars read_excel incorrectly adds suffix to column names


I am using polars v1.12.0 to read data from an Excel sheet.

pl.read_excel(
    "test.xlsx",
    sheet_name="test",
    has_header=True,
    columns=list(range(30, 49))
)

The requested columns are being imported correctly. However, polars adds a suffix _1 to every column name. There's one column header where a _3 has been added.

In the requested columns, all column headers are unique, i.e. no duplicates. However, columns before this import area do have the same values. For example, the header that has been suffixed _3 does occur two times before my import area.

It looks like polars is scanning all column headers from column "A" starting, no matter if I start to read from column "AE".

I am wondering what is going on? Is this a bug or did I make a mistake?


Solution

  • I don't think you have made a mistake, the behaviour just seems to differ wildly between different engines, and none of them do what you want to do.

    I have the following excel:

    alpha | bravo | charlie | charlie | delta | echo | foxtrot | alfa
        1 |     a |       1 |       a |     1 |    a |       1 |    a
    

    For the following code snippet:

        df = pl.read_excel(
            "test.xlsx",
            sheet_name="test",
            has_header=True,
            columns=[3, 4, 5, 6, 7],
        )
    

    Here's what I get when using different excel engines:

    Calamine (default)

    ┌───────────┬───────┬──────┬─────────┬────────┐
    │ charlie_1 ┆ delta ┆ echo ┆ foxtrot ┆ alfa_1 │
    │ ---       ┆ ---   ┆ ---  ┆ ---     ┆ ---    │
    │ str       ┆ i64   ┆ str  ┆ i64     ┆ str    │
    ╞═══════════╪═══════╪══════╪═════════╪════════╡
    │ a         ┆ 1     ┆ a    ┆ 1       ┆ a      │
    └───────────┴───────┴──────┴─────────┴────────┘
    

    So the sequence seems to be:

    • Read all the columns, adding postfix to duplicates
    • Select only columns mentioned in columns

    Xlsx2csv (previous default)

    ┌─────────┬───────────────────┐
    │ foxtrot ┆ alfa_duplicated_0 │
    │ ---     ┆ ---               │
    │ i64     ┆ str               │
    ╞═════════╪═══════════════════╡
    │ 1       ┆ a                 │
    └─────────┴───────────────────┘
    

    Yes, really, it's dropping charlie, delta and echo completely. I think that's a straight up bug. If you start the indexing from 0 and list all the columns, it shows all columns, but if you start from 1, it already removes alfa AND bravo.

    Openpyxl

    ┌───────┬──────┬─────────┐
    │ delta ┆ echo ┆ foxtrot │
    │ ---   ┆ ---  ┆ ---     │
    │ i64   ┆ str  ┆ i64     │
    ╞═══════╪══════╪═════════╡
    │ 1     ┆ a    ┆ 1       │
    └───────┴──────┴─────────┘
    

    This is now dropping all the columns with the duplicate names first, and then taking the column indices defined in columns. Strictly speaking, not even dropping, but

    • first taking them in and keeping the column ordering
    • then overriding duplicate column data with the last column of the same name
    • filtering columns based on the indices in columns without taking the duplicate names in to account

    The columns 3, 4 and 5 are now delta, echo foxtrot, and 6 and 7 point nowhere.

    What to do

    So, based on this, your best bet I think is to use the default calamine engine and then manually override the columns:

    df.columns = ["charlie", "delta", "echo", "foxtrot", "alfa"]
    

    As for your dilemma in comments about stacking the differently named columns, this "works", but only when you know the column names and the schema beforehand. It is also stupidly ugly and probably not very performant. I hope there are better ways.

    I any case, a solution along the lines of reading all excel columns and then manipulating the df is probably easier than trying to manipulate the reader.

    For excel with columns alfa | bravo | charlie | alfa | bravo | charlie

    import polars as pl
    
    df = pl.read_excel(
        "test.xlsx",
        sheet_name="test",
        has_header=True,
    )
    
    
    new_df = pl.DataFrame(
        schema={"alfa": pl.Int64, "bravo": pl.Int64, "charlie": pl.Int64})
    
    n = 3
    
    for i in range(0, len(df.columns) // n):
        slice = df.select(pl.nth(range(i * n, i * final_columns + n)))
        slice.columns = ["alfa", "bravo", "charlie"]
    
        new_df = new_df.vstack(
            slice,
        )