I am using polars v1.12.0 to read data from an Excel sheet.
pl.read_excel(
"test.xlsx",
sheet_name="test",
has_header=True,
columns=list(range(30, 49))
)
The requested columns are being imported correctly. However, polars adds a suffix _1
to every column name. There's one column header where a _3
has been added.
In the requested columns, all column headers are unique, i.e. no duplicates. However, columns before this import area do have the same values. For example, the header that has been suffixed _3
does occur two times before my import area.
It looks like polars is scanning all column headers from column "A" starting, no matter if I start to read from column "AE".
I am wondering what is going on? Is this a bug or did I make a mistake?
I don't think you have made a mistake, the behaviour just seems to differ wildly between different engines, and none of them do what you want to do.
I have the following excel:
alpha | bravo | charlie | charlie | delta | echo | foxtrot | alfa
1 | a | 1 | a | 1 | a | 1 | a
For the following code snippet:
df = pl.read_excel(
"test.xlsx",
sheet_name="test",
has_header=True,
columns=[3, 4, 5, 6, 7],
)
Here's what I get when using different excel engines:
┌───────────┬───────┬──────┬─────────┬────────┐
│ charlie_1 ┆ delta ┆ echo ┆ foxtrot ┆ alfa_1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ str │
╞═══════════╪═══════╪══════╪═════════╪════════╡
│ a ┆ 1 ┆ a ┆ 1 ┆ a │
└───────────┴───────┴──────┴─────────┴────────┘
So the sequence seems to be:
columns
┌─────────┬───────────────────┐
│ foxtrot ┆ alfa_duplicated_0 │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════╪═══════════════════╡
│ 1 ┆ a │
└─────────┴───────────────────┘
Yes, really, it's dropping charlie, delta and echo completely. I think that's a straight up bug. If you start the indexing from 0 and list all the columns, it shows all columns, but if you start from 1, it already removes alfa
AND bravo
.
┌───────┬──────┬─────────┐
│ delta ┆ echo ┆ foxtrot │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═══════╪══════╪═════════╡
│ 1 ┆ a ┆ 1 │
└───────┴──────┴─────────┘
This is now dropping all the columns with the duplicate names first, and then taking the column indices defined in columns
. Strictly speaking, not even dropping, but
columns
without taking the duplicate names in to accountThe columns 3, 4 and 5 are now delta, echo foxtrot
, and 6 and 7 point nowhere.
So, based on this, your best bet I think is to use the default calamine
engine and then manually override the columns:
df.columns = ["charlie", "delta", "echo", "foxtrot", "alfa"]
As for your dilemma in comments about stacking the differently named columns, this "works", but only when you know the column names and the schema beforehand. It is also stupidly ugly and probably not very performant. I hope there are better ways.
I any case, a solution along the lines of reading all excel columns and then manipulating the df is probably easier than trying to manipulate the reader.
For excel with columns alfa | bravo | charlie | alfa | bravo | charlie
import polars as pl
df = pl.read_excel(
"test.xlsx",
sheet_name="test",
has_header=True,
)
new_df = pl.DataFrame(
schema={"alfa": pl.Int64, "bravo": pl.Int64, "charlie": pl.Int64})
n = 3
for i in range(0, len(df.columns) // n):
slice = df.select(pl.nth(range(i * n, i * final_columns + n)))
slice.columns = ["alfa", "bravo", "charlie"]
new_df = new_df.vstack(
slice,
)