Reading multiindex from Excel - how do I reset index without losing leading zeroes?

As the title states, df.reset_index() is currently dropping leading zeroes in several of my columns because the columns are converted to int64. The Excel-files consists of product and business-codes, showing which codes belong in different aggregates. Different codes may have a varying number of leading zeroes, and the codes themselves has a varying number if digits, so .zfill() doesn't quite cut it. I have a quick-fix at hand, modifying the data, but I would like to learn how to circumnavigate such issues in the future.

This is how I currently read the files within a loop:

df = pd.read_excel("mappingOfAggregates.xlsx", index_col=[0,1,2,3], dtype=str)

I have also used df = df.astype('object') after each file has been read, but it does not seem to affect the final result when I use df.reset_index().

Chat GPT suggests that I reset index and then use .astype('object') on each column - which works as well as one would expect.

Any pointers in the right direction is received with thanks.

Solution

You can retain the indexes with index_col argument but to keep leading zeros you need to specify the type of the column as string not a numeric value.

Assuming you have columns A, B, C and D as the index, you can achieve that by providing a dictionary with each column name corresponds to a str, 'str' or pd.StringDtype() as follows:

pd.read_excel(
    "mappingOfAggregates.xlsx",
    index_col=[0, 1, 2, 3],
    dtypes={
        "A": pd.StringDtype(),
        "B": pd.StringDtype(),
        "C": pd.StringDtype(),
        "D": pd.StringDtype(),
    },
)