Search code examples
pythonpandasexcelmulti-index

Reading multiindex from Excel - how do I reset index without losing leading zeroes?


As the title states, df.reset_index() is currently dropping leading zeroes in several of my columns because the columns are converted to int64. The Excel-files consists of product and business-codes, showing which codes belong in different aggregates. Different codes may have a varying number of leading zeroes, and the codes themselves has a varying number if digits, so .zfill() doesn't quite cut it. I have a quick-fix at hand, modifying the data, but I would like to learn how to circumnavigate such issues in the future.

This is how I currently read the files within a loop:

df = pd.read_excel("mappingOfAggregates.xlsx", index_col=[0,1,2,3], dtype=str)

I have also used df = df.astype('object') after each file has been read, but it does not seem to affect the final result when I use df.reset_index().

Chat GPT suggests that I reset index and then use .astype('object') on each column - which works as well as one would expect.

Any pointers in the right direction is received with thanks.


Solution

  • You can retain the indexes with index_col argument but to keep leading zeros you need to specify the type of the column as string not a numeric value.

    Assuming you have columns A, B, C and D as the index, you can achieve that by providing a dictionary with each column name corresponds to a str, 'str' or pd.StringDtype() as follows:

    pd.read_excel(
        "mappingOfAggregates.xlsx",
        index_col=[0, 1, 2, 3],
        dtypes={
            "A": pd.StringDtype(),
            "B": pd.StringDtype(),
            "C": pd.StringDtype(),
            "D": pd.StringDtype(),
        },
    )