Search code examples
pythonexcelpandasmulti-indexconverters

How to use converter in Pandas when using a MultiIndex


The issue

I have an excel table where the first row is a header and the second row is the unit of measurement for the rest of that column (i.e. nanometers, microns). Pandas provides an excellent read_excel function where I can pass a dictionary of converters. The key of the dictionary is the column name and the value is a lambda function that converts the excel value to some other value I want. In this case, the base value of whatever metric I'm using (nanometers to meters).

I cannot seem to figure out how to get my converter dictionary to use the second header row (the unit of measurement row). If I only specify my headers to take the unit row it works but I want the actual labels to be included in my header.

Here is my code

import numpy as np
import pandas as pd
import re
import os
from typing import Dict
from pandas.core.frame import DataFrame

Converters = {
  "GPa": lambda gpa: gpa * 1_000_000_000,
  "nm": lambda nm: nm / 1_000_000_000,
  "microns": lambda microns: microns / 1_000_000 
}

# Read and load metadata
directory = data_directory + "/" + metadata_directory
filenames = sorted(os.listdir(directory))
for filename in filenames:
  readData = pd.read_excel("./" + directory + "/" + filename, header=[0,1], converters=Converters)
  print(filename, "\n", readData.head(2))

OS Specs

Device name DESKTOP-AE4IMFH Processor Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz 1.50 GHz Installed RAM 12.0 GB (11.8 GB usable) Device ID 2B55F49B-6877-455D-88C5-D369A23FB40C Product ID 00325-96685-10579-AAOEM System type 64-bit operating system, x64-based processor Pen and touch Pen and touch support with 10 touch points

Edition Windows 10 Home Version 20H2 Installed on ‎7/‎23/‎2020 OS build 19042.1052 Experience Windows Feature Experience Pack 120.2212.2020.0

Python Version 3.9.5

What I've tried

Getting rid of the MultiIndex and just specifying the header as row 1 works great. However, I really want to have the column names as part of the header.

One thought was maybe to convert the DataFrame as a numpy array and then find the column index that matched each Converter function name. Then we could apply the conversion manually to each row at that column index. However, this feels hacky and would love to find a cleaner solution


Solution

  • I'm not sure I completely understand what you're trying to do. Nevertheless, here's a suggestion:

    In the following I'm using as an example an Excel-file test.xlsx with the content

    col_1  col_2  col_3
        1      2      3
        1      1      1
        2      2      2
        3      3      3
    

    This

    from operator import mul
    from functools import partial
    
    units = pd.read_excel('test.xlsx', nrows=1)
    converters = {
        col: partial(mul, 1 / units.at[0, col])
        for col in units.columns
    }
    df = pd.read_excel('test.xlsx', skiprows=[1], converters=converters)
    

    produces the following dataframe df:

       col_1  col_2     col_3
    0    1.0    0.5  0.333333
    1    2.0    1.0  0.666667
    2    3.0    1.5  1.000000
    

    Here the row which contains the units isn't included. If you want to include it then replace the last line with:

    df = pd.concat([
             units,
             pd.read_excel('test.xlsx', skiprows=[1], converters=converters)
         ]).reset_index(drop=True)
    

    Result:

       col_1  col_2     col_3
    0    1.0    2.0  3.000000
    1    1.0    0.5  0.333333
    2    2.0    1.0  0.666667
    3    3.0    1.5  1.000000
    

    (If you're wondering why I haven't used lambdas for the definition of the converters: This usually fails if you're defining them via variables.)

    So, if you want to integrate that into your code it would look like:

    from operator import mul
    from functools import partial
    
    ...
    
    for filename in filenames:
        filepath = "./" + directory + "/" + filename
        units = pd.read_excel(filepath, nrows=1)
        converters = {
           col: partial(mul, 1 / units.at[0, col])
           for col in units.columns
        }
       readData = pd.read_excel(filepath, skiprows=[1], converters=converters)
    

    EDIT: After rethinking the question today I realized that the use of converters is probably not the best approach here. The converter functions are so basic (simple division) that there's a better solution available:

    for filename in filenames:
       readData = pd.read_excel("./" + directory + "/" + filename)
    
       # Version 1: Discarding row with units
       readData = (readData.iloc[1:, :] / readData.iloc[0, :]).reset_index(drop=True)
    
       # Version 2: Keeping row with units
       readData.iloc[1:, :] /= readData.iloc[0, :]