Search code examples
pythonpandasheader

Print column information for a pandas frame with multiple header


I have a question hoping you can help me out here. I have an excel-file with multiple header, which I read following way

df=pd.read_excel(path,sheet_name="name",header=[1,2,3,4,5,6]

The original excel files looks like this

Required                         1 - Person
Legacy Contact Person Number     Business partner category
Central Data                     Central Data
Header Data                      Header Data
CHAR                             CHAR 
10                               1
PARTNER                          TYPE 
Test 1                           Name 1 
Test 2                           Name 2
Test 3                           Name 3
Test 4                           Name 4
Test 5                           Name 5

Problem: I am trying to find an easy way to have a following statement

df["PARTNER"]=df["PARTNER"].fillna("Missing")

I realized that there is an issue by printing the command

print(df["PARTNER"])

Question: Is there any way on how to print a column in a frame with multiple header and select the header I am interested in, in this case "PARTNER"?


Solution

  • To select (print) data from a particular level of a MultiIndex you can use xs:

    df = pd.DataFrame({('Legacy Contact Person Number', 'Central Data', 'Header Data', 'CHAR', '10', 'PARTNER'): ['Test 1', None, 'Test 3', 'Test 4', 'Test 5'],
                       ('Business partner category', 'Central Data', 'Header Data', 'CHAR', '1', 'TYPE'): ['Name 1', 'Name 2', 'Name 3', 'Name 4', 'Name 5']})
    
    print(df.xs(('PARTNER'), level=5, axis=1, drop_level=False))
    

    Output:

      Legacy Contact Person Number
                      Central Data
                       Header Data
                              CHAR
                                10
                           PARTNER
    0                       Test 1
    1                         None
    2                       Test 3
    3                       Test 4
    4                       Test 5
    

    However, you can't use xs to set data. This is possible with an IndexSlice:

    idx = pd.IndexSlice
    df.loc[:, idx[:, :, :, :, :, 'PARTNER']] = df.loc[:, idx[:, :, :, :, :, 'PARTNER']].fillna('Missing')
    

    Result:

      Legacy Contact Person Number Business partner category
                      Central Data              Central Data
                       Header Data               Header Data
                              CHAR                      CHAR
                                10                         1
                           PARTNER                      TYPE
    0                       Test 1                    Name 1
    1                      Missing                    Name 2
    2                       Test 3                    Name 3
    3                       Test 4                    Name 4
    4                       Test 5                    Name 5
    

    Instead of df.loc[:, idx[:, :, :, :, :, 'PARTNER']] you can also use df.loc(axis=1)[:, :, :, :, :, 'PARTNER'].