Search code examples
pandasmulti-index

Importing excel spreadsheet with user formatted tables


I am trying to import an excel spreadsheet in pandas but I have some issue due to data being "structured visually by user" to be user friendly visually and not using proper data structures.

enter image description here As a result, when imported in pandas as dataframe I get this:

>>> df                 
   Unnamed: 0  Unnamed: 1  Unnamed: 2 Unnamed: 3 Unnamed: 4     Unnamed: 5 Unnamed: 6 Unnamed: 7     Unnamed: 8 Unnamed: 9 Unnamed: 10    Unnamed: 11 Unnamed: 12
0         NaN         NaN         NaN        NaN        NaN            NaN        NaN        NaN            NaN        NaN         NaN            NaN         NaN
1         NaN         NaN         NaN        NaN        NaN            NaN        NaN        NaN            NaN        NaN         NaN            NaN         NaN
2         NaN         NaN         NaN        NaN        NaN            NaN        NaN        NaN            NaN        NaN         NaN            NaN         NaN
3         NaN         NaN         NaN        NaN   ROUTER 1            NaN        NaN   ROUTER 2            NaN        NaN    ROUTER 3            NaN         NaN
4         NaN         NaN         NaN    SERVERS         IP        NETMASK    GATEWAY         IP        NETMASK    GATEWAY          IP        NETMASK     GATEWAY
5         NaN         NaN         NaN   SERVER 1  10.0.0.10  255.255.255.0   10.0.0.1  10.1.0.10  255.255.255.0   10.0.0.1   10.2.0.10  255.255.255.0    10.0.0.1
6         NaN         NaN         NaN   SERVER 2  10.0.1.10  255.255.255.0   10.0.1.1  10.1.1.10  255.255.255.0   10.0.1.1   10.2.1.10  255.255.255.0    10.0.1.1

What I want to achieve as desired output is this:

>>> result
                           ROUTER 1                              ROUTER 2                              ROUTER 3
 SERVERS         IP        NETMASK    GATEWAY         IP        NETMASK    GATEWAY          IP        NETMASK     GATEWAY
 SERVER 1  10.0.0.10  255.255.255.0   10.0.0.1  10.1.0.10  255.255.255.0   10.0.0.1   10.2.0.10  255.255.255.0    10.0.0.1
 SERVER 2  10.0.1.10  255.255.255.0   10.0.1.1  10.1.1.10  255.255.255.0   10.0.1.1   10.2.1.10  255.255.255.0    10.0.1.1

where I get proper multi-indexing of columns (which is exactly what I have in Excel..)

((ROUTER 1, [ IP, NETMASK, GATEWAY]), (ROUTER 2, [ IP, NETMASK, GATEWAY]), (ROUTER 3, [ IP, NETMASK, GATEWAY]))

My attempts so far are rather poor.. I got stuck here..

>>> pd.read_excel('Test.xlsx',header=3).dropna(axis=1, how='all').stack().unstack(0).reset_index(drop=True).fillna(method='ffill')                                                                                          
          0        1              2              3
0       NaN  SERVERS       SERVER 1       SERVER 2
1  ROUTER 1       IP      10.0.0.10      10.0.1.10
2  ROUTER 1  NETMASK  255.255.255.0  255.255.255.0
3  ROUTER 1  GATEWAY       10.0.0.1       10.0.1.1
4  ROUTER 2       IP      10.1.0.10      10.1.1.10
5  ROUTER 2  NETMASK  255.255.255.0  255.255.255.0
6  ROUTER 2  GATEWAY       10.0.0.1       10.0.1.1
7  ROUTER 3       IP      10.2.0.10      10.2.1.10
8  ROUTER 3  NETMASK  255.255.255.0  255.255.255.0
9  ROUTER 3  GATEWAY       10.0.0.1       10.0.1.1

Thank you so much for your help in advance.


Solution

  • Combining what you've tried with what you've been advised by @Mayank, here is my 2 cents:

    >>> pd.read_excel('Test.xls', skiprows=3, usecols="D:M")
            .stack()
            .unstack(0)
            .reset_index(drop=True)
            .fillna(method='ffill')
            .pivot_table(columns=[0,1] aggfunc='first')
    

    I think that should give you what you're looking for.