Search code examples
pythonpandasreportimport-from-excel

Parse Excel Report


I have an Excel workbook report(abc.xlsx) that I am trying to parse to get data from a couple of columns and I am wondering how this can be done in Python/Pandas. The report is in a report format so it is not nicely formatted like a row-column configuration and it looks like this:

                                 My Report
                                 ABC LLC
                        from 06/01/2015 to 06/30/2015 

Instrument                Identifier    Quantity    Start    End     Total 

North America
                            XYZ           100         0      0        (1,893.52)
North America Subtotal                                                (1,893.52)

Europe
                            ABC            50         10     20        (4,653.21)
Europe Subtotal                                                       (4,653.21     

The data that I am interested in getting is the identifier (in my Excel it is in columns B and C and they are merged) and Total (column G). The challenge here is this format is somewhat irregular and with a subtotal line. One way I can think of it read this via Pandas using read_excel and then iterate through the rows and just consider rows where column B&C and G are populated, but then not sure how to handle the situation of the headers plus the merged cells. This approach is also pretty ugly as well, and so I am wondering if anyone has experience parsing similar reports in Excel as well.


Solution

  • This is how I might handle this. It will not work in every data munging situation, of course, but it seems like it works on the example you provided.

    Going to use pd.read_excel, then skip the first several rows on import to just get them out of the way.

    In [1]: import pandas as pd
    
    In [2]: df = pd.read_excel("abc.xls", skiprows=4)
    
    In [3]: df
    Out[3]:
          Instrument Identifier  Quantity  Start       End    Total
    0            NaN        NaN       NaN    NaN       NaN      NaN
    1  North America        NaN       NaN    NaN       NaN      NaN
    2            NaN        XYZ       100      0         0  1893.52
    3  North America        NaN       NaN    NaN  Subtotal  1893.52
    4            NaN        NaN       NaN    NaN       NaN      NaN
    5         Europe        NaN       NaN    NaN       NaN      NaN
    6            NaN        ABC        50     10        20  4653.21
    7         Europe        NaN       NaN    NaN  Subtotal  4653.21
    

    Here I'll use the ffill argument of the fillna method to forward fill in NaN values in the Instruments column.

    In [4]: df.Instrument.fillna(method="ffill", inplace=True)
    
    In [5]: df
    Out[5]:
          Instrument Identifier  Quantity  Start       End    Total
    0            NaN        NaN       NaN    NaN       NaN      NaN
    1  North America        NaN       NaN    NaN       NaN      NaN
    2  North America        XYZ       100      0         0  1893.52
    3  North America        NaN       NaN    NaN  Subtotal  1893.52
    4  North America        NaN       NaN    NaN       NaN      NaN
    5         Europe        NaN       NaN    NaN       NaN      NaN
    6         Europe        ABC        50     10        20  4653.21
    7         Europe        NaN       NaN    NaN  Subtotal  4653.21
    

    Now just filter out the null values from the Identifier column.

    In [6]: df[df.Identifier.notnull()]
    Out[6]:
          Instrument Identifier  Quantity  Start End    Total
    2  North America        XYZ       100      0   0  1893.52
    6         Europe        ABC        50     10  20  4653.21