Parse Excel Report

I have an Excel workbook report(abc.xlsx) that I am trying to parse to get data from a couple of columns and I am wondering how this can be done in Python/Pandas. The report is in a report format so it is not nicely formatted like a row-column configuration and it looks like this:

                                 My Report
                                 ABC LLC
                        from 06/01/2015 to 06/30/2015 

Instrument                Identifier    Quantity    Start    End     Total 

North America
                            XYZ           100         0      0        (1,893.52)
North America Subtotal                                                (1,893.52)

Europe
                            ABC            50         10     20        (4,653.21)
Europe Subtotal                                                       (4,653.21

The data that I am interested in getting is the identifier (in my Excel it is in columns B and C and they are merged) and Total (column G). The challenge here is this format is somewhat irregular and with a subtotal line. One way I can think of it read this via Pandas using read_excel and then iterate through the rows and just consider rows where column B&C and G are populated, but then not sure how to handle the situation of the headers plus the merged cells. This approach is also pretty ugly as well, and so I am wondering if anyone has experience parsing similar reports in Excel as well.

Solution

This is how I might handle this. It will not work in every data munging situation, of course, but it seems like it works on the example you provided.

Going to use pd.read_excel, then skip the first several rows on import to just get them out of the way.

In [1]: import pandas as pd

In [2]: df = pd.read_excel("abc.xls", skiprows=4)

In [3]: df
Out[3]:
      Instrument Identifier  Quantity  Start       End    Total
0            NaN        NaN       NaN    NaN       NaN      NaN
1  North America        NaN       NaN    NaN       NaN      NaN
2            NaN        XYZ       100      0         0  1893.52
3  North America        NaN       NaN    NaN  Subtotal  1893.52
4            NaN        NaN       NaN    NaN       NaN      NaN
5         Europe        NaN       NaN    NaN       NaN      NaN
6            NaN        ABC        50     10        20  4653.21
7         Europe        NaN       NaN    NaN  Subtotal  4653.21

Here I'll use the ffill argument of the fillna method to forward fill in NaN values in the Instruments column.

In [4]: df.Instrument.fillna(method="ffill", inplace=True)

In [5]: df
Out[5]:
      Instrument Identifier  Quantity  Start       End    Total
0            NaN        NaN       NaN    NaN       NaN      NaN
1  North America        NaN       NaN    NaN       NaN      NaN
2  North America        XYZ       100      0         0  1893.52
3  North America        NaN       NaN    NaN  Subtotal  1893.52
4  North America        NaN       NaN    NaN       NaN      NaN
5         Europe        NaN       NaN    NaN       NaN      NaN
6         Europe        ABC        50     10        20  4653.21
7         Europe        NaN       NaN    NaN  Subtotal  4653.21

Now just filter out the null values from the Identifier column.

In [6]: df[df.Identifier.notnull()]
Out[6]:
      Instrument Identifier  Quantity  Start End    Total
2  North America        XYZ       100      0   0  1893.52
6         Europe        ABC        50     10  20  4653.21