Search code examples
pythonexcelpython-3.xpandas

Pandas merged cell issue when reading from excel


So I have a xls file with some weird style but there's nothing I can do about it so I just need to parse it.

enter image description here

As you can see I have some merged cells. What I want to do is fill the empty values for the merged cells ("ffill") but also keep the empty cells like they are.

Something like this

EIM, C,NI1 Enescu_Ioan, EIM, S,NI11,Enescu_Ioan EIM, C,NI1 Enescu_Ioan, Empty EIM, C,NI1 Enescu_Ioan EIM, S,NI11,Enescu_Ioan EIM, C,NI1,Enescu_Ioan Empty The way I'm loading the file right now is this.

xl = pd.ExcelFile("data/file.xls")
df = xl.parse(0, header=None)

I've also tried to open the file like this and access the merged cells but I get an empty list.

book = xlrd.open_workbook("data/file.xls")
book.sheet_by_index(0).merged_cells # This is empty []

Is there any way I could achieve this? Thanks!

EDIT

There might be some confusions regarding the question so I'll try to explain better. The attached image is a subset of a larger file where the columns may appear in different order. What I'm trying to achieve is a way of differentiating between merged cells NAN values (in a merged cell only the first column has a value, the rest are all nan) and empty cells NAN.


Solution

  • Managed to find a fix

    def read_excel(path):
        excel = None
        if path.endswith('xlsx'):
            excel = pd.ExcelFile(xlrd.open_workbook(path), engine='xlrd')
        elif path.endswith('xls'):
            excel = pd.ExcelFile(xlrd.open_workbook(path, formatting_info=True), engine='xlrd')
        else:
            raise ValueError("Could not read this type of data")
        return excel
    
    def parse_excel(excel_file):
        sheet_0 = excel_file.book.sheet_by_index(0)
        df = excel_file.parse(0, header=None)
        return sheet_0, df
    
    def fill_merged_na(sheet, dataframe):
        for e in sheet.merged_cells:
            rl, rh, cl, ch = e
            base_value = sheet.cell_value(rl, cl)
            dataframe.iloc[rl:rh, cl:ch] = base_value
        return dataframe
    

    Some of the important bits are opening the excel file with the formatting_info set to True in order to also read formatting such as merged cells and the fill_merged_na function that fills only the merged nan values but leaves the initial empty cells as they were.