Search code examples
pythonexcelformatxlrd

Python xlrd: Reading format of each cell values


I am currently working on a data engineering project. At the moment I want to read an Excel file with the extension '.xls' into my Python workspace. This is not a big problem with Pandas. However, I also want to import the format of the Excel file. That is, I want to read the color of each value in each cell and whether it is struck out.

I have tried different ways to solve this problem. Below you can see my last attempt, which leads to the information about the value of the font color for the whole cell and whether the cell is crossed out. But I only get one value for each cell, although there are many values in the cell and these values can be colored black, red, green, ... can be colored and can be struck out or not.

I open the workbook with xlrd. Then I go through each row and column of the cells. I read the font of the workbook. Then I save in strike_bool and font_color_bool whether the font corresponds to the permitted color and whether the cell is struck through. Depending on the type of the cell value, I save it in a filtered_row list in its correct data type. This list represents the row of the Excel file. I then save the formatted list of cell values of a row filtered_row in a list for the column values filtered_data. This list filtered_data is then converted into a Pandas data frame df_proper.

Right now, I receive for each cell only one value for the font color and one value for the struck out function. But I need the font color for each value in the cell. Same for struck out values in the cell-

How can I go through each value in a particular cell and check the color of each value in the cell and whether a particular value of the cell is struck out?

My Code:

import pandas as pd
import numpy as np
import xlrd
import math

palette = self.get_color_palette(data_path)
workbook = xlrd.open_workbook(data_path, formatting_info=True, on_demand=True)

for sheet_name in sheet_names:
    print(f'Read sheet: {sheet_name}')
    df_proper = pd.DataFrame()
    sheet = workbook.sheet_by_name(sheet_name)
                    
    # Iterate through the cells
    filtered_data = []
    for row in range(sheet.nrows):
        filtered_row = []
        for col in range(sheet.ncols):
            keep_cell = False
            cell = sheet.cell(row, col)
            xf_index = cell.xf_index
            xf = workbook.xf_list[xf_index]
            font_index = xf.font_index
            font = workbook.font_list[font_index]
            strike_bool = False
            # Check if cell is struck out
            if not font.struck_out:
                 strike_bool = True
            else:
                 strike_bool= False

            font_color_bool = False
            # Check if color meets condition
            if self.compare_font_color(palette[font.colour_index]):
                font_color_bool = True
            else:
                font_color_bool = False

            if font_color_bool and strike_bool:
                keep_cell = True
                                
            if cell.value == '':
                filtered_row.append(math.nan)
            elif isinstance(cell.value, (int, float)):
                if isinstance(cell.value, float) and cell.value.is_integer():
                    filtered_row.append(int(cell.value) if keep_cell else None)
                elif isinstance(cell.value, float):
                    filtered_row.append(float(cell.value) if keep_cell else None)
                else:
                    filtered_row.append(int(cell.value) if keep_cell else None)
            else:
                filtered_row.append(str(cell.value) if keep_cell else None)
        filtered_data.append(filtered_row)
    # DataFrame aus den gefilterten Daten erstellen
    df_proper = pd.DataFrame(filtered_data)
    dfs[sheet_name] = []
    dfs[sheet_name] = df_proper
    workbook.unload_sheet(sheet_name)

Example:

Column A Column B
That can be not a value Value 2
Value 3 Value 4

The script should recognize that "That" is written in color red and "not" ist struck out.


Solution

  • I think the key to your problem is to get the format of every single character of a cell in the xls file.
    I make an example table like this:

    enter image description here

    The text value of the cell of (0, 0) is ab, b is red color.
    The text value of the cell of (1, 0) is cd, d is struck out.

    Here is my code reading all these styles. python 3.7.9 and xlrd 1.2.0.

    import xlrd
    # accessing Column 'A' in this example
    COL_IDX = 0
    
    with xlrd.open_workbook('xls_file', formatting_info=True) as book:
        sheet = book.sheet_by_index(0)
        for row_idx in range(sheet.nrows):
            text_cell = sheet.cell_value(row_idx, COL_IDX)
            text_cell_xf = book.xf_list[sheet.cell_xf_index(row_idx, COL_IDX)]
    
            # skip rows where cell is empty
            if not text_cell:
                continue
    
            print(f'============\nText of ({row_idx}, {COL_IDX}) is `{text_cell}`')
    
            text_cell_runlist = sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
            if text_cell_runlist:
                print('============\nStyle segments of this cell:')
    
            segments = []
            for segment_idx in range(len(text_cell_runlist)):
                start = text_cell_runlist[segment_idx][0]
                # the last segment starts at given 'start' and ends at the end of the string
                end = None
                if segment_idx != len(text_cell_runlist) - 1:
                    end = text_cell_runlist[segment_idx + 1][0]
                segment_text = text_cell[start:end]
                segments.append({
                    'text': segment_text,
                    'font': book.font_list[text_cell_runlist[segment_idx][1]]
                })
    
                # segments did not start at beginning, assume cell starts with text styled as the cell
                if text_cell_runlist[0][0] != 0:
                    segments.insert(0, {
                        'text': text_cell[:text_cell_runlist[0][0]],
                        'font': book.font_list[text_cell_xf.font_index]
                    })
                for segment in segments:
                    print('------------\nTEXT:', segment['text'])
                    print('color:', segment['font'].colour_index)
                    print('bold:', segment['font'].bold)
                    print('struck out:', segment['font'].struck_out)
                else:
                    print('------------\nCell whole style')
                    print('italic:', book.font_list[text_cell_xf.font_index].italic)
                    print('bold:', book.font_list[text_cell_xf.font_index].bold)
                    print('struck out:', book.font_list[text_cell_xf.font_index].struck_out)
    

    Output:

    ============
    Text of (0, 0) is `ab`
    ============
    Style segments of this cell:
    ------------
    TEXT: a
    color: 32767
    bold: 0
    struck out: 0
    ------------
    TEXT: b
    color: 10
    bold: 0
    struck out: 0
    ------------
    Cell whole style
    italic: 0
    bold: 0
    struck out: 0
    ============
    Text of (1, 0) is `cd`
    ============
    Style segments of this cell:
    ------------
    TEXT: c
    color: 32767
    bold: 0
    struck out: 0
    ------------
    TEXT: d
    color: 32767
    bold: 0
    struck out: 1
    ------------
    Cell whole style
    italic: 0
    bold: 0
    struck out: 0
    

    Additional:
    enter image description here