I am currently working on a data engineering project. At the moment I want to read an Excel file with the extension '.xls' into my Python workspace. This is not a big problem with Pandas. However, I also want to import the format of the Excel file. That is, I want to read the color of each value in each cell and whether it is struck out.
I have tried different ways to solve this problem. Below you can see my last attempt, which leads to the information about the value of the font color for the whole cell and whether the cell is crossed out. But I only get one value for each cell, although there are many values in the cell and these values can be colored black, red, green, ... can be colored and can be struck out or not.
I open the workbook with xlrd. Then I go through each row and column of the cells. I read the font of the workbook. Then I save in strike_bool
and font_color_bool
whether the font corresponds to the permitted color and whether the cell is struck through. Depending on the type of the cell value, I save it in a filtered_row
list in its correct data type. This list represents the row of the Excel file. I then save the formatted list of cell values of a row filtered_row
in a list for the column values filtered_data
. This list filtered_data
is then converted into a Pandas data frame df_proper
.
Right now, I receive for each cell only one value for the font color and one value for the struck out function. But I need the font color for each value in the cell. Same for struck out values in the cell-
How can I go through each value in a particular cell and check the color of each value in the cell and whether a particular value of the cell is struck out?
My Code:
import pandas as pd
import numpy as np
import xlrd
import math
palette = self.get_color_palette(data_path)
workbook = xlrd.open_workbook(data_path, formatting_info=True, on_demand=True)
for sheet_name in sheet_names:
print(f'Read sheet: {sheet_name}')
df_proper = pd.DataFrame()
sheet = workbook.sheet_by_name(sheet_name)
# Iterate through the cells
filtered_data = []
for row in range(sheet.nrows):
filtered_row = []
for col in range(sheet.ncols):
keep_cell = False
cell = sheet.cell(row, col)
xf_index = cell.xf_index
xf = workbook.xf_list[xf_index]
font_index = xf.font_index
font = workbook.font_list[font_index]
strike_bool = False
# Check if cell is struck out
if not font.struck_out:
strike_bool = True
else:
strike_bool= False
font_color_bool = False
# Check if color meets condition
if self.compare_font_color(palette[font.colour_index]):
font_color_bool = True
else:
font_color_bool = False
if font_color_bool and strike_bool:
keep_cell = True
if cell.value == '':
filtered_row.append(math.nan)
elif isinstance(cell.value, (int, float)):
if isinstance(cell.value, float) and cell.value.is_integer():
filtered_row.append(int(cell.value) if keep_cell else None)
elif isinstance(cell.value, float):
filtered_row.append(float(cell.value) if keep_cell else None)
else:
filtered_row.append(int(cell.value) if keep_cell else None)
else:
filtered_row.append(str(cell.value) if keep_cell else None)
filtered_data.append(filtered_row)
# DataFrame aus den gefilterten Daten erstellen
df_proper = pd.DataFrame(filtered_data)
dfs[sheet_name] = []
dfs[sheet_name] = df_proper
workbook.unload_sheet(sheet_name)
Example:
Column A | Column B |
---|---|
That can be |
Value 2 |
Value 3 | Value 4 |
The script should recognize that "That" is written in color red and "not" ist struck out.
I think the key to your problem is to get the format of every single character of a cell in the xls
file.
I make an example table like this:
The text value of the cell of (0, 0) is ab
, b
is red color.
The text value of the cell of (1, 0) is cd
, d
is struck out.
Here is my code reading all these styles. python 3.7.9
and xlrd 1.2.0
.
import xlrd
# accessing Column 'A' in this example
COL_IDX = 0
with xlrd.open_workbook('xls_file', formatting_info=True) as book:
sheet = book.sheet_by_index(0)
for row_idx in range(sheet.nrows):
text_cell = sheet.cell_value(row_idx, COL_IDX)
text_cell_xf = book.xf_list[sheet.cell_xf_index(row_idx, COL_IDX)]
# skip rows where cell is empty
if not text_cell:
continue
print(f'============\nText of ({row_idx}, {COL_IDX}) is `{text_cell}`')
text_cell_runlist = sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
if text_cell_runlist:
print('============\nStyle segments of this cell:')
segments = []
for segment_idx in range(len(text_cell_runlist)):
start = text_cell_runlist[segment_idx][0]
# the last segment starts at given 'start' and ends at the end of the string
end = None
if segment_idx != len(text_cell_runlist) - 1:
end = text_cell_runlist[segment_idx + 1][0]
segment_text = text_cell[start:end]
segments.append({
'text': segment_text,
'font': book.font_list[text_cell_runlist[segment_idx][1]]
})
# segments did not start at beginning, assume cell starts with text styled as the cell
if text_cell_runlist[0][0] != 0:
segments.insert(0, {
'text': text_cell[:text_cell_runlist[0][0]],
'font': book.font_list[text_cell_xf.font_index]
})
for segment in segments:
print('------------\nTEXT:', segment['text'])
print('color:', segment['font'].colour_index)
print('bold:', segment['font'].bold)
print('struck out:', segment['font'].struck_out)
else:
print('------------\nCell whole style')
print('italic:', book.font_list[text_cell_xf.font_index].italic)
print('bold:', book.font_list[text_cell_xf.font_index].bold)
print('struck out:', book.font_list[text_cell_xf.font_index].struck_out)
Output:
============
Text of (0, 0) is `ab`
============
Style segments of this cell:
------------
TEXT: a
color: 32767
bold: 0
struck out: 0
------------
TEXT: b
color: 10
bold: 0
struck out: 0
------------
Cell whole style
italic: 0
bold: 0
struck out: 0
============
Text of (1, 0) is `cd`
============
Style segments of this cell:
------------
TEXT: c
color: 32767
bold: 0
struck out: 0
------------
TEXT: d
color: 32767
bold: 0
struck out: 1
------------
Cell whole style
italic: 0
bold: 0
struck out: 0