Search code examples
pythonexcelreverse-engineeringopenpyxlxlrd

How to get each char format from XLSX cell with xlrd?


I have an excel with format like this:

enter image description here

and I read this cell using python:

wb = open_workbook(xlsx_path,formatting_info=True)
sheet = wb.sheet_by_name(sheet_name)
cell = sheet.cell(i,j)
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()

but what I get is wholly cell format:

enter image description here

how can I get each char format? thanks!


Solution

  • I have examined your spreadsheet and have created my own, which you can find here: https://drive.google.com/open?id=1WBm_tcFdlcckDgIvPdh-ezosm5pP4xXd

    Here is what mine looks like:

    enter image description here

    My fonts are sized 11, 22, 33 and 44 point, to make it easier to find them.

    I have been unable to find an API in either xlrd or in openpyxl that lets us read the multiple fonts stored in a single cell. However, I know that openpyxl offers this functionality. The problem is that this aspect of openpyxl is not well documented.

    I attempted to reverse engineer the file format. Recall the .xlsx files are ZIP files. So I unzipped it. Here is the contents of my Sheet1.xml file, prettyprinted:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" mc:Ignorable="x14ac xr xr2 xr3" xr:uid="{5F15C188-96B2-E44D-B28B-DB5F2AE0283E}">
      <dimension ref="A1"/>
      <sheetViews>
        <sheetView tabSelected="1" zoomScale="352" zoomScaleNormal="352" workbookViewId="0"/>
      </sheetViews>
      <sheetFormatPr baseColWidth="10" defaultRowHeight="16"/>
      <sheetData>
        <row r="1" spans="1:1" ht="57">
          <c r="A1" s="1" t="s">
            <v>0</v>
          </c>
        </row>
      </sheetData>
      <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
    </worksheet>
    

    For comparison, here is yours:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3="http://schemas.microsoft.com/office/spreadsheetml/2016/revision3" mc:Ignorable="x14ac xr xr2 xr3" xr:uid="{6A0965C4-B0BC-435B-932C-B62A46E63DFA}">
      <dimension ref="A1"/>
      <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0"/>
      </sheetViews>
      <sheetFormatPr defaultRowHeight="16.5" x14ac:dyDescent="0.25"/>
      <sheetData>
        <row r="1" spans="1:1" ht="25.5" x14ac:dyDescent="0.25">
          <c r="A1" t="s">
            <v>0</v>
          </c>
        </row>
      </sheetData>
      <phoneticPr fontId="1" type="noConversion"/>
      <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
    </worksheet>
    

    As you can see, the style information is not stored in Sheet1.xml. Instead, it is stored in styles.xml. It appears that Excel creates a named style for a cell that has specialized formatting applied. A named style can contain any of the formatting things that a cell can have, such as fonts and fills.

    So here is what the styles.xml file looks like for a blank Excel file:

    <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:x16r2="http://schemas.microsoft.com/office/spreadsheetml/2015/02/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" mc:Ignorable="x14ac x16r2 xr">
      <fonts count="1" x14ac:knownFonts="1">
        <font>
          <sz val="12"/>
          <color theme="1"/>
          <name val="Calibri"/>
          <family val="2"/>
          <scheme val="minor"/>
        </font>
      </fonts>
      <fills count="2">
        <fill>
          <patternFill patternType="none"/>
        </fill>
        <fill>
          <patternFill patternType="gray125"/>
        </fill>
      </fills>
      <borders count="1">
        <border>
          <left/>
          <right/>
          <top/>
          <bottom/>
          <diagonal/>
        </border>
      </borders>
      <cellStyleXfs count="1">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
      </cellStyleXfs>
      <cellXfs count="1">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
      </cellXfs>
      <cellStyles count="1">
        <cellStyle name="Normal" xfId="0" builtinId="0"/>
      </cellStyles>
      <dxfs count="0"/>
      <tableStyles count="0" defaultTableStyle="TableStyleMedium2" defaultPivotStyle="PivotStyleLight16"/>
      <extLst>
        <ext xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" uri="{EB79DEF2-80B8-43e5-95BD-54CBDDF9020C}">
          <x14:slicerStyles defaultSlicerStyle="SlicerStyleLight1"/>
        </ext>
        <ext xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" uri="{9260A510-F301-46a8-8635-F512D64BE5F5}">
          <x15:timelineStyles defaultTimelineStyle="TimeSlicerStyleLight1"/>
        </ext>
      </extLst>
    </styleSheet>
    

    And here is the one for my file:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:x16r2="http://schemas.microsoft.com/office/spreadsheetml/2015/02/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" mc:Ignorable="x14ac x16r2 xr">
      <fonts count="6">
        <font>
          <sz val="12"/>
          <color theme="1"/>
          <name val="Calibri"/>
          <family val="2"/>
          <scheme val="minor"/>
        </font>
        <font>
          <sz val="11"/>
          <color theme="1"/>
          <name val="Calibri (Body)_x0000_"/>
        </font>
        <font>
          <sz val="12"/>
          <color theme="1"/>
          <name val="Calibri (Body)_x0000_"/>
        </font>
        <font>
          <sz val="44"/>
          <color rgb="FFFF0000"/>
          <name val="Calibri (Body)_x0000_"/>
        </font>
        <font>
          <sz val="33"/>
          <color rgb="FF0070C0"/>
          <name val="Calibri (Body)_x0000_"/>
        </font>
        <font>
          <sz val="22"/>
          <color rgb="FF7030A0"/>
          <name val="Calibri (Body)_x0000_"/>
        </font>
      </fonts>
      <fills count="2">
        <fill>
          <patternFill patternType="none"/>
        </fill>
        <fill>
          <patternFill patternType="gray125"/>
        </fill>
      </fills>
      <borders count="1">
        <border>
          <left/>
          <right/>
          <top/>
          <bottom/>
          <diagonal/>
        </border>
      </borders>
      <cellStyleXfs count="1">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0"/>
      </cellStyleXfs>
      <cellXfs count="2">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
        <xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" applyFont="1"/>
      </cellXfs>
      <cellStyles count="1">
        <cellStyle name="Normal" xfId="0" builtinId="0"/>
      </cellStyles>
      <dxfs count="0"/>
      <tableStyles count="0" defaultTableStyle="TableStyleMedium2" defaultPivotStyle="PivotStyleLight16"/>
      <extLst>
        <ext xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" uri="{EB79DEF2-80B8-43e5-95BD-54CBDDF9020C}">
          <x14:slicerStyles defaultSlicerStyle="SlicerStyleLight1"/>
        </ext>
        <ext xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" uri="{9260A510-F301-46a8-8635-F512D64BE5F5}">
          <x15:timelineStyles defaultTimelineStyle="TimeSlicerStyleLight1"/>
        </ext>
      </extLst>
    </styleSheet>
    

    And here is yours:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:x16r2="http://schemas.microsoft.com/office/spreadsheetml/2015/02/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" mc:Ignorable="x14ac x16r2 xr">
      <fonts count="4" x14ac:knownFonts="1">
        <font>
          <sz val="12"/>
          <color theme="1"/>
          <name val="新細明體"/>
          <family val="2"/>
          <charset val="136"/>
          <scheme val="minor"/>
        </font>
        <font>
          <sz val="9"/>
          <name val="新細明體"/>
          <family val="2"/>
          <charset val="136"/>
          <scheme val="minor"/>
        </font>
        <font>
          <sz val="18"/>
          <color rgb="FFFF0000"/>
          <name val="新細明體"/>
          <family val="1"/>
          <charset val="136"/>
          <scheme val="minor"/>
        </font>
        <font>
          <b/>
          <sz val="16"/>
          <color theme="1"/>
          <name val="新細明體"/>
          <family val="1"/>
          <charset val="136"/>
          <scheme val="minor"/>
        </font>
      </fonts>
      <fills count="2">
        <fill>
          <patternFill patternType="none"/>
        </fill>
        <fill>
          <patternFill patternType="gray125"/>
        </fill>
      </fills>
      <borders count="1">
        <border>
          <left/>
          <right/>
          <top/>
          <bottom/>
          <diagonal/>
        </border>
      </borders>
      <cellStyleXfs count="1">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0">
          <alignment vertical="center"/>
        </xf>
      </cellStyleXfs>
      <cellXfs count="1">
        <xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0">
          <alignment vertical="center"/>
        </xf>
      </cellXfs>
      <cellStyles count="1">
        <cellStyle name="一般" xfId="0" builtinId="0"/>
      </cellStyles>
      <dxfs count="0"/>
      <tableStyles count="0" defaultTableStyle="TableStyleMedium2" defaultPivotStyle="PivotStyleLight16"/>
      <extLst>
        <ext xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"
             uri="{EB79DEF2-80B8-43e5-95BD-54CBDDF9020C}">
          <x14:slicerStyles defaultSlicerStyle="SlicerStyleLight1"/>
        </ext>
        <ext xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
             uri="{9260A510-F301-46a8-8635-F512D64BE5F5}">
          <x15:timelineStyles defaultTimelineStyle="TimeSlicerStyleLight1"/>
        </ext>
      </extLst>
    </styleSheet>
    

    So now you can see why my file is easier to work with, at least for me: the font sizes 11, 22, 33 and 44 are easy to find in the XML! (Mine also has names in English (which I speak), and not in Chinese (which I don't)).

    So in my example there are six fonts (I'm not clear what purpose count="6" serves), and my example has fonts 0, 4, 5, and 3, in that order.

    My problem at this point is I can't find where the list of fonts, in order, gets bound to the cell.

    Oh, here's a program I wrote to decode the spreadsheet with openpyxl, but I didn't get very far:

    from openpyxl import Workbook, load_workbook
    import sys
    
    if __name__=="__main__":
        wb = load_workbook(sys.argv[1])
        for ws in wb.worksheets:
            print(f"Sheet {ws} max rows: {ws.max_row} max cols: {ws.max_column}")
            for row in range(0,ws.max_row):
                for column in range(0,ws.max_column):
                    # Note that openpyxl starts at 1 for rows and columns
                    cell = ws.cell(row=row+1, column=column+1)
                    print(cell)
                    print(dir(cell))
                    for attr in dir(cell):
                        if attr[0]!='_':
                            print(f"cell {attr} = {getattr(cell,attr)}")
                            print("")
            print("ws['A1'].style=",ws['A1'].style)
        print(dir(wb))
        print(wb.named_styles)
        style = wb.named_styles[0]
        print(style)
    

    and here is the output:

    Sheet <Worksheet "Sheet1"> max rows: 1 max cols: 1
    <Cell 'Sheet1'.A1>
    ['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_bind_value', '_comment', '_hyperlink', '_style', '_value', 'alignment', 'base_date', 'border', 'check_error', 'check_string', 'col_idx', 'column', 'column_letter', 'comment', 'coordinate', 'data_type', 'encoding', 'fill', 'font', 'guess_types', 'has_style', 'hyperlink', 'internal_value', 'is_date', 'number_format', 'offset', 'parent', 'pivotButton', 'protection', 'quotePrefix', 'row', 'style', 'style_id', 'value']
    cell alignment = <openpyxl.styles.alignment.Alignment object>
    Parameters:
    horizontal=None, vertical=None, textRotation=0, wrapText=None, shrinkToFit=None, indent=0.0, relativeIndent=0.0, justifyLastLine=None, readingOrder=0.0
    
    cell base_date = 2415018.5
    
    cell border = <openpyxl.styles.borders.Border object>
    Parameters:
    outline=True, diagonalUp=False, diagonalDown=False, start=None, end=None, left=<openpyxl.styles.borders.Side object>
    Parameters:
    style=None, color=None, right=<openpyxl.styles.borders.Side object>
    Parameters:
    style=None, color=None, top=<openpyxl.styles.borders.Side object>
    Parameters:
    style=None, color=None, bottom=<openpyxl.styles.borders.Side object>
    Parameters:
    style=None, color=None, diagonal=<openpyxl.styles.borders.Side object>
    Parameters:
    style=None, color=None, vertical=None, horizontal=None
    
    cell check_error = <bound method Cell.check_error of <Cell 'Sheet1'.A1>>
    
    cell check_string = <bound method Cell.check_string of <Cell 'Sheet1'.A1>>
    
    cell col_idx = 1
    
    cell column = 1
    
    cell column_letter = A
    
    cell comment = None
    
    cell coordinate = A1
    
    cell data_type = s
    
    cell encoding = utf-8
    
    cell fill = <openpyxl.styles.fills.PatternFill object>
    Parameters:
    patternType=None, fgColor=<openpyxl.styles.colors.Color object>
    Parameters:
    rgb='00000000', indexed=None, auto=None, theme=None, tint=0.0, type='rgb', bgColor=<openpyxl.styles.colors.Color object>
    Parameters:
    rgb='00000000', indexed=None, auto=None, theme=None, tint=0.0, type='rgb'
    
    cell font = <openpyxl.styles.fonts.Font object>
    Parameters:
    name='Calibri (Body)_x0000_', charset=None, family=None, b=False, i=False, strike=None, outline=None, shadow=None, condense=None, color=<openpyxl.styles.colors.Color object>
    Parameters:
    rgb=None, indexed=None, auto=None, theme=1, tint=0.0, type='theme', extend=None, sz=12.0, u=None, vertAlign=None, scheme=None
    
    cell guess_types = False
    
    cell has_style = True
    
    cell hyperlink = None
    
    cell internal_value = test
    
    cell is_date = False
    
    cell number_format = General
    
    cell offset = <bound method Cell.offset of <Cell 'Sheet1'.A1>>
    
    cell parent = <Worksheet "Sheet1">
    
    cell pivotButton = False
    
    cell protection = <openpyxl.styles.protection.Protection object>
    Parameters:
    locked=True, hidden=False
    
    cell quotePrefix = False
    
    cell row = 1
    
    cell style = Normal
    
    cell style_id = 1
    
    cell value = test
    
    ws['A1'].style= Normal
    ['_Workbook__write_only', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_active_sheet_index', '_add_sheet', '_alignments', '_borders', '_cell_styles', '_colors', '_data_only', '_date_formats', '_differential_styles', '_external_links', '_fills', '_fonts', '_named_styles', '_number_formats', '_pivots', '_protections', '_read_only', '_setup_styles', '_sheets', '_table_styles', 'active', 'add_named_range', 'add_named_style', 'calculation', 'chartsheets', 'close', 'code_name', 'copy_worksheet', 'create_chartsheet', 'create_named_range', 'create_sheet', 'data_only', 'defined_names', 'encoding', 'epoch', 'excel_base_date', 'get_active_sheet', 'get_index', 'get_named_range', 'get_named_ranges', 'get_sheet_by_name', 'get_sheet_names', 'guess_types', 'index', 'is_template', 'iso_dates', 'loaded_theme', 'mime_type', 'move_sheet', 'named_styles', 'path', 'properties', 'read_only', 'rels', 'remove', 'remove_named_range', 'remove_sheet', 'save', 'security', 'shared_strings', 'sheetnames', 'style_names', 'template', 'vba_archive', 'views', 'worksheets', 'write_only']
    ['Normal']
    Normal
    

    Here are some references I consulted which I thought would help: