Search code examples
pythonexcelpandasopenpyxlxlrd

Obtain textbox value from Excel in Python


I have an Excel file (data.xlsx) with the following pattern and I need to read a value from a textbox inside that Excel file.

I am currently using pandas library and I tried to obtain the value, but unfortunately could not find any API for that objective.

Does anyone know how could this be done?

More info:

My question is a duplicate of this sister question from Java.

Edit:

I have provided a solution for anyone who wonders how to search for shapes (and possibly all other items) in an excel file, manually (that is, without external modules from pip). It's actually pretty simple. See my comment.


Solution

  • Thanks for all the kind help, but I worked this out on my own.

    I got this to work using zipfile module. Apparently, Excel is actually a suite that works on compressed XML files (changing the *.xlsx to *.zip reveals the contents of the file) when saving and reading from *.xlsx, so I could easily search the text needed with the comfort of XML.

    Here's the module I made. By calling Sheet('path/to/sheet.xlsx').shapes.text, you can now easily find the text inside textboxes:

    import zipfile as z
    
    
    class Sheet(str):
        @property
        def shapes(this):
            s = z.ZipFile(this)
            p='xl/drawings/drawing1.xml'  # shapes path, *.xlsx default
            p='drs/shapexml.xml'  # shapes path, *.xls default
            return XML(s.read(p))
    
    
    class XML(object):
        def __init__(self, value):
            self.value = str(value)
    
        def __repr__(self):
            return repr(self.value)
    
        def __getitem__(self, i):
            return self.value[i]
    
        def tag_content(self, tag):
            return [XML(i) for i in self.value.split(tag)[1::2]]
    
        @property
        def text(self):
            t = self.tag_content('xdr:txBody')  # list of XML codes, each containing a seperate textboxes, messy (with extra xml that is)
            l = [i.tag_content('a:p>') for i in t]  # split into sublists by line breaks (inside the textbox), messy
            w = [[[h[1:-2] for h in i.tag_content('a:t')] if i else ['\n'] for i in j] for j in l]  # clean into sublists by cell-by-cell basis (and mind empty lines)
            l = [[''.join(i) for i in j] for j in w]  #  join lines overlapping multiple cells into one sublist
            return ['\n'.join(j) for j in l]  #  join sublists of lines into strings seperated by newline char
    

    So now the pattern provided in my question will be output as ['comments extra'], while a pattern such as:

    1. This is

      Text in a textbox on

      a sheet

    2. And this is another text box somewhere else

      Regardless of the overlapped cells

    Will be output as ['This is\nText in a textbox on\na sheet','And this is another text box somewhere else\nRegardless of the overlapped cells'].

    You're welcome.