I have an Excel file (data.xlsx) with the following pattern and I need to read a value from a textbox inside that Excel file.
I am currently using pandas library and I tried to obtain the value, but unfortunately could not find any API for that objective.
Does anyone know how could this be done?
More info:
My question is a duplicate of this sister question from Java.
Edit:
I have provided a solution for anyone who wonders how to search for shapes (and possibly all other items) in an excel file, manually (that is, without external modules from pip). It's actually pretty simple. See my comment.
Thanks for all the kind help, but I worked this out on my own.
I got this to work using zipfile module. Apparently, Excel is actually a suite that works on compressed XML files (changing the *.xlsx to *.zip reveals the contents of the file) when saving and reading from *.xlsx, so I could easily search the text needed with the comfort of XML.
Here's the module I made. By calling Sheet('path/to/sheet.xlsx').shapes.text
, you can now easily find the text inside textboxes:
import zipfile as z
class Sheet(str):
@property
def shapes(this):
s = z.ZipFile(this)
p='xl/drawings/drawing1.xml' # shapes path, *.xlsx default
p='drs/shapexml.xml' # shapes path, *.xls default
return XML(s.read(p))
class XML(object):
def __init__(self, value):
self.value = str(value)
def __repr__(self):
return repr(self.value)
def __getitem__(self, i):
return self.value[i]
def tag_content(self, tag):
return [XML(i) for i in self.value.split(tag)[1::2]]
@property
def text(self):
t = self.tag_content('xdr:txBody') # list of XML codes, each containing a seperate textboxes, messy (with extra xml that is)
l = [i.tag_content('a:p>') for i in t] # split into sublists by line breaks (inside the textbox), messy
w = [[[h[1:-2] for h in i.tag_content('a:t')] if i else ['\n'] for i in j] for j in l] # clean into sublists by cell-by-cell basis (and mind empty lines)
l = [[''.join(i) for i in j] for j in w] # join lines overlapping multiple cells into one sublist
return ['\n'.join(j) for j in l] # join sublists of lines into strings seperated by newline char
So now the pattern provided in my question will be output as ['comments extra']
, while a pattern such as:
This is
Text in a textbox on
a sheet
And this is another text box somewhere else
Regardless of the overlapped cells
Will be output as ['This is\nText in a textbox on\na sheet','And this is another text box somewhere else\nRegardless of the overlapped cells']
.
You're welcome.