I'm trying to read from an ods (Opendocument spreadsheet) document with the odfpy modules. So far I've been able to extract some data but whenever a cell contains non-standard input the script errors out with:
Traceback (most recent call last):
File "python/test.py", line 26, in <module>
print x.firstChild
File "/usr/lib/python2.7/site-packages/odf/element.py", line 247, in __str__
return self.data.encode()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 4: ordinal not in range(128)
I tried to force an encoding on the output but apparently it does not go well with print:
Traceback (most recent call last):
File "python/test.py", line 27, in <module>
print x.firstChild.encode('utf-8', 'ignore')
AttributeError: Text instance has no attribute 'encode'
What is the problem here and how could it be solved without editing the module code (which I'd like to avoid at all cost)? Is there an alternative to running encode on output that could work?
Here is my code:
from odf.opendocument import Spreadsheet
from odf.opendocument import load
from odf.table import Table,TableRow,TableCell
from odf.text import P
import sys,codecs
doc = load(sys.argv[1])
d = doc.spreadsheet
tables = d.getElementsByType(Table)
for table in tables:
tName = table.attributes[(u'urn:oasis:names:tc:opendocument:xmlns:table:1.0', u'name')]
print tName
rows = table.getElementsByType(TableRow)
for row in rows[:2]:
cells = row.getElementsByType(TableCell)
for cell in cells:
tps = cell.getElementsByType(P)
if len(tps)>0:
for x in tps:
#print x.firstChild
print x.firstChild.encode('utf-8', 'ignore')
Maybe you are not using the latest odfpy
, in the latest verion, the __str__
method of Text
is implemented as:
def __str__(self):
return self.data
Update odfpy
to the latest version, and modify your code as:
print x.firstChild.__str__().encode('utf-8', 'ignore')
UPDATE
This is another method for getting the raw unicode data for Text
: __unicode__
. So if you don't want to update odfpy
, modify your code as:
print x.firstChild.__unicode__().encode('utf-8', 'ignore')