I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:
import xlrd
workbook = xlrd.open_workbook('thefile.xls')
This works for most files, but fails for files I get from a specific organization. The error I get when I try to open Excel files from this organization follows.
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
ragged_rows=ragged_rows,
File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
bk.parse_globals()
File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
self.handle_writeaccess(data)
File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
strg = unpack_unicode(data, 0, lenlen=2)
File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data
This looks as if xlrd is trying to open an Excel file encoded in something other than UTF-16. How can I avoid this error? Is the file being written in a flawed way, or is there just a specific character that is causing the problem? If I open and re-save the Excel file, xlrd opens the file without a problem.
I have tried opening the workbook with different encoding overrides but this doesn't work either.
The file I am trying to open is available here:
Issue reported here: https://github.com/python-excel/xlrd/issues/128
What are they using to generate that file ?
They are using some Java Excel API (see below, link here), probably on an IBM mainframe or similar.
From the stack trace the writeaccess information can't decoding into Unicode because the @ character.
For more information on the writeaccess information of the XLS fileformat see 5.112 WRITEACCESS or Page 277.
This field contains the username of the user that has saved the file.
import xlrd
dump = xlrd.dump('thefile.xls')
Running xlrd.dump on the original file gives
36: 005c WRITEACCESS len = 0070 (112)
40: d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 ????@?????@???@@
56: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
72: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
88: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
104: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
120: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
136: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 @@@@@@@@@@@@@@@@
After resaving it with Excel or in my case LibreOffice Calc the write access information is overwritten with something like
36: 005c WRITEACCESS len = 0070 (112)
40: 04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20 ?~~Calc
56: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
72: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
88: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
104: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
120: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
136: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
Based on the spaces being encoded as 40, I believe the encoding is EBCDIC, and when we convert d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40
to EBCDIC we get Java Excel API
.
So yes the file is being written in a flawed way in the case of BIFF8 and higher it should be a unicode string, and in BIFF3 to BIFF5, it should be a byte string in the encoding in the CODEPAGE information which is
152: 0042 CODEPAGE len = 0002 (2)
156: 12 52 ?R
1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.
The fact the xlrd tried to use unicode, means that it determined the version of the file to be BIFF8.
In this case, you have two options
Fix the file before opening it with xlrd. You could check using dump to a file that isn't standard out, and then if it is the case, you can overwrite the writeaccess information with xlutils.save or another library.
Patch xlrd to handle your special case, in handle_writeaccess
adding a try block and setting strg to empty string on unpack_unicode failure.
The following snippet
def handle_writeaccess(self, data):
DEBUG = 0
if self.biff_version < 80:
if not self.encoding:
self.raw_user_name = True
self.user_name = data
return
strg = unpack_string(data, 0, self.encoding, lenlen=1)
else:
try:
strg = unpack_unicode(data, 0, lenlen=2)
except:
strg = ""
if DEBUG: fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n", len(data), self.raw_user_name, strg)
strg = strg.rstrip()
self.user_name = strg
with
workbook=xlrd.open_workbook('thefile.xls',encoding_override="cp1252")
Seems to open the file successfully.
Without the encoding override it complains ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010