Search code examples
pythonencodingxlrdimport-from-excel

Encoding error when opening an Excel file with xlrd


I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:

import xlrd
workbook = xlrd.open_workbook('thefile.xls')

This works for most files, but fails for files I get from a specific organization. The error I get when I try to open Excel files from this organization follows.

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

This looks as if xlrd is trying to open an Excel file encoded in something other than UTF-16. How can I avoid this error? Is the file being written in a flawed way, or is there just a specific character that is causing the problem? If I open and re-save the Excel file, xlrd opens the file without a problem.

I have tried opening the workbook with different encoding overrides but this doesn't work either.

The file I am trying to open is available here:

https://dl.dropboxusercontent.com/u/6779408/Stackoverflow/AEPUsageHistoryDetail_RequestID_00183816.xls

Issue reported here: https://github.com/python-excel/xlrd/issues/128


Solution

  • What are they using to generate that file ?

    They are using some Java Excel API (see below, link here), probably on an IBM mainframe or similar.

    From the stack trace the writeaccess information can't decoding into Unicode because the @ character.

    For more information on the writeaccess information of the XLS fileformat see 5.112 WRITEACCESS or Page 277.

    This field contains the username of the user that has saved the file.

    import xlrd
    dump = xlrd.dump('thefile.xls')
    

    Running xlrd.dump on the original file gives

       36: 005c WRITEACCESS len = 0070 (112)
       40:      d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40  ????@?????@???@@
       56:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
       72:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
       88:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
      104:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
      120:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
      136:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
    

    After resaving it with Excel or in my case LibreOffice Calc the write access information is overwritten with something like

     36: 005c WRITEACCESS len = 0070 (112)
     40:      04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20  ?~~Calc         
     56:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
     72:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
     88:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
    104:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
    120:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
    136:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
    

    Based on the spaces being encoded as 40, I believe the encoding is EBCDIC, and when we convert d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 to EBCDIC we get Java Excel API.

    So yes the file is being written in a flawed way in the case of BIFF8 and higher it should be a unicode string, and in BIFF3 to BIFF5, it should be a byte string in the encoding in the CODEPAGE information which is

     152: 0042 CODEPAGE len = 0002 (2)
     156:      12 52                                            ?R
    

    1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.

    The fact the xlrd tried to use unicode, means that it determined the version of the file to be BIFF8.

    In this case, you have two options

    1. Fix the file before opening it with xlrd. You could check using dump to a file that isn't standard out, and then if it is the case, you can overwrite the writeaccess information with xlutils.save or another library.

    2. Patch xlrd to handle your special case, in handle_writeaccess adding a try block and setting strg to empty string on unpack_unicode failure.

    The following snippet

     def handle_writeaccess(self, data):
            DEBUG = 0
            if self.biff_version < 80:
                if not self.encoding:
                    self.raw_user_name = True
                    self.user_name = data
                    return
                strg = unpack_string(data, 0, self.encoding, lenlen=1)
            else:
                try:
                    strg = unpack_unicode(data, 0, lenlen=2)
                except:
                    strg = ""
            if DEBUG: fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n", len(data), self.raw_user_name, strg)
            strg = strg.rstrip()
            self.user_name = strg
    

    with

    workbook=xlrd.open_workbook('thefile.xls',encoding_override="cp1252")
    

    Seems to open the file successfully.

    Without the encoding override it complains ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010