Search code examples
pythonexcelpandasxlrd

Pandas suddenly cannot open Excel file (can't find workbook in OLE2 compound document


I have a script that reads an xlsx excel file that was working fine until a week ago. The error message is:

xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document

By debugging the script, I've found the whole stack:

C:\MyFolder\MyScript.py", line 42, in PandasReadExcel 
   ef=pd.read_excel(excfile,sheetname,header,skiprows) 
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 191, in read_excel 
   io = ExcelFile(io, engine=engine) 
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 249, in __init__ 
   self.book = xlrd.open_workbook(io) 
File "C:\Python\Python36\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook 
   ragged_rows=ragged_rows, File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 87, in open_workbook_xls 
   ragged_rows=ragged_rows, 
File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 595, in biff2_8_load 
   raise XLRDError("Can't find workbook in OLE2 compound document") 
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document

By reviewing similar cases here and in GitHub, I've found that this error usually occurs with xlsm files or password-protected files. But the concerned Excel workbook is not password protected and is a xlsx file. To my "unluck" I don't know the person who changes the file, it is updated regularly by a team that takes laboratory analysis, so I don't have any ideas of what they changed in the file. All I know is that I can open/edit that file with no problem.

Some threads suggest updating pandas or xlrd version (I am using pandas 0.19.2), which I am wanting to avoid, since the script runs in a remote server and updating the version would affect proper work of other scripts depending on this routine.

I thank anybody who has any clue on how to solve this problem.


Solution

  • After months struggling with this error, I've learned that the concerned files are being edited using an older version of Microsoft Office (namely Office 2007, in this very case). Then I decided to implement a clumsy workaround solution: Just open the files using a compatible Excel version, and save a copy in a different folder; then open the file using pandas read_excel function, it should open normally! To automate this task I wrote a powershell script just to open the original file and save the copy. This script must be executed according to how often the data is updated:

    $FileName = "\\path\to\the\source\file.xlsx"
    $FileNameCopy = "\\path\to\the\copy\file.xlsx"
    
    $xl = New-Object -comobject Excel.Application
    # repeat this for every file concerned
    $wb = $xl.Workbooks.open("$FileName",3)
    $wb.SaveAs($FileNameCopy)
    $wb.Close($False)
    
    $xl.Quit()
    

    Now I can have my data loaded normally again.