I have a script that reads an xlsx excel file that was working fine until a week ago. The error message is:
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
By debugging the script, I've found the whole stack:
C:\MyFolder\MyScript.py", line 42, in PandasReadExcel
ef=pd.read_excel(excfile,sheetname,header,skiprows)
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 191, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Python\Python36\lib\site-packages\pandas\io\excel.py", line 249, in __init__
self.book = xlrd.open_workbook(io)
File "C:\Python\Python36\lib\site-packages\xlrd\__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows, File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 87, in open_workbook_xls
ragged_rows=ragged_rows,
File "C:\Python\Python36\lib\site-packages\xlrd\book.py", line 595, in biff2_8_load
raise XLRDError("Can't find workbook in OLE2 compound document")
xlrd.biffh.XLRDError: Can't find workbook in OLE2 compound document
By reviewing similar cases here and in GitHub, I've found that this error usually occurs with xlsm files or password-protected files. But the concerned Excel workbook is not password protected and is a xlsx file. To my "unluck" I don't know the person who changes the file, it is updated regularly by a team that takes laboratory analysis, so I don't have any ideas of what they changed in the file. All I know is that I can open/edit that file with no problem.
Some threads suggest updating pandas or xlrd version (I am using pandas 0.19.2), which I am wanting to avoid, since the script runs in a remote server and updating the version would affect proper work of other scripts depending on this routine.
I thank anybody who has any clue on how to solve this problem.
After months struggling with this error, I've learned that the concerned files are being edited using an older version of Microsoft Office (namely Office 2007, in this very case). Then I decided to implement a clumsy workaround solution: Just open the files using a compatible Excel version, and save a copy in a different folder; then open the file using pandas read_excel function, it should open normally! To automate this task I wrote a powershell script just to open the original file and save the copy. This script must be executed according to how often the data is updated:
$FileName = "\\path\to\the\source\file.xlsx"
$FileNameCopy = "\\path\to\the\copy\file.xlsx"
$xl = New-Object -comobject Excel.Application
# repeat this for every file concerned
$wb = $xl.Workbooks.open("$FileName",3)
$wb.SaveAs($FileNameCopy)
$wb.Close($False)
$xl.Quit()
Now I can have my data loaded normally again.