python, openpyxl, xlrd

Error opening xlsx files in Python


I am trying to open an xlsx file created by another system (the data always arrives in this format, which is not in my control). I tried both openpyxl (v2.3.2) and xlrd (v1.0.0), as well as pandas (v0.20.1) read_excel and pd.ExcelFile(), both of which use xlrd and so may be moot. All of them raise errors, and my searches have not turned up an answer. Any help is appreciated.

xlrd code:

import xlrd
workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')

Error:

Traceback (most recent call last):

  File "<ipython-input-3-9e5d87f720d0>", line 2, in <module>
    workbook = xlrd.open_workbook(r'C:/Temp/Data.xlsx')

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
    ragged_rows=ragged_rows,

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
    x12sheet.process_stream(zflo, heading)

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
    self_do_row(elem)

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
    self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)

ValueError: could not convert string to float: 

openpyxl code:

import openpyxl
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')

Error:

Traceback (most recent call last):

  File "<ipython-input-2-6083ad2bc875>", line 1, in <module>
    wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx')

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\excel.py", line 234, in load_workbook
    parser.parse()

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 106, in parse
    dispatcher[tag_name](element)

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 243, in parse_row_dimensions
    self.parse_cell(cell)

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\reader\worksheet.py", line 188, in parse_cell
    value = _cast_number(value)

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
    return long(value)

ValueError: invalid literal for int() with base 10: ' '

pandas code:

import pandas as pd
df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')

Error:

Traceback (most recent call last):

  File "<ipython-input-5-b86ec98a4e9e>", line 2, in <module>
    df = pd.read_excel(r'C:/Temp/Data.xlsx', sheetname='Sheet1')

  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 200, in read_excel
    io = ExcelFile(io, engine=engine)

  File "C:\Program Files\Anaconda3\lib\site-packages\pandas\io\excel.py", line 257, in __init__
    self.book = xlrd.open_workbook(io)

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\__init__.py", line 422, in open_workbook
    ragged_rows=ragged_rows,

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 833, in open_workbook_2007_xml
    x12sheet.process_stream(zflo, heading)

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 548, in own_process_stream
    self_do_row(elem)

  File "C:\Program Files\Anaconda3\lib\site-packages\xlrd\xlsx.py", line 685, in do_row
    self.sheet.put_cell(rowx, colx, None, float(tvalue), xf_index)

ValueError: could not convert string to float: 

For what it's worth, here is an example snippet of the input file: (image: Input file example)

I am guessing that the errors come from the first row having blanks beyond the first column, because the errors vanish when I delete the first two rows. I cannot skip the first two rows, because I want to extract the value in cell A1. I would also like to force all values to be read as strings; I will convert them to float later, with error checking. Thanks!
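
For the "read as string, convert to float later with error checking" step, a minimal sketch could look like the following. The helper name to_float_or_none is hypothetical and not part of xlrd, openpyxl, or pandas; it assumes that blank or non-numeric text should simply become None.

```python
def to_float_or_none(raw):
    """Convert a cell value read as text to float.

    Returns None for missing, blank, or non-numeric input instead of raising.
    (Hypothetical helper for illustration only.)
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        # Blank cells such as ' ' are what trip up float() and int().
        return None
    try:
        return float(text)
    except ValueError:
        return None
```

Called on a blank string such as ' ' or on a label such as "Generated from:", this returns None rather than raising ValueError.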

===========

Update (Aug 9, 10 AM EDT): Using Charlie's suggestion, I was able to open the Excel file in read-only mode and read most of the contents, but I am still running into an error somewhere. New code (sorry it is not very Pythonic; I am still a newbie):

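import openpyxl

# Open in read-only mode so cells are parsed lazily, row by row.
wb = openpyxl.load_workbook(r'C:/Temp/Data.xlsx', read_only=True)
ws = wb['Sheet1']
# Clear the dimensions recorded in the file, in case they are wrong.
ws.max_row = ws.max_column = None

i = 1
for row in ws.rows:
    for cell in row:
        if i < 2000:  # only print the first ~2000 cells
            i += 1
            try:
                print(i, cell.value)
            except Exception:
                print("error")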

Error:

Traceback (most recent call last):

  File "<ipython-input-65-2e8f3cf2294a>", line 2, in <module>
    for row in ws.rows:

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 125, in get_squared_range
    yield tuple(self._get_row(element, min_col, max_col))

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\worksheet\read_only.py", line 165, in _get_row
    value, data_type, style_id)

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 36, in __init__
    self.value = value

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 132, in value
    value = _cast_number(value)

  File "C:\Program Files\Anaconda3\lib\site-packages\openpyxl\cell\read_only.py", line 23, in _cast_number
    return long(value)

ValueError: invalid literal for int() with base 10: ' '

=========

Update 2 (10:35 AM): When I read the file without setting ws.max_row and ws.max_column to None, the code read just one column, without errors. The value in cell A66 is "Generated from:". When I do set ws.max_row and ws.max_column to None, this particular cell causes trouble. However, I can read all the cells before it, and that works fine for me right now. Thanks, @Charlie.
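
Reading "everything before the bad cell" can be sketched with a small wrapper that stops cleanly when the row iterator raises. The name rows_until_error is hypothetical, and the sketch assumes the rows come from a generator (as ws.rows appears to in the traceback above), which cannot be resumed once it has raised.

```python
def rows_until_error(rows, errors=(ValueError,)):
    """Yield rows from an iterable until it raises one of `errors`, then stop.

    Hypothetical helper: a generator that has raised cannot be resumed,
    so everything after the failing row is lost.
    """
    iterator = iter(rows)
    while True:
        try:
            row = next(iterator)
        except StopIteration:
            return
        except errors as exc:
            print("stopped reading:", exc)
            return
        yield row
```

It would be used as `for row in rows_until_error(ws.rows): ...` in place of iterating over ws.rows directly.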


Solution

  • It sounds like the source file is probably corrupt and contains cells with empty strings that are typed as numbers. You might be able to use openpyxl's read-only mode to skip the first two rows.
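
One way to check that diagnosis without xlrd or openpyxl is to look at the worksheet XML inside the xlsx archive directly. The sketch below uses only the standard library; the function name find_blank_numeric_cells is hypothetical, and it assumes the usual layout in which worksheets are stored under xl/worksheets/ and a cell with no t attribute is numeric.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def find_blank_numeric_cells(xlsx):
    """Return (sheet_part, cell_ref) pairs for numeric cells whose <v> is blank.

    `xlsx` may be a file path or a binary file-like object.
    """
    bad = []
    with zipfile.ZipFile(xlsx) as zf:
        for name in zf.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            with zf.open(name) as fh:
                # iterparse reports each element once its closing tag is read.
                for _, elem in ET.iterparse(fh):
                    if elem.tag != NS + "c":
                        continue
                    # A missing t attribute means the cell is numeric ("n").
                    if elem.get("t", "n") == "n":
                        v = elem.find(NS + "v")
                        if v is not None and (v.text or "").strip() == "":
                            bad.append((name, elem.get("r")))
                    elem.clear()  # free memory on large sheets
    return bad
```

Running `find_blank_numeric_cells(r'C:/Temp/Data.xlsx')` would list the references of any cells that are typed as numbers but hold only whitespace, which are the kind of cells that make float() and int() fail in the tracebacks above.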