Search code examples
pythonexcelpandasdata-analysis

pandas.read_excel imports wrong values for a column


I am using pandas.read_excel to import an excel file into a DataFrame. This is the Code...

#!/usr/bin/python
import pandas as pd
file = 'sample.xls'
df = pd.read_excel(file, sheetname=0, skiprows=7)

This imports the file but with the below warning...

WARNING *** OLE2 stream 'SSCS': expected size 128640, actual size 512

And When when I print the dataframe, I see that the last column has completely wrong values(instead of actual values from that column, it has shows 4 for every row.

I am attaching a screen grab of the xls file. Sample Xls file 1st page screen grab


Solution

  • If you are using Windows, you could use Excel itself to modify all of the XLS files before loading them with Pandas. The following script will automatically unhide all of the columns in all XLS files found in a given folder:

    import win32com.client as win32
    import glob
    
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    
    for xls in glob.glob(r"C:\My Path\*.xls"):
        print xls
        wb = excel.Workbooks.Open(xls)
        ws = wb.Worksheets(1)
        ws.Columns.EntireColumn.Hidden = False
        excel.DisplayAlerts = False     # Allow file overwrite
        wb.Close(True)
    
    excel.Application.Quit()
    

    You might want to make a copy of your XLS files before doing this as it will be done in place. Alternatively, you could use wb.SaveAs() to specify a different output location.