Search code examples
pythonpandascsvxlsx

"Error: List index out of range" Over a list of 952 xlsx files, how edit and then save as csv


The goal is to combine the important information from sheet1 & sheet2 of 952 excel files. Then save them as csvs in path as a name according to a cell value. Thanks to the stackoverflow community this is mostly working. Now, it gives the error: "List index out of range". This happens around the half way point specifically 475 to 516 files saved correctly.

Would anyone be able to make this work for the entire list?

# https://stackoverflow.com/questions/44776793/copy-all-csv-files-in-a-directory-of-folders-to-one-folder-in-python
# https://stackoverflow.com/questions/59292999/modifying-multiple-csv-files-from-same-directory-in-python
# https://stackoverflow.com/questions/38101009/changing-multiple-column-names-but-not-all-of-them-pandas-python
# https://stackoverflow.com/questions/28465779/how-do-i-limit-the-amount-of-letters-in-a-string
# https://stackoverflow.com/questions/37952797/pandas-dataframe-column-name-remove-special-character
import glob
import pandas as pd

excel_files = glob.glob('data1/*.xlsx')
path = Path('data2')
for excel in excel_files:
    df1 = pd.read_excel(excel, sheet_name=0, dtype=str, index_col=None)
    df2 = pd.read_excel(excel, sheet_name=1, dtype=str, index_col=None)
    i = df1.iat[0,1]
    j = df1.iat[0,15]
    df2.rename(columns={'Date':'Date','Sales':i+j}, inplace=True)
    df2.columns=df2.columns.str.replace('(','') 
    df2.columns=df2.columns.str.replace('/','')
    df2.columns=df2.columns.str.replace(')','')
    df2.columns=df2.columns.str.replace(' ','-')
    df2.columns=df2.columns.str.replace('<','-')
    df2.columns=df2.columns.str.replace('>','-')
    k = df2.columns[1]
    l = (k)[:19]
    m = l + '.csv'
    df2.to_csv(path/m, encoding='utf-8', index=False)

->Edit below: Stack-trace as requested. Thanks for taking a look at least.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-3f727a40755a> in <module>
     23 for excel in excel_files:
     24     df1 = pd.read_excel(excel, sheet_name=0, dtype=str, index_col=None)
---> 25     df2 = pd.read_excel(excel, sheet_name=1, dtype=str, index_col=None)
     26     i = df1.iat[0,1]
     27     j = df1.iat[0,15]

~/anaconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    332         convert_float=convert_float,
    333         mangle_dupe_cols=mangle_dupe_cols,
--> 334         **kwds,
    335     )
    336 

~/anaconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    886             convert_float=convert_float,
    887             mangle_dupe_cols=mangle_dupe_cols,
--> 888             **kwds,
    889         )
    890 

~/anaconda3/lib/python3.7/site-packages/pandas/io/excel/_base.py in parse(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, **kwds)
    439                 sheet = self.get_sheet_by_name(asheetname)
    440             else:  # assume an integer if not a string
--> 441                 sheet = self.get_sheet_by_index(asheetname)
    442 
    443             data = self.get_sheet_data(sheet, convert_float)

~/anaconda3/lib/python3.7/site-packages/pandas/io/excel/_xlrd.py in get_sheet_by_index(self, index)
     44 
     45     def get_sheet_by_index(self, index):
---> 46         return self.book.sheet_by_index(index)
     47 
     48     def get_sheet_data(self, sheet, convert_float):

~/anaconda3/lib/python3.7/site-packages/xlrd/book.py in sheet_by_index(self, sheetx)
    464         :returns: A :class:`~xlrd.sheet.Sheet`.
    465         """
--> 466         return self._sheet_list[sheetx] or self.get_sheet(sheetx)
    467 
    468     def sheet_by_name(self, sheet_name):

IndexError: list index out of range

Solution

  • Chris's comment solved the problem:

    It means your particular excel workbook only has one sheet unlike the others and read_excel fails if parameter sheet_name is 1. If you want to handle that case, you need to wrap your pandas.read_excel call inside a try / except clause.