
try except IndexError - I am not getting the desired result


I am trying to read PDF files and to convert them to clean data frames in Python. I loop through all relevant pages and want to append the data frames step-by-step to get one big table with all information.

Pages 32-33 need slightly different treatment than the other pages (otherwise an IndexError is raised). I tried to account for this with try-except. However, after running the code, the information from pages 32-33 is missing from ledig['2000'], the resulting data frame.

I have run the code from the except block on its own, reading only pages 32-33, and it works.

Any ideas?

As I am using try-except for the first time, it is of course possible that I misunderstood the concept in some way or other.

My code:

import camelot
import pandas as pd
ledig = {}
d = 2000
df_name = str(d)
tables = camelot.read_pdf('https://www.estv.admin.ch/dam/estv/de/dokumente/allgemein/Dokumentation/Zahlen_fakten/Steuerstatistiken/steuerbelastung_gemeinden/'+str(d)+'/BAE/Bruttoarbeitseinkommen%20Lediger.pdf.download.pdf/'+str(d)+'_bruttoarbeit_lediger_'+str(d)+'.pdf', pages="2-end", flavor='stream')
j = tables.n - 1
ledig[df_name] = pd.DataFrame()
for i in range(0,j):
    try:
        row = tables[i].df[tables[i].df.iloc[:,1] == '20'].index.tolist() #look for value "20", we want to move that to the top and delete rows above
        df = tables[i].df[row[0]:]
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:] #take the data less the header row
        df.columns = new_header #set the header row as the df header
        df = df.replace('-','0')
        df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)  
        ledig[df_name] = ledig[df_name].append(df)
        ledig[df_name] = ledig[df_name].dropna()
        ledig[df_name].drop_duplicates(keep=False,inplace=True) 
    except IndexError:
        row = tables[i].df[tables[i].df.iloc[:,2] == '20'].index.tolist() #look for value "20", we want to move that to the top and delete rows above
        df = tables[i].df[row[0]:]
        df = df.drop(df.columns[[1,3]], axis=1) 
        new_header = df.iloc[0] #grab the first row for the header
        df = df[1:] #take the data less the header row
        df.columns = new_header #set the header row as the df header
        df = df.replace('-','0')
        df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric)  
        df.fillna(0, inplace = True)  
        ledig[df_name] = ledig[df_name].append(df)
        ledig[df_name] = ledig[df_name].dropna()
        ledig[df_name].drop_duplicates(keep=False,inplace=True)

Solution

  • Your usage of try/except is correct.

    The problem is in df = df.drop(df.columns[[1,3]], axis=1): it also drops the fourth column (position 3), which you should keep.

    If you use df = df.drop(df.columns[[1]], axis=1), tables from pages 32 and 33 are correctly appended.
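
To see why one extra dropped column can make whole pages disappear, here is a minimal sketch on made-up data. The table layout, the column names and the clean helper are assumptions for illustration, since the real PDF tables are not shown in the post. It uses pd.concat instead of DataFrame.append, which was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for a table from pages 32-33 (layout is assumed):
# position 1 is an empty spacer column, position 3 holds real data.
raw = pd.DataFrame(
    [
        ["Gemeinde", "", "20", "30"],
        ["A", "", "1.5", "2.5"],
        ["B", "", "-", "3.5"],
    ]
)


def clean(table, positions_to_drop):
    """Drop columns by position, promote the first row to the header,
    replace '-' with '0' and convert the value columns to numbers."""
    df = table.drop(table.columns[positions_to_drop], axis=1)
    header = df.iloc[0]
    df = df.iloc[1:].copy()
    df.columns = header
    df = df.replace("-", "0")
    value_cols = df.columns[1:]
    df[value_cols] = df[value_cols].apply(pd.to_numeric)
    return df


# Rows already collected from the other pages (also made up).
main = pd.DataFrame({"Gemeinde": ["X"], "20": [1.0], "30": [2.0]})

too_many = clean(raw, [1, 3])  # also removes the "30" data column
just_right = clean(raw, [1])   # removes only the spacer column

# Rows that lack the "30" column get NaN there when combined,
# and dropna() then removes those rows entirely.
lost = pd.concat([main, too_many]).dropna()
kept = pd.concat([main, just_right]).dropna()

print(len(lost), len(kept))
```

If the real tables behave like this sketch, the rows from pages 32-33 were appended but then removed by dropna() because they lacked one of the columns present on the other pages.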