
Accessing indexes in a list


I am using tabula-py to extract a table from a pdf document like this:

rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True) 

rows

This gives an output like so:

[                                                   0
 0  Customer Statement\rxxxxxxx\rP...
 1  Print Date: April 12, 2020Address: 41 BAALE ST...
 2  Period: January 1, 2020 ­ April 12, 2020Openin...,
                                                    0
 0  Customer Statement\xxxxxxxx\rP...
 1  Print Date: April 12, 2020Address: 41 gg ST...,
              0          1            2          3          4          5  \
 0  03­Jan­2020          0  03­Jan­2020        NaN  50,000.00  52,064.00   
 1  10­Jan­2020          0  10­Jan­2020  25,000.00        NaN  27,064.00   
 2  10­Jan­2020          0  10­Jan­2020      25.00        NaN  27,039.00   
 3  10­Jan­2020          0  10­Jan­2020       1.25        NaN  27,037.75   
 4  20­Jan­2020  999921...  20­Jan­2020  10,000.00        NaN  17,037.75   
 5  23­Jan­2020  999984...  23­Jan­2020   4,050.00        NaN  12,987.75   
 6  23­Jan­2020          0  23­Jan­2020   1,000.00        NaN  11,987.75   
 7  24­Jan­2020          0  24­Jan­2020   2,000.00        NaN   9,987.75   
 8  24­Jan­2020          0  24­Jan­2020        NaN  30,000.00  39,987.75   

                                                    6  
 0  TRANSFER BETWEEN\rCUSTOMERS Via GG from\r...  
 1  NS Instant Payment Outward\r000013200110121...  
 2  COMMISSION\r0000132001101218050000326...\rNIP ...  
 3     VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001  
 4  CASH WITHDRAWAL FROM\rOTHER ATM ­210674­ ­4420...  
 5  POS/WEB PURCHASE\rTRANSACTION ­845061­\r­80405...  
 6  Airtime Purchase MBANKING­\r101CT0000000001551...  
 7  Airtime Purchase MBANKING­\r101CT0000000001552...  
 8  TRANSFER BETWEEN\rCUSTOMERS\r00001520012412113...  ,

What I want from this pdf starts from index 2. So I run

rows[2]

And I get a dataframe that looks like this:

[Image: screenshot of the DataFrame returned by rows[2]]

Now I want everything from index 2 through the last index, so I tried:

rows[2:]

But I am getting a list and not the expected dataframe.

[             0          1            2          3          4          5  \
 0  03­Jan­2020          0  03­Jan­2020        NaN  50,000.00  52,064.00   
 1  10­Jan­2020          0  10­Jan­2020  25,000.00        NaN  27,064.00   
 2  10­Jan­2020          0  10­Jan­2020      25.00        NaN  27,039.00   
 3  10­Jan­2020          0  10­Jan­2020       1.25        NaN  27,037.75   
 4  20­Jan­2020  999921...  20­Jan­2020  10,000.00        NaN  17,037.75   
 5  23­Jan­2020  999984...  23­Jan­2020   4,050.00        NaN  12,987.75   
 6  23­Jan­2020          0  23­Jan­2020   1,000.00        NaN  11,987.75   
 7  24­Jan­2020          0  24­Jan­2020   2,000.00        NaN   9,987.75   
 8  24­Jan­2020          0  24­Jan­2020        NaN  30,000.00  39,987.75   

                                                    6  
 0  TRANSFER BETWEEN\rCUSTOMERS Via gg from\r...  
 1  bi Instant Payment Outward\r000013200110121...  
 2  COMMISSION\r0000132001101218050000326...\rNIP ...  
 3     VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001  
 4  CASH WITHDRAWAL FROM\rOTHER ATM ­210674­ ­4420...  
 5  POS/WEB PURCHASE\rTRANSACTION ­845061­\r­80405...

How do I solve this? I need a single dataframe covering index 2 onwards.


Solution

  • You are getting this behaviour because rows is a list and slicing a list produces another list. When you access an element at a specific index, you get the object at that index; in this case, a DataFrame object.

    The pandas library ships with a concat function that can combine multiple DataFrame objects into one -- I believe this is what you want to do -- such that you have:

    import pandas as pd

    # List each table explicitly (continue for any remaining indexes)
    df_combo = pd.concat([rows[2], rows[3], rows[4], rows[5]])


    Even better:

    df_combo = pd.concat(rows[2:])
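
To make the difference between indexing and slicing concrete, here is a minimal sketch that runs without a PDF. The `rows` list below is a hypothetical stand-in for what `tabula.read_pdf` returned in the question (small made-up DataFrames, not the real statement), and `ignore_index=True` is an optional extra that was not in the original answer; it renumbers the combined rows instead of repeating each table's own 0-based index.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by tabula.read_pdf:
# two header-like tables followed by two transaction tables.
rows = [
    pd.DataFrame({0: ["Customer Statement"]}),
    pd.DataFrame({0: ["Customer Statement"]}),
    pd.DataFrame({0: ["03-Jan-2020", "10-Jan-2020"], 1: ["52,064.00", "27,064.00"]}),
    pd.DataFrame({0: ["20-Jan-2020"], 1: ["17,037.75"]}),
]

single = rows[2]   # indexing returns the element itself: a DataFrame
tail = rows[2:]    # slicing returns a new list of DataFrames

# Combine every table from index 2 onwards into one DataFrame.
# ignore_index=True (optional) renumbers the rows 0..n-1.
df_combo = pd.concat(tail, ignore_index=True)

print(type(single).__name__)
print(type(tail).__name__)
print(df_combo)
```

If the tables do not all share the same columns, `pd.concat` aligns them by column label and fills the gaps with NaN, so it is worth checking the combined result before relying on it.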