I am using tabula-py to extract a table from a pdf document like this:
rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True)
rows
This gives an output like so:
[ 0
0 Customer Statement\rxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 BAALE ST...
2 Period: January 1, 2020 April 12, 2020Openin...,
0
0 Customer Statement\xxxxxxxx\rP...
1 Print Date: April 12, 2020Address: 41 gg ST...,
0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via GG from\r...
1 NS Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
6 Airtime Purchase MBANKING\r101CT0000000001551...
7 Airtime Purchase MBANKING\r101CT0000000001552...
8 TRANSFER BETWEEN\rCUSTOMERS\r00001520012412113... ,
The data I want from this PDF starts at index 2, so I run:
rows[2]
And I get a single DataFrame.
Now I want everything from index 2 through the last index, so I tried:
rows[2:]
But I am getting a list and not the expected dataframe.
[ 0 1 2 3 4 5 \
0 03Jan2020 0 03Jan2020 NaN 50,000.00 52,064.00
1 10Jan2020 0 10Jan2020 25,000.00 NaN 27,064.00
2 10Jan2020 0 10Jan2020 25.00 NaN 27,039.00
3 10Jan2020 0 10Jan2020 1.25 NaN 27,037.75
4 20Jan2020 999921... 20Jan2020 10,000.00 NaN 17,037.75
5 23Jan2020 999984... 23Jan2020 4,050.00 NaN 12,987.75
6 23Jan2020 0 23Jan2020 1,000.00 NaN 11,987.75
7 24Jan2020 0 24Jan2020 2,000.00 NaN 9,987.75
8 24Jan2020 0 24Jan2020 NaN 30,000.00 39,987.75
6
0 TRANSFER BETWEEN\rCUSTOMERS Via gg from\r...
1 bi Instant Payment Outward\r000013200110121...
2 COMMISSION\r0000132001101218050000326...\rNIP ...
3 VALUE ADDED TAX VAT ON NIP\rTRANSFER FOR 00001
4 CASH WITHDRAWAL FROM\rOTHER ATM 210674 4420...
5 POS/WEB PURCHASE\rTRANSACTION 845061\r80405...
How do I solve this? I need a single DataFrame covering index 2 onwards.
You are getting this behaviour because rows is a list, and slicing a list produces another list. When you access an element at a specific index, you get the object at that index; in this case, a DataFrame object.
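A minimal sketch of the difference between indexing and slicing. The rows list here is a hypothetical stand-in built from tiny DataFrames, not the actual output of tabula.read_pdf, and the column name "a" is made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the list of tables returned by tabula.read_pdf
rows = [pd.DataFrame({"a": [i]}) for i in range(4)]

single = rows[2]   # indexing returns the element itself
tail = rows[2:]    # slicing returns a new list holding the remaining elements

print(type(single).__name__)  # DataFrame
print(type(tail).__name__)    # list
print(len(tail))              # 2
```

So rows[2:] is still a list of separate tables, and they have to be combined explicitly to get one DataFrame.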
The pandas library ships with a concat function that can combine multiple DataFrame objects into one, which I believe is what you want to do, such that you have:
import pandas as pd
df_combo = pd.concat([rows[2], rows[3], rows[4], rows[5] ...])
Even better:
df_combo = pd.concat(rows[2:])
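One thing to watch for: each table in the list carries its own index starting at 0, so the combined frame will contain repeated index labels. If you would rather have one continuous index, pd.concat accepts ignore_index=True. A small sketch, assuming two hypothetical page tables (page_a and page_b, with made-up values) in place of the real statement data:

```python
import pandas as pd

# Hypothetical stand-ins for the per-page tables found at rows[2:]
page_a = pd.DataFrame({0: ["03Jan2020", "10Jan2020"], 1: ["50,000.00", "25,000.00"]})
page_b = pd.DataFrame({0: ["20Jan2020"], 1: ["10,000.00"]})
rows = [pd.DataFrame(), pd.DataFrame(), page_a, page_b]

# Default behaviour keeps each table's original index labels
kept = pd.concat(rows[2:])

# ignore_index=True renumbers the combined rows from 0
renumbered = pd.concat(rows[2:], ignore_index=True)

print(list(kept.index))        # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

Whether to renumber depends on whether you still need to know which row came from which page; the default keeps that information at the cost of duplicate labels.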