I have a pdf with a table in it, and trying to get that table into Pandas. Extracting pdf tables is notoriously difficult to get right, but I have found tabula works best. It is far and away the best I have seen, though still not perfect. I have this pdf table:
Note the headers of the table, and how they sometimes run onto the next row, and are generally not nice. I am prefacing the real problem, because all of the solutions I can see call for a better approach to creating the dataframe in the first place. This dataframe is the best I could get, but if you know how to get a better one out of tabula, please let me know.
I use tabula-py, and run this code:
holdingsDF = tbla.read_pdf(fileName, nospreadsheet=True)
There are many options, and I am not aware of all of them, but this seems to be the best I can do, after fiddling with the options. And the resulting DF is this:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Identifier Unnamed: 4 Curren Unnamed: 6 TAG0 Strategy
0 Asset Type Name Identifier NaN Quantity NaN Price NaN
1 NaN NaN NaN Type NaN cy NaN Name
2 Equity Akamai Technologies Inc AKAM US Equity TICKER (5,830) USD 65.000 AKAM
3 Convertible Bond AKAM 0 02/15/19 US00971TAG67 ISIN 1,595,000 USD 100.875 AKAM
4 Equity Advanced Micro Devices Inc AMD US Equity TICKER (181,500) USD 13.490 AMD
5 Convertible Bond AMD 2 1/8 09/01/26 US007903BD80 ISIN 1,650,000 USD 185.500 AMD
If you look, you can see the header has been sort of split amongst the first three rows. How do I fix this? Are there any clever, elegant solutions to take the 'best' of the three names for each column, and make that the column name?
Thanks!!
This is a pure pandas solution - assuming the dataframe is read exactly as pasted below.
df.columns = df.columns.str.replace('Unnamed.*', '') + \
df.iloc[0].fillna('') + \
df.iloc[1].fillna('')
df.drop([0,1], inplace=True)
1 Asset Type Name Identifier \
2 Equity Akamai Technologies Inc AKAM US Equity
3 Convertible Bond AKAM 0 02/15/19 US00971TAG67
4 Equity Advanced Micro Devices Inc AMD US Equity
5 Convertible Bond AMD 2 1/8 09/01/26 US007903BD80
1 IdentifierType Quantity Currency Price TAG0 StrategyName
2 TICKER (5,830) USD 65.000 AKAM
3 ISIN 1,595,000 USD 100.875 AKAM
4 TICKER (181,500) USD 13.490 AMD
5 ISIN 1,650,000 USD 185.500 None