Search code examples
pythonpandasdataframetabula

How to rename unnamed columns in Pandas?


I have a pdf with a table in it, and trying to get that table into Pandas. Extracting pdf tables is notoriously difficult to get right, but I have found tabula works best. It is far and away the best I have seen, though still not perfect. I have this pdf table:

Table

Note the headers of the table, and how they sometimes run onto the next row, and are generally not nice. I am prefacing the real problem, because all of the solutions I can see call for a better approach to creating the dataframe in the first place. This dataframe is the best I could get, but if you know how to get a better one out of tabula, please let me know.

I use tabula-py, and run this code:

holdingsDF = tbla.read_pdf(fileName, nospreadsheet=True)

There are many options, and I am not aware of all of them, but this seems to be the best I can do, after fiddling with the options. And the resulting DF is this:

         Unnamed: 0                  Unnamed: 1      Unnamed: 2 Identifier Unnamed: 4 Curren Unnamed: 6 TAG0 Strategy  
0        Asset Type                        Name      Identifier        NaN   Quantity    NaN      Price           NaN  
1               NaN                         NaN             NaN       Type        NaN     cy        NaN          Name  
2            Equity     Akamai Technologies Inc  AKAM US Equity     TICKER    (5,830)    USD     65.000          AKAM   
3  Convertible Bond             AKAM 0 02/15/19    US00971TAG67       ISIN  1,595,000    USD    100.875          AKAM   
4            Equity  Advanced Micro Devices Inc   AMD US Equity     TICKER  (181,500)    USD     13.490           AMD  
5  Convertible Bond          AMD 2 1/8 09/01/26    US007903BD80       ISIN  1,650,000    USD    185.500           AMD 

If you look, you can see the header has been sort of split amongst the first three rows. How do I fix this? Are there any clever, elegant solutions to take the 'best' of the three names for each column, and make that the column name?

Thanks!!


Solution

  • This is a pure pandas solution - assuming the dataframe is read exactly as pasted below.

    df.columns = df.columns.str.replace('Unnamed.*', '') + \
                 df.iloc[0].fillna('') + \
                 df.iloc[1].fillna('')
    
    df.drop([0,1], inplace=True)
    
    
    1        Asset Type                        Name      Identifier  \
    2            Equity     Akamai Technologies Inc  AKAM US Equity   
    3  Convertible Bond             AKAM 0 02/15/19    US00971TAG67   
    4            Equity  Advanced Micro Devices Inc   AMD US Equity   
    5  Convertible Bond          AMD 2 1/8 09/01/26    US007903BD80   
    
    1 IdentifierType   Quantity Currency    Price TAG0 StrategyName  
    2         TICKER    (5,830)      USD   65.000              AKAM  
    3           ISIN  1,595,000      USD  100.875              AKAM  
    4         TICKER  (181,500)      USD   13.490               AMD  
    5           ISIN  1,650,000      USD  185.500              None