Search code examples
pythonpython-3.xpython-camelot

How to extract table name along with table using camelot from pdf files using python?


I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The intention is to extract this information and show a visual of the tables and their names for a user to select relevant tables from the list.

I have tried extracting tables and then extracting text as well from pdfs. I am successful at both but not at connecting the table name to the table.

def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        n = PyPDF2.PdfFileReader(open(file, 'rb')).getNumPages()
        print(n)
        tables_dict = {}
        for i in range(n):
            tables = camelot.read_pdf(file, pages = str(i))
            tables_dict[i] = tables
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys

The expected result is a table and the name of the table as stated in the pdf file. For instance: Table on page x of pdf name: Table 1. Blah Blah blah '''Table'''


Solution

  • Tables are listed with the TableList and Table functions in the camelot API found here: https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList


    start in the web page where it says:


    Lower-Lower-Level Classes


    Camelot does not have a reference to the table name just the cell data descriptions. It does use python's panda database API though which may have the table name in it.


    Combine usage of Camelot and Pandas to get the table name.


    Get the name of a pandas DataFrame


    appended update to answer


    from https://camelot-py.readthedocs.io/en/master/

    import camelot
    tables = camelot.read_pdf('foo.pdf')
    tables
    <TableList n=1>
    tables.export('foo.csv', f='csv', compress=True) # json, excel, html
    tables[0]
    <Table shape=(7, 7)>
    tables[0].parsing_report
    {
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
    }
    tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
    df_table = tables[0].df # get a pandas DataFrame!
    
    #add
    df_table.name = 'name here'
    
    
    #from https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
    import pandas as pd
    df = pd.DataFrame( data=np.ones([4,4]) )
    df.name = 'Ones'
    
    print df.name
    

    note: the added 'name' attribute is not part of df. While serializing the df, the added name attribute is lost.


    More appended answer, the 'name' attribute is actually called 'index'.


    Getting values
    
    >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
    ...      index=['cobra', 'viper', 'sidewinder'],
    ...      columns=['max_speed', 'shield'])
    >>> df
                max_speed  shield
    cobra               1       2
    viper               4       5
    sidewinder          7       8
    
    Single label. Note this returns the row as a Series.
    
    >>> df.loc['viper']
    max_speed    4
    shield       5
    Name: viper, dtype: int64