I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The intention is to extract this information and show a visual of the tables and their names for a user to select relevant tables from the list.
I have tried extracting tables and then extracting text as well from pdfs. I am successful at both but not at connecting the table name to the table.
def tables_from_pdfs(filespath):
pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
print(pdffiles)
dictionary = {}
keys = []
for file in pdffiles:
print(file)
n = PyPDF2.PdfFileReader(open(file, 'rb')).getNumPages()
print(n)
tables_dict = {}
for i in range(n):
tables = camelot.read_pdf(file, pages = str(i))
tables_dict[i] = tables
head, tail = os.path.split(file)
tail = tail.replace(".pdf", "")
keys.append(tail)
dictionary[tail] = tables_dict
return dictionary, keys
The expected result is a table and the name of the table as stated in the pdf file. For instance: Table on page x of pdf name: Table 1. Blah Blah blah '''Table'''
Tables are listed with the TableList and Table functions in the camelot API found here: https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList
start in the web page where it says:
Lower-Lower-Level Classes
Camelot does not have a reference to the table name just the cell data descriptions. It does use python's panda database API though which may have the table name in it.
Combine usage of Camelot and Pandas to get the table name.
Get the name of a pandas DataFrame
appended update to answer
from https://camelot-py.readthedocs.io/en/master/
import camelot
tables = camelot.read_pdf('foo.pdf')
tables
<TableList n=1>
tables.export('foo.csv', f='csv', compress=True) # json, excel, html
tables[0]
<Table shape=(7, 7)>
tables[0].parsing_report
{
'accuracy': 99.02,
'whitespace': 12.24,
'order': 1,
'page': 1
}
tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
df_table = tables[0].df # get a pandas DataFrame!
#add
df_table.name = 'name here'
#from https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
import pandas as pd
df = pd.DataFrame( data=np.ones([4,4]) )
df.name = 'Ones'
print df.name
note: the added 'name' attribute is not part of df. While serializing the df, the added name attribute is lost.
More appended answer, the 'name' attribute is actually called 'index'.
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64