pdf-scraping python-camelot

Headers are not extracted when extracting table data from a PDF using Camelot


I am using Camelot for table data extraction, but the table headers are not being extracted from the PDF.

The target PDF is linked below. The tables that need to be extracted are on pages 3 and 4.

https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing

One of the tables looks like this: [image]

I have read the Camelot documentation, and I think the problem is related to the "Detect short lines" section:

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines

However, I have not been able to resolve the problem by tweaking the line_size_scaling parameter.

Please assist.


Solution

  • I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). I then tried the table_areas keyword argument with flavor='lattice', but it didn't include the lines in the specified table boundary [bug 2]. I've added these to the issue tracker as #200 and #201.

    You can still use the table_areas keyword argument with flavor='stream' to get the table out.

    Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf

    Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])

    You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

    Hope that helps!

    [image: detected table boundary on page 3]
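The workaround above can be sketched as a small Python script. This is only an illustration, not part of Camelot: the helper names table_area, plot_page_text and extract_with_area are hypothetical. It assumes camelot-py is installed (with its plotting extras for the plot call) and that table_areas strings follow the documented x1,y1,x2,y2 form, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space.

```python
def table_area(x1, y1, x2, y2):
    """Format a bounding box as a Camelot table_areas string.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates have their origin at the bottom-left of the page,
    so the top edge (y1) must be larger than the bottom edge (y2).
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected top-left (x1, y1) and bottom-right (x2, y2)")
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def plot_page_text(pdf_path, page):
    """Return a matplotlib figure of the page text, to read coordinates off the axes."""
    # Third-party import kept local so table_area() works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="stream")
    if len(tables) == 0:
        raise RuntimeError("no tables detected on page %s" % page)
    # kind="text" draws the text on the page; call .show() on the result to view it.
    return camelot.plot(tables[0], kind="text")


def extract_with_area(pdf_path, page, area, csv_path=None):
    """Extract one table with flavor='stream', restricted to the given area."""
    import camelot

    tables = camelot.read_pdf(
        pdf_path, pages=str(page), flavor="stream", table_areas=[area]
    )
    if len(tables) == 0:
        raise RuntimeError("no table found in area %s" % area)
    table = tables[0]
    if csv_path is not None:
        table.to_csv(csv_path)  # same result as the --output/--format csv CLI call
    return table.df  # pandas DataFrame; the header row should now be included
```

For the PDF in the question, the call would be extract_with_area('007.pdf', 3, table_area(60, 770, 520, 400), csv_path='007.csv'). The coordinate check is only a guard against swapping the two corners, which is easy to do when reading values off the plot.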