python html pdf extract

Extract text and tables from PDF & Word


I'm building an app that extracts text and tables from PDF files and generates HTML from them. What is the best way to make sure the extracted data keeps the same format as the original PDF or Word document? For example, one of my files contains a table that is split across two pages; it is not extracted correctly, and the spacing in the generated HTML does not match the original file.

I have tried extracting the tables with pdfplumber, which works for the most part, but as mentioned it has issues with tables that span two pages.


Solution

  • The PDF format is based on a fixed layout and lacks any logical or semantic structure. While a visually formatted document might appear to have a clear organisation into headings, paragraphs and tables, this structure is mostly not explicitly represented in the PDF's internal data hierarchy.

    Unlike HTML or Word documents, nothing inside the PDF file indicates that a table is present. It is just text placed at physical x,y locations on the page, which "looks" like a table. This makes it very difficult for any library (pdfplumber, Camelot, Tabula, Pdftables, Pdf-table-extract, etc.) to extract tables correctly 100% of the time. They use techniques such as rule-based heuristics, computer vision, machine learning, or a combination of these to extract tables from PDFs. When a table overflows onto the next page, it gets even harder, because page headers, page numbers or watermarks may appear before the table continues.

    Converting the PDF to Word and then extracting from that will not work either, for the same reason: the converter cannot know that it is looking at a table, so it cannot recreate it as a real "table" in Word.
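
For the page-overflow case specifically, a rule-based workaround is to extract tables page by page and stitch them together afterwards. The sketch below is only a heuristic I am assuming for illustration, not something pdfplumber does for you: it treats the first table on a page as a continuation of the last table on the previous page when both have the same number of columns, and drops a header row that is repeated after the break. The helper names (`merge_continued_tables`, `extract_tables_per_page`) are hypothetical; only `pdfplumber.open`, `pdf.pages` and `page.extract_tables()` are actual pdfplumber calls.

```python
from typing import List, Optional

Row = List[Optional[str]]
Table = List[Row]


def _norm(row: Row) -> List[str]:
    """Normalise a row for comparison: None becomes "", whitespace is collapsed."""
    return [" ".join((cell or "").split()) for cell in row]


def merge_continued_tables(pages: List[List[Table]]) -> List[Table]:
    """Stitch tables that continue across a page break.

    `pages` holds, for each page, the tables found on it in reading order.
    Assumed heuristic: the first table on a page continues the last table of
    the previous page if both have the same column count. A header row that
    is repeated after the break is dropped.
    """
    merged: List[Table] = []
    prev_page_last: Optional[Table] = None  # last table of the previous page
    for tables in pages:
        tables = [t for t in tables if t]  # ignore empty tables
        for i, table in enumerate(tables):
            prev = prev_page_last if i == 0 else None
            if prev is not None and len(prev[0]) == len(table[0]):
                rows = table
                if _norm(rows[0]) == _norm(prev[0]):
                    rows = rows[1:]  # skip the repeated header
                prev.extend(list(r) for r in rows)
            else:
                merged.append([list(r) for r in table])
        # A page without tables breaks the continuation chain.
        prev_page_last = merged[-1] if tables else None
    return merged


def extract_tables_per_page(pdf_path: str) -> List[List[Table]]:
    """Collect pdfplumber's tables for every page of the file."""
    import pdfplumber  # third-party; imported lazily so the merge logic stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_tables() for page in pdf.pages]
```

Typical use would be `merge_continued_tables(extract_tables_per_page("report.pdf"))`, where the file name is a placeholder. The column-count rule will wrongly join two unrelated tables that happen to have the same width, so the result still needs checking against the original.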

    There are two solutions which work remarkably well in my experience:

    1. High-quality OCR software. I have seen success with Azure AI Document Intelligence.
    2. Extract the text in a layout-preserving manner and then pass it to an LLM to extract the tables. This works well with GPT-4 and Gemini Pro.

    However, both approaches require you to send your data to a third party.
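
As a rough illustration of the second approach, the sketch below pulls layout-preserving text out of each page and assembles a prompt for a model. It assumes a pdfplumber version in which `page.extract_text(layout=True)` is available; the function names and the prompt wording are hypothetical, and the call to the model itself is left out because it depends on which provider you use.

```python
from typing import List


def extract_layout_text(pdf_path: str) -> List[str]:
    """Return one layout-preserving text string per page."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # layout=True asks pdfplumber to pad with spaces so columns stay aligned
        return [page.extract_text(layout=True) or "" for page in pdf.pages]


def build_table_prompt(page_texts: List[str]) -> str:
    """Assemble a prompt asking a model to return the tables as HTML.

    The wording is only an example; adjust it to the model you use.
    """
    marked = [
        f"=== PAGE {n} ===\n{text}"
        for n, text in enumerate(page_texts, start=1)
    ]
    instructions = (
        "The text below was extracted from a PDF with its column spacing "
        "preserved. Identify every table and return each one as an HTML "
        "<table>. A table may continue on the next page; merge such parts "
        "into a single table and ignore page headers, footers and page numbers."
    )
    return instructions + "\n\n" + "\n\n".join(marked)
```

The page markers give the model a chance to recognise where a table crosses a page break. The returned string is what would be sent to GPT-4, Gemini Pro or a similar model.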