python pdf text

Extract text from PDF files and preserve the original layout, in Python


I want to extract text from PDF files while keeping the text laid out as it is in the PDF, like the images below. The images show results from PDFLayoutTextStripper [github.com/JonathanLink/PDFLayoutTextStripper].

I tried the code below, but it doesn't maintain the layout. I want to get exactly the results shown in the images using any Python library, such as PyPDF2, pdfplumber, or pdfminer. I have tried all of these libraries without getting the desired result. How can I extract the text from the PDF file exactly as it is shown in the images?

    from pdfminer.high_level import extract_text

    text = extract_text('test.pdf')
    print(text)

Solution

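  • You can preserve the layout and indentation with the pdftotext package, a Python wrapper around Poppler's text extraction (Poppler must be installed on the system). Iterating over the PDF object yields the text of each page as one string.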

    import pdftotext
    
    with open("target_file.pdf", "rb") as f:
        pdf = pdftotext.PDF(f)
    
    # All pages
    for text in pdf:
        print(text)
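
    If you cannot install pdftotext, the same effect can be approximated by hand once you have word positions. The sketch below is only an illustration of the general idea (map each word's x position to a text column and its y position to a text row); it is not how pdftotext or PDFLayoutTextStripper is implemented. The function name render_layout, the (x, y, text) tuple format and the default char_width and line_height values are all assumptions made for this example. Word positions could come from a library that reports them, for example pdfplumber's page.extract_words(), which returns x0, top and text for each word.

```python
from collections import defaultdict


def render_layout(words, char_width=6.0, line_height=12.0):
    """Place (x, y, text) words on a fixed-width character grid.

    Hypothetical helper: x and y are assumed to be in points, measured
    from the top-left corner of the page. char_width and line_height are
    assumed average glyph metrics and usually need tuning per document.
    """
    # Group words by the text row their y coordinate falls into.
    rows = defaultdict(list)
    for x, y, text in words:
        rows[round(y / line_height)].append((round(x / char_width), text))

    if not rows:
        return ""

    lines = []
    # Walk every row between the first and last so blank lines are kept.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous word already reaches this column:
                # keep a single space so the words do not merge.
                line += " "
            else:
                # Pad with spaces up to the word's target column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        (0, 0, "Name"), (120, 0, "Qty"),
        (0, 12, "Apple"), (120, 12, "3"),
    ]
    print(render_layout(sample))
```

    With real PDFs the two metrics matter a lot: a char_width that is too large squeezes columns together, and one that is too small produces very wide lines, so expect to adjust them for each font size.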