Convert edited PDF into TXT


I’m trying to write some code to convert a PDF to text, but the result is not what I expected. I have tried several libraries, including pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. Here is one of the last two scripts that I used:

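# extract_text is assumed to come from pdfminer.six (pip install pdfminer.six)
from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    # Return all text that pdfminer can extract from the PDF at `path`
    text = extract_text(path)
    return text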

# Change the file path according to the location of your PDF file
pdf_path = '/content/drive/MyDrive/PDF/file.pdf'

# Convert the PDF to text
text = convert_pdf_to_txt(pdf_path)

# Write the text to a file (UTF-8 so non-ASCII characters survive)
with open('extracted_text.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Print a confirmation message
print('The text has been saved to the "extracted_text.txt" file.')

However, when I use online PDF-to-text converters, the conversion comes out very well, almost perfect, without the errors that I encounter with both scripts. I have attached the PDF that I want to convert to text and the results that I get from both scripts when I try to convert my file.

These are the attached documents:

https://anonfiles.com/P09bnen5z6/file_pdf https://anonfiles.com/g7Aan6n5ze/Archive_txt


Solution

  • There should be no errors when dumping text with Poppler's pdftotext, but the result depends on how it is invoked and which options are passed, plus the quality of the source PDF. If the PDF itself is faulty, pdftotext cannot correct it; that has to be fixed before or after extraction.

    I will not dump the full page(s) because of the personal data, so forgive the #### masking.

    DOCUMENT #2021.07.B
    
    
    
    
    P.O. Box ####, Altamonte Springs, FL #### Office:
    (407) 2##-#### - Fax: (407) 6##-#### Email:
    ####@A##Q######A########.com
    Florida Licensed #### ######### MRSA ####
    Florida Licensed Engineering Firm COA #####
                                                                                        CUSTOMER INFO
    
      Client Name: C#### G#####                                                                                                       Date: ## / ## / 202#
    
      Address: #### Southwest ##### ###                                                    City: #### City                                              Zip: #####
    
      Home Phone:                                                  Cell Phone: 401-###-####                                     Email: ###@###.com
    
      Insurance Company: Frontline                                                                      Date of Loss: #/##/202#
    
      Policy #: 010000#####                                                                             Claim #: FPH3-0000#####
    
    
    
                                                                                     CONTRACT FOR SERVICES
    I, the Homeowner/Insured, and/or its representative for the property listed above (hereinafter “Client”), authorize Air Quality Assessors of Florida, its subcontractors and/or assignees
    (hereinafter collectively referred to as “AQA”), to enter said property to perform assessment services, including but not limited to indoor environmental assessments, asbestos testing,
    engineering inspections and post-mitigation verification. Client and AQA hereby acknowledge that the services to be provided are NOT being provided in an urgent or emergency circumstance
    as defined by 627.7152. Client agrees to fully cooperate with insurance company as required by the subject policy of insurance and comply with all post-loss duties required by same. Clients
    understands that the assessment services to be rendered are directly related and necessary as a result of the above-referenced loss and that it should provide a copy of any report prepared
    by AQA to its repair contractors to ensure a complete and proper repair of the damage to the subject property and to obtain any necessary building permits.
    
    
                                                    ASSIGNMENT OF INSURANCE CLAIM BENEFITS & DIRECT PAY AUTHORIZATION
    

    In my right-click "SendTo" folder (C:\Users\me\AppData\Roaming\Microsoft\Windows\SendTo) I have a shortcut to a cmd script that accepts a PDF file and runs:

    cd /d "%~dp1"
    "C:\Apps\PDF\poppler\23.01.0\Library\bin\pdftotext.exe" -layout -enc UTF-8 -nopgbrk "%~dpn1.pdf"
    echo continue to open in notepad
    pause
    notepad "%~dpn1.txt"
    

    For Windows users, easy-to-use zipped binaries are at https://github.com/oschwartz10612/poppler-windows, or you can use the conda variants if they are newer.
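
Since the question is tagged Python, the same pdftotext call can also be driven from a script. The following is only a minimal sketch, assuming the Poppler pdftotext executable is on PATH (or that its full path is passed in); the function names build_pdftotext_command and pdf_to_text are hypothetical helpers, not part of any library. It reuses the options from the cmd script above (-layout, -enc UTF-8, -nopgbrk).

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the pdftotext argument list and the output .txt path.

    `exe` is an assumption: it may be the bare name (found via PATH) or a
    full path to pdftotext.exe from a Poppler build.
    """
    pdf = Path(pdf_path)
    txt = pdf.with_suffix(".txt")  # same folder and name, .txt extension
    cmd = [exe, "-layout", "-enc", "UTF-8", "-nopgbrk", str(pdf), str(txt)]
    return cmd, txt


def pdf_to_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on `pdf_path` and return the extracted text."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"pdftotext executable not found: {exe}")
    cmd, txt = build_pdftotext_command(pdf_path, exe)
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(cmd, check=True)
    return txt.read_text(encoding="utf-8")
```

Calling pdf_to_text('/content/drive/MyDrive/PDF/file.pdf') would write file.txt next to the PDF and return its contents. As noted above, this does not repair a faulty source PDF; it only reproduces the layout-preserving extraction.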