
How to remove watermark from PDF file using Python's PyPDF2 lib


I have written code that extracts the text from a PDF file with Python and the PyPDF2 lib. The code works well for most docs, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so the text is not recognised:

import requests
from io import BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):

    all_pdf_content = ''

    #sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content


    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
    #extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        #looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            #store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE '+ str(page+1) + '/' + str(read_pdf.getNumPages()) +'\n\n\n'

    all_pdf_content += pdf_file_text + "\n\n"
        
    return all_pdf_content



pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'

print(pdf_content_extraction(pdf_link))

This is the result that I'm getting:

#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...

My question is, how can I fix this problem? Is there a way to remove the watermark from the page, or something like that? Or maybe the problem is not the watermark/logo at all and can be fixed in some other way?


Solution

  • The garbled text has nothing to do with the watermark in the document. The problem appears to be related to how the text in the document is encoded. PyPDF2 should normally be able to extract the German characters in your document, because it uses the latin-1 (iso-8859-1) encoding/decoding model, but that model isn't working with your PDF.

    When I look at the underlying info of your PDF I note that it was created using these apps:

    • 'Producer': 'GPL Ghostscript 9.10'
    • 'Creator': 'PDFCreator Version 1.7.3'

    When I look at one of the PDFs in this question, which is also written in German, I note that it was created using different apps:

    • '/Creator': 'Acrobat PDFMaker 11 für Excel'
    • '/Producer': 'Adobe PDF Library 11.0'

    I can read the second file perfectly with PyPDF2.

    When I looked at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. That file was created with the same apps as the file from this bounty question.

    • 'Producer': 'GPL Ghostscript 9.10'
    • 'Creator': 'PDFCreator Version 1.7.3'

    This is the same file that threw an error when attempting to extract the text using pdfreader.SimplePDFViewer.

    I looked at the bug reports for Ghostscript and noted that there are some font-related issues for Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font embedding issues.
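
    The Producer and Creator values quoted above come from the PDF's document information dictionary. If you want to check them yourself without any PDF library, a rough sketch like the following might be enough. The function name pdf_info_fields and the sample bytes are my own illustration, and the scan assumes the entries are stored as plain literal strings in an uncompressed part of the file, which will not be true for every PDF.

```python
import re


def pdf_info_fields(raw_pdf, keys=(b"Producer", b"Creator")):
    """Roughly scan raw PDF bytes for literal-string entries like /Producer (...).

    This is a heuristic, not a PDF parser: it misses hex or UTF-16 strings,
    entries inside compressed object streams, and strings with nested
    unescaped parentheses.
    """
    found = {}
    for key in keys:
        # /Key, optional whitespace, then a (...) literal with backslash escapes
        pattern = rb"/" + key + rb"\s*\(((?:\\.|[^\\)])*)\)"
        match = re.search(pattern, raw_pdf)
        if match:
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found


# Hand-written stand-in for the relevant part of a PDF file (hypothetical bytes)
sample = b"<< /Producer (GPL Ghostscript 9.10) /Creator (PDFCreator Version 1.7.3) >>"
print(pdf_info_fields(sample))
```

    With the download code from the question, you would pass response.content instead of the sample bytes. An empty result only means this simple scan found nothing, not that the file has no metadata.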

    I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.

    Here are some of the sequences:

    print(page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace'))
    # output
    # \u02d8
    # \u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
    # \u02d8\u02c6!"""\u02c6\u02d8\u02c6!
    
    
    print(page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape'))
    # output
    # ˘
    # ˇˆ˙˝˙˛˚˙˜
    # ˘ˆ!"""ˆ˘ˆ!
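
    To see why no codec round-trip is likely to help, it can be useful to look at which code points the extracted characters actually are. The sketch below uses only the standard library; describe_chars and fits_latin1 are names I made up for illustration. As far as I can tell, the accents that show up are the ones PDFDocEncoding assigns to byte values 0x18 to 0x1F, which would suggest that raw character codes are being decoded without the font's own mapping, but treat that as my assumption rather than a confirmed diagnosis.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) for each distinct non-whitespace char."""
    seen = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in seen]


def fits_latin1(ch):
    """True if the character can be represented in latin-1 (iso-8859-1)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# A few of the characters that appear in the extracted output
sample = "\u02d8\u02c6\u02d9\u02dd\u02dc"
for ch, code_point, name in describe_chars(sample):
    status = "latin-1" if fits_latin1(ch) else "outside latin-1"
    print(code_point, name, status)
```

    None of these characters fit into latin-1, so the text that PyPDF2 returned no longer contains the original German letters in any recoverable form. Re-encoding the string can only change how the wrong characters are written.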
    
    

    I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016. Encoding issues are also a common problem with the module, and its maintenance is effectively dead, hence the ports PyPDF3 and PyPDF4.

    I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.
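
    Since your code works for most documents, you may want to detect this kind of failure automatically and only switch to another extractor when needed. Here is a minimal sketch of one possible heuristic, using only the standard library. The names readable_ratio and looks_garbled and the 0.5 threshold are my own assumptions, not something PyPDF2 provides, and the German sample string is just an arbitrary example.

```python
import unicodedata


def readable_ratio(text):
    """Fraction of non-whitespace characters that are plain letters or digits.

    Only the categories Lu, Ll and Nd count, so modifier letters and spacing
    accents (which dominate the garbled output) are treated as noise.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # empty output is treated as unreadable
    ordinary = sum(
        1 for ch in visible if unicodedata.category(ch) in ("Lu", "Ll", "Nd")
    )
    return ordinary / len(visible)


def looks_garbled(text, threshold=0.5):
    """Guess whether extracted text is unusable (threshold is an assumption)."""
    return readable_ratio(text) < threshold


# First two lines of the garbled output from the question
garbled = "#$%\u02d8\u02d8\n&'(\u02dd\u02d9\u02dd\u02d9)*+\"*\u02dc"
print(looks_garbled(garbled))
print(looks_garbled("Grüße aus Zürich"))
```

    You could call looks_garbled on page_content inside the loop and fall back to a different library for that file when it returns True. The threshold would need tuning on your own documents.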


    You can definitely extract the content from your document using other Python modules.

    Tika

    This example uses Tika and BeautifulSoup to extract the content in German from your source document.

    import requests
    from tika import parser
    from io import BytesIO
    from bs4 import BeautifulSoup
    
    pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
    response = requests.get(pdf_link)
    with BytesIO(response.content) as data:
        parse_pdf = parser.from_buffer(data, xmlContent=True)
    
        # Parse metadata from the PDF
        metadata = parse_pdf['metadata']
    
        # Parse the content from the PDF
        content = parse_pdf['content']
    
        # Convert double newlines into single newlines
        content = content.replace('\n\n', '\n')
        soup = BeautifulSoup(content, "lxml")
        body = soup.find('body')
        for p_tag in body.find_all('p'):
            print(p_tag.text.strip())
    
    

    pdfminer

    This example uses pdfminer to extract the content from your source document.

    import requests
    from io import BytesIO
    from pdfminer.high_level import extract_text
    
    
    pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
    response = requests.get(pdf_link)
    with BytesIO(response.content) as data:
        text = extract_text(data, password='', page_numbers=None, maxpages=0, caching=True,
                            codec='utf-8', laparams=None)
        print(text.replace('\n\n', '\n').strip())