Search code examples
pythonpdfhighlight

Extracting text from a PDF with Python - Highlight


I'm trying to code a program that extract a text from a PDF, search for key-words and highlight them, so write a new pdf with the key-words highlighted. IDK if i need to extract the text and then write a new one or if I can just highlight words without extracting them. I need to keep the text formatting, I tried to use reportlab, but it extracted the text and lost the text formatting. I'm new on programming, so maybe it's easy to solve the problem, but I dont have the skill.

I'm an electrical engineer and need to read a lot of technical specification, like IEC or NBR (Brazilian version of IEC), so if i have this code it'll help me a lot

Here is something that i code till now:

import PyPDF2

# Abre o arquivo PDF

pdf_file = r"C:\\Users\\pietro\\Desktop\\Projects\\espectest.pdf"

words = \["Teste"\]

# Cria um objeto PDFReader para o arquivo PDF aberto

pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_writer = PyPDF2.PdfWriter()

# Pega o número de páginas do PDF

num_pages = len(pdf_reader.pages)

# Cria uma lista zerada

pages=\[\]

# Obtém o texto da página atual do PDF

for i in range(num_pages):
page=pdf_reader.pages\[i\]
texto = pdf_reader.pages\[i\].extract_text()
pages.append(page)
pdf_writer.add_page(page)
\#---------------------------------------------------------------------------------------

#here i need to discover how to highlight words and write them on the new file

#----------------------------------------------------------------------------------------

# Imprime o texto da página atual do PDF

pdf_writer.write("especteste123.pdf")

I've tried PyPDF2, reportlab, Fitz, PDFPlumber

Solution

  • Use PyMuPDF.

    import fitz  # PyMuPDF
    
    my_keywords = ["kw1", "kw2", "kw3"]
    doc = fitz.open("input.pdf")  # the PDF
    for page in doc:  # iterate over the pages
        for kw in my_keywords:  # iterate over the keywords
            rectlist = page.search_for(kw)  # locate keyword on page
            for rect in rectlist:  # iterate over its occurences
                page.add_highlight_annot(rect)  # highlight it
    
    doc.save("output.pdf")