I'm trying to write a program that extracts the text from a PDF, searches for keywords, and highlights them, then writes a new PDF with the keywords highlighted. I don't know whether I need to extract the text and then write a new file, or whether I can highlight the words without extracting them. I need to keep the text formatting. I tried reportlab, but it extracted the text and lost the formatting. I'm new to programming, so this may be easy to solve, but I don't have the skill yet.
I'm an electrical engineer and need to read a lot of technical specifications, such as IEC or NBR (the Brazilian version of IEC), so this code would help me a lot.
Here is what I have coded so far:
import PyPDF2

# Path to the input PDF
pdf_file = r"C:\Users\pietro\Desktop\Projects\espectest.pdf"
words = ["Teste"]

# Create a PdfReader for the input PDF and a PdfWriter for the output
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_writer = PyPDF2.PdfWriter()

# Get the number of pages in the PDF
num_pages = len(pdf_reader.pages)

# Get the text of each page of the PDF
for i in range(num_pages):
    texto = pdf_reader.pages[i].extract_text()
    # here I need to figure out how to highlight the words and write them to the new file
    # Print the text of the current page
    print(texto)
I've tried PyPDF2, reportlab, fitz (PyMuPDF), and pdfplumber.
Use PyMuPDF.
import fitz  # PyMuPDF

my_keywords = ["kw1", "kw2", "kw3"]
doc = fitz.open("input.pdf")  # the PDF
for page in doc:  # iterate over the pages
    for kw in my_keywords:  # iterate over the keywords
        rectlist = page.search_for(kw)  # locate keyword on page
        for rect in rectlist:  # iterate over its occurrences
            page.add_highlight_annot(rect)  # highlight it
doc.save("output.pdf")  # save the highlighted copy (output name is just an example)