I would like to highlight text in my PDF file using the PyMuPDF library. The method search_for() returns the location of the searched words. The problem is that this method ignores spaces and upper/lower case, and it works only for ASCII characters.
Is there any solution to get the location/coordinates without using search_for()?
My code:
import re
import fitz  # PyMuPDF

pattern = re.compile(r'(\[V2G2-\d{3}\])(\s{1,}\w(.+?)\. )')
matched = []
# text holds the previously extracted document text
for m in re.finditer(pattern, text):
    matched.append(m.group())

def do_highlight():
    pdf_document = fitz.open("ISO_15.pdf")
    page_num = pdf_document.page_count
    for i in range(page_num):
        page = pdf_document[i]
        for item in matched:
            search_instances = page.search_for(item, quads=True)
            for q in search_instances:
                highlight = page.add_highlight_annot(q)
                # RGB(127, 255, 255)
                highlight.set_colors({"stroke": (0.5, 1, 1), "fill": (0.75, 0.8, 0.95)})
                highlight.update()
    pdf_document.save(r"output.pdf")
It ignores the second sentence because of the spaces between the words.
Using the search method is just one way to get hold of the coordinates required for highlighting. You can also use any of the page.get_text() variants that return text coordinates. Looking at your example, the "blocks" variant may be sufficient, or a combination of "words" and "blocks" extractions.
page.get_text("blocks") returns a list of items like (x0, y0, x1, y1, "line1\nline2\n, ...", blocknumber, blocktype). The first 4 items in the tuple are the coordinates of the enveloping rectangle.
page.get_text("words") extracts a list of words (strings containing no spaces) with similar items: (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber).
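To make the two tuple layouts concrete, here is a minimal sketch that groups word strings by their block number. The helper name words_by_block and the sample tuples (including their coordinates) are made up for illustration; real input would come from page.get_text("words").

```python
from collections import defaultdict


def words_by_block(words):
    """Group word strings by block number.

    `words` is assumed to have the layout of page.get_text("words"):
    (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber).
    """
    grouped = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        grouped[block_no].append(text)
    return dict(grouped)


# Hand-made sample tuples in the "words" layout (coordinates are invented).
sample_words = [
    (72.0, 100.0, 130.0, 112.0, "[V2G2-001]", 0, 0, 0),
    (135.0, 100.0, 160.0, 112.0, "The", 0, 0, 1),
    (72.0, 140.0, 110.0, 152.0, "Other", 1, 0, 0),
]
print(words_by_block(sample_words))
```

The block number stored in each word is what lets you go from a matching word back to the rectangle of the block that contains it.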
You could inspect the "words" for items matching the regex pattern and then highlight the respective block. This can probably even be done without regular expressions. Here is a snippet that may serve your intention:
import fitz  # PyMuPDF

def matches(word):
    """True for a tag like "[V2G2-123]", optionally followed by a period."""
    return word.startswith("[V2G2-") and word.endswith(("]", "]."))

def add_highlight(page, rect):
    """Highlight annots have no fill color"""
    annot = page.add_highlight_annot(rect)
    annot.set_colors(stroke=(0.5, 1, 1))
    annot.update()

doc = fitz.open("ISO_15.pdf")  # file name taken from the question
flags = fitz.TEXTFLAGS_TEXT  # need identical flags for all extractions
for page in doc:
    blocks = page.get_text("blocks", flags=flags)
    words = page.get_text("words", flags=flags)
    for word in words:
        blockn = word[-3]  # block number
        if matches(word[4]):
            block = blocks[blockn]  # get the containing block
            block_rect = fitz.Rect(block[:4])
            add_highlight(page, block_rect)
doc.save("output.pdf")  # output name taken from the question
So the approach used here is: check if a block contains a matching word. If so, highlight it.
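If you do want to keep a regular expression, one possible variant is to collapse all whitespace in each block's text before matching, so that extra spaces and line breaks no longer matter. The following is only a sketch under assumptions: PATTERN is a simplified version of the regex from the question, and matching_block_rects and highlight_pdf are hypothetical helper names, not part of PyMuPDF.

```python
import re

# Simplified from the question's regex (an assumption, adjust as needed).
# Whitespace is normalized first, so a single space in the pattern suffices.
PATTERN = re.compile(r"\[V2G2-\d{3}\] \w.+?\.")


def matching_block_rects(blocks, pattern=PATTERN):
    """Return (x0, y0, x1, y1) of every block whose text matches `pattern`.

    `blocks` is assumed to have the layout of page.get_text("blocks"):
    (x0, y0, x1, y1, "text", blocknumber, blocktype).
    """
    rects = []
    for block in blocks:
        # Collapse runs of spaces and line breaks into single spaces.
        text = " ".join(block[4].split())
        if pattern.search(text):
            rects.append(tuple(block[:4]))
    return rects


def highlight_pdf(src, dst):
    """Highlight all matching blocks of `src` and save the result as `dst`."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependency

    doc = fitz.open(src)
    for page in doc:
        blocks = page.get_text("blocks", flags=fitz.TEXTFLAGS_TEXT)
        for rect in matching_block_rects(blocks):
            annot = page.add_highlight_annot(fitz.Rect(rect))
            annot.set_colors(stroke=(0.5, 1, 1))
            annot.update()
    doc.save(dst)
```

Calling highlight_pdf("ISO_15.pdf", "output.pdf") would then process the file from the question. Because each block is tested once, a block is highlighted at most once even if it contains several tags.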