I have a very long text in XML format, like:
><span class='ocrx_word' id='word_1_21_0_1_0' title='bbox 409 912 417 927'><</span><span class='ocrx_word' id='word_1_21_0_1_1' title='bbox 416 911 446 925'><forest>...
This hOCR text was produced by Google Document AI. I want to make a searchable PDF from the hOCR file, but when I try to create the PDF, the PDF library I use raises an error. The library treats the word <forest> as a corrupted XML element, so I want to replace the word <forest> with &lt;forest&gt;.
I could find the patterns using a regex: (?!<(div|span|\/span).*>)(<.*>)
This expression excludes the <span> and </span> elements, and matches only the words enclosed between < and >.
But how can I change only the first and the last character?
You can use the following statement:
re.sub(r"(?!<(?:div|span|\/span).*>)<([^<>]*)>", r"&lt;\1&gt;", my_string)
Note that < and > are excluded from the capturing group, so only the text between them is carried over into the replacement.
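As a minimal sketch of how this substitution behaves, the snippet below applies it to a single word element. The sample string is a hypothetical one modeled on the hOCR in the question, and the helper name escape_bracketed_words is only an illustration, not part of any library.

```python
import re

# Matches <...> unless the tag starts with div, span, or /span.
# The group captures only the text between the angle brackets.
PATTERN = re.compile(r"(?!<(?:div|span|\/span).*>)<([^<>]*)>")


def escape_bracketed_words(hocr: str) -> str:
    """Replace <word> with &lt;word&gt;, leaving span tags untouched."""
    # Raw string, so \1 is a backreference and not the character chr(1).
    return PATTERN.sub(r"&lt;\1&gt;", hocr)


# Hypothetical one-word sample modeled on the hOCR in the question.
sample = (
    "<span class='ocrx_word' id='word_1_21_0_1_1' "
    "title='bbox 416 911 446 925'><forest></span>"
)
fixed = escape_bracketed_words(sample)
print(fixed)
```

One thing to watch for: the lookahead lists div, span, and /span only, so any other tag in the file (a closing </div>, for example) would be rewritten as well. If your file contains such tags, extending the alternation is probably the simplest fix.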
I've also replaced .* with [^<>]*, because . also matches < and >.
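To illustrate the difference between the two quantified patterns, here is a small comparison on a made-up string (not taken from the hOCR file) with two bracketed words on one line:

```python
import re

# Hypothetical sample: two bracketed words on the same line.
text = "<forest> and <tree>"

# Greedy .* also consumes < and >, so it runs to the last > on the line.
greedy = re.findall(r"<(.*)>", text)

# [^<>]* cannot cross a bracket, so each word is matched separately.
bounded = re.findall(r"<([^<>]*)>", text)

print(greedy)   # ['forest> and <tree']
print(bounded)  # ['forest', 'tree']
```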