python regex cloud-document-ai

Is there a solution to select the first and the last character of certain regex patterns?


I have a very long text in XML format, like this:

><span class='ocrx_word' id='word_1_21_0_1_0' title='bbox 409 912 417 927'><</span><span class='ocrx_word' id='word_1_21_0_1_1' title='bbox 416 911 446 925'><forest>...

This hOCR text was produced by Google Document AI. I want to make a searchable PDF from the hOCR file, but the PDF library I use raises an error when I try: it treats the word <forest> as a corrupted XML element. So I want to replace the word <forest> with &lt;forest&gt;.

I can find these patterns with the regex: (?!<(div|span|\/span).*>)(<.*>)

This expression excludes the <span> and </span> elements and matches only the words enclosed between < and >. But how can I change only the first and the last character?


Solution

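  • You can use the following substitution:

    re.sub(r"(?!<(?:div|span|\/span).*>)<([^<>]*)>", r"&lt;\1&gt;", my_string)

    Note that < and > are excluded from the capturing group, so \1 holds only the text between them. The replacement is a raw string so that \1 is read as a backreference rather than as the control character \x01, and it ends in &gt; with its closing semicolon.

    I've also replaced .* with [^<>]*, because . also matches < and >, so .* could run past the closing > of the word.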

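A minimal sketch of the substitution applied to a hOCR fragment. The function name escape_angle_words and the shortened sample string are assumptions made for illustration, not part of the original file. The replacement is written as a raw string, since in a normal Python string "\1" is the character \x01 rather than a backreference.

```python
import re

# Skip <div>, <span> and </span> tags; capture only the text between < and >.
PATTERN = re.compile(r"(?!<(?:div|span|\/span).*>)<([^<>]*)>")


def escape_angle_words(hocr: str) -> str:
    """Replace <word> with &lt;word&gt;, leaving the excluded tags untouched."""
    # Raw replacement string: \1 refers to the captured text.
    return PATTERN.sub(r"&lt;\1&gt;", hocr)


# Hypothetical, shortened hOCR fragment.
sample = "<span class='ocrx_word' title='bbox 416 911 446 925'><forest></span>"
print(escape_angle_words(sample))
# <span class='ocrx_word' title='bbox 416 911 446 925'>&lt;forest&gt;</span>
```

Only the tags listed in the lookahead are protected, so any other tag in the file (for example a closing </div>) would need to be added to the alternation before running this on a full document.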