Find and insert tags using regex

I am converting a book from PDF to epub in calibre. But the titles are not inside header tags, hence trying a python function using regex to replace it.

example text:

<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>

i tried using regex with

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    list = re.findall(pattern, match.group())
    
    for x in list:
        x = "</a>(?i)chapter [0-9]+: [\w\s]+(.?)<br>"
        x = s.split("</a>", 1)[0] + '</a><h2>' + s.split("a>",1)[1]
        x = s.split("<br>", 1)[0] + '</h2><br>' + s.split("<br>",1)[1]
    return match.group()

and


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
    s.replace(re.match(pattern, s), r'<h2>$0')

But still not getting expected result. what i want is...

Input

</a>Chapter 370: Slamming straight on</p>

Output

</a><h2>Chapter 370: Slamming straight on</h2></p>

h2 tag is to be added in all similar instances

Solution

regex shoudn't be used to parse xml. See : Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms (Why shouldn't you..would be a better title)

However, you might use BeautifulSoup instead:

from bs4 import BeautifulSoup
data = """<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t"""

soup = BeautifulSoup(data, 'lxml')


for x in soup.find_all('p', {'class':'calibre1'}):

    link = x.find('a')
    title = x.text
    corrected_title = soup.new_tag('h2')
    corrected_title.append(title)

    if link:
        x.string=''
        corrected_title = soup.new_tag('h2')
        corrected_title.append(title)
        link.append(corrected_title)
        x.append(link)

print(soup.body)

Output

<body>
    <p class="calibre1">
        <a id="p1">
            <h2>Chapter 370: Slamming straight on</h2>
        </a>
    </p>
    <p class="softbreak"> </p>
    <p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
    <p class="calibre1">
        <a id="p7">
            <h2>Chapter 372: Yan Zhaoge’s plan</h2>
        </a>
    </p>
    <p class="softbreak"> </p>
    <p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
    i t
</body>