Python: How can I add a counter to the replacement argument of re.sub()?


I'd like to add ids to html tags. For example, I'd like to change:

<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>

to

<p id="1">First paragraph</p>
<p id="2">Second paragraph</p>
<p id="3">Third paragraph</p>

IIRC, it's possible to use a lambda function to achieve this functionality, but I can't remember the exact syntax.


Solution

  • I would use an HTML parser, like BeautifulSoup.

    The idea is to iterate over all paragraphs using enumerate() for indexing, starting with 1:

    from bs4 import BeautifulSoup
    
    data = """
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <p>Third paragraph</p>
    """
    
    soup = BeautifulSoup(data, 'html.parser')
    for index, p in enumerate(soup.find_all('p'), start=1):
        p['id'] = index
    
    print(soup)
    

    Prints:

    <p id="1">First paragraph</p>
    <p id="2">Second paragraph</p>
    <p id="3">Third paragraph</p>
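
For completeness, here is a minimal sketch of the re.sub() approach the question asks about. re.sub() accepts a function as its replacement argument and calls it once per match, so a lambda can pull the next number from a counter on each call. The names counter and result are illustrative, and the pattern assumes the tags are plain <p> with no existing attributes; for real-world HTML the parser approach above is more robust.

```python
import itertools
import re

data = """
<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
"""

# Shared counter that yields 1, 2, 3, ... on successive next() calls.
counter = itertools.count(1)

# re.sub() invokes the lambda once for each match, left to right,
# so every <p> tag receives the next number from the counter.
# Assumption: tags are exactly "<p>" (no attributes, lowercase).
result = re.sub(r"<p>", lambda match: '<p id="{}">'.format(next(counter)), data)

print(result)
```

With the sample input this produces the same three numbered paragraphs as the BeautifulSoup version. Note that the counter keeps its state, so create a fresh itertools.count(1) before each call if the numbering should restart.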