Tags: python, beautifulsoup, html-parsing

Update/Add Nested Data Using BeautifulSoup


I'm using BeautifulSoup with Python to parse an HTML page and update its content as required. A simplified version of my page structure is as follows:

<div class="main">
<div class="class_1">
<p><br/></p>
<div class="panel">Some content here </div>
<div class="panel">Another content here </div>
</div>
</div>

I would like to update the content of <div class="class_1">. I can already use the BeautifulSoup parser to get the contents of <div class="class_1">, and I have the new data that I want in the page stored as a list of strings, shown below:

['<div class="panel">Some content here </div>', 
'<div class="panel">Updated new content here </div>', 
'<div class="panel">Hello new div here! </div>']

How can I get the following result? I tried replace_with, but it replaces < with &lt;, which isn't desirable. I'm not very familiar with Beautiful Soup, so I'm not sure what other options would help me achieve the following.

<div class="main">
<div class="class_1">
<p><br/></p>
<div class="panel">Some content here </div>
<div class="panel">Updated new content here </div>
<div class="panel">Hello new div here! </div>
</div>
</div>
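
The &lt; escaping described above can be reproduced in isolation. The sketch below is a minimal illustration, not part of the original question: it assumes a cut-down one-panel document, and the variable names (`escaped`, `parsed`, `fragment`) are made up for the example. It shows that a plain Python string passed to replace_with is treated as text, while a fragment parsed first is inserted as a real tag.

```python
from bs4 import BeautifulSoup

# Cut-down document used only for this illustration (an assumption, not the real page).
SNIPPET = '<div class="class_1"><div class="panel">old</div></div>'
NEW_PANEL = '<div class="panel">new</div>'

# Case 1: pass a plain string. Beautiful Soup stores it as a text node,
# so the angle brackets are escaped when the tree is serialized.
soup = BeautifulSoup(SNIPPET, "html.parser")
soup.select_one(".panel").replace_with(NEW_PANEL)
escaped = str(soup)

# Case 2: parse the string first and pass the resulting Tag.
# The tag is moved into the tree as markup, so nothing is escaped.
soup = BeautifulSoup(SNIPPET, "html.parser")
fragment = BeautifulSoup(NEW_PANEL, "html.parser")
soup.select_one(".panel").replace_with(fragment.div)
parsed = str(soup)

print(escaped)
print(parsed)
```

In other words, the new content has to be parsed into tags before it is inserted; a raw string always ends up as text.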

Solution

  • Try:

    from bs4 import BeautifulSoup
    
    html_doc = """
    <div class="main">
    <div class="class_1">
    <p><br/></p>
    <div class="panel">Some content here </div>
    <div class="panel">Another content here </div>
    </div>
    </div>
    """
    
    new_content = [
        '<div class="panel">Some content here </div>',
        '<div class="panel">Updated new content here </div>',
        '<div class="panel">Hello new div here! </div>',
    ]
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    # locate the correct <p> element:
    p = soup.select_one(".class_1 p")
    
    # delete old content:
    # tags:
    for t in p.find_next_siblings():
        t.extract()
    # text (if any); `string=` is the current name for the older `text=` argument:
    for t in p.find_next_siblings(string=True):
        t.extract()
    
    # place new content:
    p.insert_after(BeautifulSoup("\n" + "\n".join(new_content) + "\n", "html.parser"))
    
    print(soup)
    

    Prints:

    <div class="main">
    <div class="class_1">
    <p><br/></p>
    <div class="panel">Some content here </div>
    <div class="panel">Updated new content here </div>
    <div class="panel">Hello new div here! </div>
    </div>
    </div>
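
    A possible variation, offered only as a sketch: instead of removing everything after the <p>, remove just the existing panel tags by class and append the parsed replacements. The helper name `replace_panels` is hypothetical, and the code assumes each string in the list holds exactly one top-level <div>. Whitespace between tags may differ from the output shown above.

```python
from bs4 import BeautifulSoup


def replace_panels(html, new_panels):
    """Hypothetical helper: swap the .panel tags inside .class_1 for new ones."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(".class_1")

    # Remove only the existing panels; the <p> and any other children stay.
    for old in container.select(".panel"):
        old.decompose()

    # Parse each string so it is appended as a tag, not as escaped text.
    # Assumption: every fragment contains a single top-level <div>.
    for fragment in new_panels:
        container.append(BeautifulSoup(fragment, "html.parser").div)
        container.append("\n")

    return soup


html_doc = """
<div class="main">
<div class="class_1">
<p><br/></p>
<div class="panel">Some content here </div>
<div class="panel">Another content here </div>
</div>
</div>
"""

new_content = [
    '<div class="panel">Some content here </div>',
    '<div class="panel">Updated new content here </div>',
    '<div class="panel">Hello new div here! </div>',
]

soup = replace_panels(html_doc, new_content)
texts = [div.get_text() for div in soup.select(".class_1 > .panel")]
print(texts)
```

    This version depends on the class name rather than on the position of the <p>, so it only fits pages where every element to be replaced carries the panel class.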