
With BeautifulSoup 4 (lxml parser), how do I extract inner HTML from a tag (decode_contents not working)?


I'm using BeautifulSoup 4 and Python 3.7. I want to extract the inner HTML from an article element I've found. I have this:

soup = BeautifulSoup(html, features="lxml")
...
article_elt = top_article_elt.select('div[class*="outer"]')[0]
article = article_elt.decode_contents()
...
print("article: " + str(article) + " score:" + str(score))

However, the printed output still includes the outer tags:

article: <div class="outer"><p>Top story of the year.</p>
</div>

How do I write a statement that extracts only the inner HTML?


Solution

  • One quick fix could be to just go one level deep with .find():

    article = article_elt.find().decode_contents()
    

    However, this may only treat the symptom rather than the cause. Most likely, either you have nested div elements whose class contains "outer", or the class*="outer" substring check matches an unexpected element higher up the tree (for example, a wrapper with a class such as "outer-wrapper"). Try matching the class exactly:

    article_elt = top_article_elt.select_one('div.outer')
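
To illustrate the difference, here is a minimal sketch under assumed conditions. The markup is hypothetical (the real page is not shown in the question): it wraps div.outer in a div whose class only contains "outer". It also uses the built-in "html.parser" instead of lxml so that it runs without extra dependencies; the selectors should behave the same way with lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wrapper's class merely *contains* "outer".
html = """
<div id="top">
  <div class="outer-wrapper"><div class="outer"><p>Top story of the year.</p>
</div></div>
</div>
"""

# "html.parser" is an assumption made to keep the sketch dependency-free.
soup = BeautifulSoup(html, "html.parser")
top_article_elt = soup.select_one("#top")

# Substring match: the first hit in document order is the wrapper,
# so its inner HTML still contains <div class="outer">.
loose_elt = top_article_elt.select('div[class*="outer"]')[0]
loose_inner = loose_elt.decode_contents()

# Exact class match: only an element with the class token "outer" qualifies.
strict_elt = top_article_elt.select_one("div.outer")
strict_inner = strict_elt.decode_contents()

print("substring match:", loose_inner)
print("exact class match:", strict_inner)
```

If this matches your situation, decode_contents() is working as intended; it is simply being called on an ancestor of the element you actually want.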