I'm using BeautifulSoup 4 and Python 3.7. I want to extract the inner HTML from an article element I've found. I have this:
soup = BeautifulSoup(html, features="lxml")
...
article_elt = top_article_elt.select('div[class*="outer"]')[0]
article = article_elt.decode_contents()
...
print("article: " + str(article) + " score:" + str(score))
However, the printed output still includes the outer tags:
article: <div class="outer"><p>Top story of the year.</p>
</div>
How do I write a statement that extracts only the inner HTML?
One quick fix could be to just go one level deeper with .find():
article = article_elt.find().decode_contents()
But this might just be treating a symptom and not the problem itself. It looks like either you have nested div elements with class="outer", or the class*="outer" check matches some unexpected element further up the tree. Try:
article_elt = top_article_elt.select_one('div.outer')
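To see why the selector matters, here is a minimal sketch. The markup is hypothetical (I'm guessing at a wrapper div whose class merely contains "outer", since the real page isn't shown), and it uses the built-in "html.parser" instead of "lxml" only so it runs without extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a wrapper whose class only *contains* "outer",
# around the div that actually has class="outer".
html = (
    '<div class="outer-wrapper">'
    '<div class="outer"><p>Top story of the year.</p></div>'
    '</div>'
)

soup = BeautifulSoup(html, features="html.parser")

# Substring match: the wrapper matches too and comes first in document order,
# so its "inner HTML" still contains the <div class="outer"> tag.
loose = soup.select('div[class*="outer"]')[0]
loose_inner = loose.decode_contents()

# Class selector: matches only elements that have the class token "outer".
strict = soup.select_one('div.outer')
strict_inner = strict.decode_contents()

print("loose: ", loose_inner)
print("strict:", strict_inner)
```

If your real markup looks like this, decode_contents() is already doing the right thing and only the selector needs tightening. If instead you really have a div.outer nested inside another div.outer, select_one('div.outer') will still return the outermost one, and you would need a more specific selector or the .find() step above.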