I want to parse a part of an HTML page, say
my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
<a href="#">Link1</a>
<a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""
I pass this string to BeautifulSoup:
soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
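To make those comments concrete, this is roughly the processing I have in mind (a minimal sketch; the find_all loop and the comment variable are just illustrations, and the template return is omitted):
for a in soup.find_all('a'):
    a['rel'] = 'nofollow'   # mark every link in the fragment as nofollow
comment = str(soup)         # this rendered string is what goes to the template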
But during parsing BeautifulSoup adds <html>, <head> and <body> tags (when using the lxml or html5lib parsers), and I don't need those in my code. The only way I've found so far to avoid this is to use html.parser.
I wonder if there is a way to get rid of the redundant tags using lxml, the fastest parser.
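To illustrate the difference (a minimal comparison, assuming BeautifulSoup 4 with lxml installed):
from bs4 import BeautifulSoup

fragment = '<p>Some text <a href="#">Link1</a></p>'

# lxml wraps the fragment in a document skeleton
print(BeautifulSoup(fragment, 'lxml'))
# <html><body><p>Some text <a href="#">Link1</a></p></body></html>

# html.parser keeps the fragment as-is
print(BeautifulSoup(fragment, 'html.parser'))
# <p>Some text <a href="#">Link1</a></p>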
UPDATE
Originally my question was phrased incorrectly. I have now removed the <div> wrapper from my example, since a typical user does not use this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.
I could solve the problem using the .contents property:
try:
    children = soup.body.contents
    string = ''
    for child in children:
        string += str(child)
    return string
except AttributeError:
    return str(soup)
I think that ''.join(soup.body.contents) would be a neater way to convert the list to a string, but this does not work and I get
TypeError: sequence item 0: expected string, Tag found
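A join still works if each child is converted to a string first; a minimal sketch of that variant:
# str() each child (Tag or NavigableString) before joining
string = ''.join(str(child) for child in soup.body.contents)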