BeautifulSoup: parse only part of the page


I want to parse part of an HTML page, say:

my_string = """
<p>Some text. Some text. Some text. Some text. Some text. Some text.
   <a href="#">Link1</a>
   <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""

I pass this string to BeautifulSoup:

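from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string, "lxml")  # parser named explicitly; html5lib behaves similarly
for a in soup.find_all("a"):
    a["rel"] = "nofollow"  # add rel="nofollow" to <a> tags
# return comment to the template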

But during parsing, BeautifulSoup adds <html>, <head> and <body> tags (when using the lxml or html5lib parsers), and I don't need those in my output. The only way I've found so far to avoid this is to use html.parser.
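To make the difference concrete, here is a minimal sketch that checks whether a parser wrapped the fragment. It assumes the bs4 package is installed; `adds_wrapper` is a hypothetical helper name used only for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def adds_wrapper(markup, parser):
    """Return True if the parser wrapped the fragment in a <body> tag."""
    soup = BeautifulSoup(markup, parser)
    # soup.body is None when no <body> element exists in the parsed tree
    return soup.body is not None


fragment = "<p>One more paragraph</p>"
print(adds_wrapper(fragment, "html.parser"))  # prints False: the fragment is left as-is
# adds_wrapper(fragment, "lxml") additionally requires the lxml package
```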

I wonder if there is a way to get rid of the redundant tags while using lxml, the fastest parser.

UPDATE

My original question was worded incorrectly. I have now removed the <div> wrapper from my example, since an ordinary user does not use this tag. For this reason we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.


Solution

  • I solved the problem using the .contents property:

    try:
        # soup.body is None when the parser added no <body> wrapper
        children = soup.body.contents
        string = ''
        for child in children:
            string += str(child)  # was str(item), an undefined name
        return string
    except AttributeError:
        return str(soup)
    

    I thought ''.join(soup.body.contents) would be a neater way to convert the list to a string, but it does not work: str.join accepts only strings, and .contents also holds Tag objects, so I get

    TypeError: sequence item 0: expected string, Tag found
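
One way around the TypeError is to convert each child with str() inside the join. The following is only a sketch under a few assumptions: bs4 is installed, `fragment_html` is a hypothetical helper name, and the default "lxml" parser argument requires the lxml package (pass "html.parser" otherwise).

```python
from bs4 import BeautifulSoup


def fragment_html(markup, parser="lxml"):
    """Add rel="nofollow" to links and return the HTML without wrapper tags."""
    soup = BeautifulSoup(markup, parser)
    for a in soup.find_all("a"):
        a["rel"] = "nofollow"
    if soup.body is None:
        # html.parser adds no <html>/<body> wrapper, so serialize everything
        return str(soup)
    # Children are Tag or NavigableString objects; str() makes both joinable
    return "".join(str(child) for child in soup.body.contents)


print(fragment_html('<p><a href="#">Link1</a></p>', parser="html.parser"))
```

Checking `soup.body is None` replaces the try/except AttributeError from the snippet above. `soup.body.decode_contents()` should give the same inner markup in a single call, if you prefer not to join by hand.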