python beautifulsoup html-content-extraction

Extracting tag content based on content value using BeautifulSoup


I have an HTML document in the following format.

<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>

I want to extract the content of the paragraph tag, including the content of the italic and bold tags but not the content of the anchor tag. If possible, I would also like to ignore the number at the beginning.

The expected output is: Content of the paragraph in italic but not strong.

What is the best way to do it?

Also, the following code snippet raises TypeError: argument of type 'NoneType' is not iterable

soup = BSoup(page)
for p in soup.findAll('p'):
    if '&nbsp;&nbsp;&nbsp;' in p.string:
        print p

Thanks for the suggestions.


Solution

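  • Your code fails because tag.string is only set when the tag has exactly one child and that child is a NavigableString. Your p tag has several children (text, i, b and a), so p.string is None, and testing "in None" raises the TypeError.

    You can achieve what you want by removing the a tags from the paragraph with extract() and then joining the remaining text nodes: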

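    # BeautifulSoup 3 import (Python 2)
    from BeautifulSoup import BeautifulSoup

    s = """<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> but not <b> strong </b> <a href="url">ignore</a>.</p>"""
    # Convert entities such as &nbsp; to their Unicode characters while parsing
    soup = BeautifulSoup(s, convertEntities=BeautifulSoup.HTML_ENTITIES)

    for p in soup.findAll('p'):
        # Remove every <a> tag, together with its text, from the paragraph
        for a in p.findAll('a'):
            a.extract()
        # Join the remaining text nodes, including those inside <i> and <b>
        print ''.join(p.findAll(text=True))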
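
For readers on Python 3, here is a minimal sketch of the same idea using the newer bs4 package instead of BeautifulSoup 3. It is not part of the original answer. The helper name paragraph_text and the two clean-up regexes (dropping the leading "1." and the space left before the final period) are assumptions added for illustration, to get closer to the expected output in the question. The sketch also shows that p.string is None for this paragraph, which is what triggers the TypeError.

```python
import re

from bs4 import BeautifulSoup  # bs4 (BeautifulSoup 4), assumed to be installed

html = ('<p>&nbsp;&nbsp;&nbsp;1. Content of the paragraph <i> in italic </i> '
        'but not <b> strong </b> <a href="url">ignore</a>.</p>')


def paragraph_text(p):
    """Hypothetical helper: text of a <p> tag without <a> content or leading number."""
    # Remove every <a> tag, together with its text, from the paragraph.
    for a in p.find_all('a'):
        a.extract()
    # Collect the remaining text, including the text inside <i> and <b>.
    text = p.get_text()
    # Collapse runs of whitespace; str.split() also treats \xa0 (&nbsp;) as whitespace.
    text = ' '.join(text.split())
    # Assumption: the leading numbering looks like "1." and should be dropped.
    text = re.sub(r'^\d+\.\s*', '', text)
    # Assumption: remove the space left in front of punctuation once <a> is gone.
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)
    return text


soup = BeautifulSoup(html, 'html.parser')
first = soup.find('p')

# The <p> tag has several children, so .string is None here.
string_before = first.string
print(string_before)  # None

result = paragraph_text(first)
print(result)  # Content of the paragraph in italic but not strong.
```

Note that extract() modifies the parsed tree in place, so parse the document again if the anchor tags are needed later.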