Tags: python, beautifulsoup, unwrap

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags


I've been searching Stack Overflow for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.

I'm trying to strip the wrapping HTML tags (a, span, blockquote) from a snippet. I don't want to use get_text(), because the snippet might contain other tags, like img, that I'd like to keep and use later. BeautifulSoup doesn't quite behave as I expect it to:

from bs4 import BeautifulSoup

html = """
<div>
<div class="somewhat">
    <div class="not quite">
    </div>
    <div class="here">
    <blockquote>
        <span>
            <a href = "sth.jpg"><br />content<br /></a>
        </span>
    </blockquote>
    </div>
    <div class="not here either">
    </div>
</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
    for y in x.find_all('div', {"class":"here"}):    # find all the "here" divs
        for inp in y.find_all("blockquote"):         # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")            # replace br tags
            for link in inp('a'):
                inp.a.unwrap()                       # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()                    # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()              # <----- should unwrap blockquote
            la_lista.append(inp)

print(la_lista)

The result is as follows:

[<blockquote>


content


</blockquote>]

Any ideas?


Solution

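  • y.find_all("blockquote") returns a ResultSet whose items are bs4.element.Tag objects, so inp is already the blockquote tag itself. Calling inp('blockquote') searches only the descendants of inp, never inp itself, so it returns an empty list and the body of that last loop never runs.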

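    The fix is to remove this loop:

        for block in inp('blockquote'):
            inp.blockquote.unwrap()

    and replace:

        la_lista.append(inp)

    with:

        la_lista.append(inp.decode_contents())

    decode_contents() returns the inner markup of the tag as a string, without the enclosing blockquote tags.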

    This is based on the answer to the question "BeautifulSoup innerhtml".
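
Putting the pieces together, here is a minimal sketch of the corrected loop. It assumes a trimmed-down version of the HTML from the question; the names results, quote and nested_matches are illustrative and not from the original post. It also iterates over each matched tag directly (br.replace_with, tag.unwrap) instead of going through inp.br or inp.a again, which should behave the same here but is easier to follow.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the snippet from the question (an assumption
# for illustration; the empty sibling divs are left out).
html = """
<div class="somewhat">
    <div class="here">
    <blockquote>
        <span>
            <a href="sth.jpg"><br />content<br /></a>
        </span>
    </blockquote>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
nested_matches = []
for outer in soup.find_all("div", class_="somewhat"):
    for inner in outer.find_all("div", class_="here"):
        for quote in inner.find_all("blockquote"):
            # A tag is not its own descendant, so searching the blockquote
            # for "blockquote" finds nothing. This is why the original
            # unwrap loop never ran.
            nested_matches.append(len(quote.find_all("blockquote")))

            # find_all() returns a list-like snapshot, so it is safe to
            # modify the tree while looping over it.
            for br in quote.find_all("br"):
                br.replace_with("\n")
            for tag in quote.find_all(["a", "span"]):
                tag.unwrap()

            # Keep only the inner markup, without the <blockquote> wrapper.
            results.append(quote.decode_contents())

print(results)
```

Any tag that is not unwrapped, such as an img inside the blockquote, would stay in the returned string as markup, which is the reason for preferring this over get_text().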