python html beautifulsoup html-parsing

How to add space around removed tags in BeautifulSoup


from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

I have this sample code, and I can't figure out how to add spaces around the removed tags, so that the text inside the <a href...> elements stays readable instead of being displayed like this:

PoemThe RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace


Solution

  • One option would be to find all text nodes and join them with a space:

    " ".join(item.strip() for item in poems.find_all(text=True))
    

    Additionally, you are using the BeautifulSoup 3 package, which is outdated and no longer maintained. Upgrade to beautifulsoup4:

    pip install beautifulsoup4
    

    and replace:

    from BeautifulSoup import BeautifulSoup
    

    with:

    from bs4 import BeautifulSoup
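
As a rough sketch of how this could look after upgrading, Beautiful Soup 4 also lets you pass a separator to get_text(), which does the joining in one call. Joining stripped text nodes by hand can leave a stray trailing space when a node contains only whitespace (as in the second div), whereas strip=True skips such nodes. The names HTML and extract_poems below are made up for illustration, and the built-in "html.parser" is assumed as the parser:

```python
from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

# Sample markup copied from the question; HTML is an illustrative name.
HTML = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


def extract_poems(html):
    """Hypothetical helper: text of each div.thisText, space-separated."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return [
        # separator=" " is placed between text nodes; strip=True trims each
        # node and skips nodes that are empty after trimming.
        div.get_text(separator=" ", strip=True)
        for div in soup.find_all("div", class_="thisText")
    ]


for text in extract_poems(HTML):
    print(text)
```

With the sample markup, this should print "Poem The Raven Once upon a midnight dreary, ..." and "... part of The Haunted Palace", each with single spaces where the tags used to be.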