Tags: python, regex, html-parsing, beautifulsoup

How to remove whitespace in BeautifulSoup


I'm parsing a bunch of HTML with BeautifulSoup, and it's been going pretty well except for one minor snag. I want to save the output as a single-line string, but this is my current output:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like:

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of, but strip() alone won't necessarily remove it, and I can't simply delete every space because I need to keep the text intact. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags so I can be a little more forceful there.

Thanks once again!


Solution

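  • Here is how you can do it without regular expressions: split the string into lines, strip the leading and trailing whitespace from each line, and join the pieces back together with an empty string.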

    >>> html = """    <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky but one two three!
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky2
    ...                 </span></li>
    ... <li><span class="plaincharacterwrap break">
    ...                     Zazzafooky3
    ...                 </span></li>
    ... """
    >>> html = "".join(line.strip() for line in html.split("\n"))
    >>> html
    '<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
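
One caveat with the line-by-line approach: lines are joined with nothing in between, so text that wraps across lines inside an element gets glued together ("one two" followed by "three" on the next line becomes "one twothree"). If that matters for your data, a regex-based variant is one option. The following is only a minimal sketch, not a BeautifulSoup feature: `collapse_html_whitespace` is a hypothetical helper name, and it assumes, as the question states, that there are no <pre> tags whose whitespace must be preserved.

```python
import re


def collapse_html_whitespace(html):
    """Hypothetical helper: squeeze redundant whitespace out of an HTML string.

    Assumes there is no <pre> (or similar) content whose whitespace matters.
    """
    # Collapse every run of whitespace, newlines included, to a single space.
    html = re.sub(r"\s+", " ", html)
    # Remove the space left directly after a '>' ...
    html = re.sub(r">\s+", ">", html)
    # ... and the space left directly before a '<'.
    html = re.sub(r"\s+<", "<", html)
    return html.strip()


# Text wrapped across two lines keeps a single space between the words.
wrapped = "<p>one two\n    three</p>"
print(collapse_html_whitespace(wrapped))  # <p>one two three</p>
```

Note that this also removes a space sitting between text and an adjacent tag (for example "text <b>bold</b>" becomes "text<b>bold</b>"), which can change how inline content renders, so it is best suited to markup like the list items in the question.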