
Remove whitespace and line breaks from extracted text (Python scraping)


I am having trouble extracting text from a web page. I am using Scrapy with an XPath selector.

The page contains markup like this:

<div class="snippet-content">
    <h2>First Child</h2>
    <p>Hello</p>
    This is large text ..........
</div>

I need the text that follows the two child elements. The selector I am using is:

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()

The text is extracted correctly, but it contains leading and trailing spaces, non-breaking spaces (NBSP), and \r\n line breaks.

For example, the extracted text looks like this:

"         \r\nRemarks byNBSP Deputy Prime Minister andNBSP Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023.                                "

Is there a way to get clean text without the surrounding whitespace, the line-break characters, and the NBSP characters?


Solution

  • You can use the XPath function normalize-space, but it does more than remove whitespace from the beginning and end of a string. If the string also contains runs of spaces or other whitespace characters, it collapses each run into a single space, wherever the run occurs in the string.

    Alternatively, you can use Python's str.strip method, which by default (called with no argument) removes whitespace only from the beginning and end of a string.

    Examples:

    text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()
    
    text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get().strip()
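
Neither example above obviously handles NBSP characters in the middle of the text: str.strip only touches the two ends, and as far as I know XPath 1.0 treats only space, tab, CR, and LF as whitespace, so normalize-space should leave NBSP alone. The following is a minimal sketch of doing the cleanup in plain Python instead. It assumes the "NBSP" in the question stands for the non-breaking space character U+00A0; the helper name clean_text and the shortened sample string are made up for illustration.

```python
def clean_text(raw):
    """Collapse every run of whitespace to one space and trim both ends."""
    # Scrapy's .get() returns None when the XPath matches nothing,
    # so guard against that before calling string methods.
    if raw is None:
        return ""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes \r, \n and the non-breaking space U+00A0.
    return " ".join(raw.split())


# Shortened sample, assuming NBSP means U+00A0 (written here as \xa0).
raw = "         \r\nRemarks by\xa0Deputy Prime Minister and\xa0Coordinating Minister   "

print(repr(raw.strip()))      # ends are trimmed, inner \xa0 characters remain
print(repr(clean_text(raw)))  # inner \xa0 characters become ordinary spaces
```

With a Scrapy response, this would be called as clean_text(response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()). Note that, like normalize-space, it also collapses internal runs of spaces, so it is not suitable if that spacing needs to be preserved.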