I am having trouble extracting text from a web page. I am using an XPath selector with Scrapy for this.
The page contains markup like this:
<div class="snippet-content">
<h2>First Child</h2>
<p>Hello</p>
This is large text ..........
</div>
I basically need the text that follows the two child elements (the h2 and the p). The selector I am using is this:
text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()
The text is extracted correctly, but it contains whitespace, non-breaking space (NBSP) characters, and line-break characters (\r\n).
For example, the extracted text looks like this:
" \r\nRemarks byNBPS Deputy Prime Minister andNBPS Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023. "
Is there a way to get clean, sanitized text without the leading and trailing whitespace, line-break characters, and NBSP characters?
You can use the XPath function normalize-space, but this does more than simply remove whitespace from the beginning and end of a string. If the string also contains runs of spaces or other whitespace characters, it reduces each run to a single space, regardless of where it occurs in the string.
Alternatively, you can use the Python str.strip method, which by default (when called with no argument) removes whitespace characters only from the beginning and end of a string.
Examples:
text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()
text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get(default="").strip()
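Note that neither approach removes non-breaking spaces in the middle of the string: normalize-space only treats space, tab, carriage return and line feed as whitespace, and strip only touches the two ends. If that matters, one option is to collapse all whitespace in Python after extraction. The following is a minimal sketch, assuming the NBSP characters arrive as U+00A0 in the extracted string; clean_text is a hypothetical helper name, not part of Scrapy, and the sample string is a shortened stand-in for the one in the question.

```python
from typing import Optional


def clean_text(raw: Optional[str]) -> str:
    """Collapse every run of whitespace (including NBSP) to a single space."""
    if raw is None:  # .get() returns None when the selector matches nothing
        return ""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers \r, \n and U+00A0, and discards empty pieces.
    return " ".join(raw.split())


# Shortened stand-in for the extracted text; \xa0 is the NBSP character.
raw = " \r\nRemarks by\xa0 Deputy Prime Minister and\xa0 Coordinating Minister. "
print(clean_text(raw))
```

In the spider this could be called on the result of the original selector, for example clean_text(response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()).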