web-scraping scrapy

Scrapy - Cleaning up <p> text that contains nested links (<a>) etc.


I am new to Python and to Scrapy as well. Nevertheless, I spent a few days trying to scrape news articles from an archive, and it worked.

The problem is that when I scrape the content of an article (the <p> elements), that content is filled with additional tags such as strong, a, etc. Scrapy doesn't pull out the text inside those tags, so I am left with a news article containing only about 2/3 of its text. Example HTML below:

<p> According to <a> Japan's newspapers </a> it happened ... </p>

I tried googling around and searching this forum. There were some suggestions, but everything I tried either did not work or broke my spider.

I have read about normalize-space and remove_tags, but they didn't work for me. Thank you in advance for any insights.


Solution

  • Please provide your selector for more detailed help.

    Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS), which only returns the text nodes directly inside the <p> elements, not the text inside their children.

    You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all of its descendants.

    You could also just select the <p>, not its text, and you'll get its children as well. From there you can start cleaning up the tags. There are answered questions regarding how to do that.
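
As a rough illustration of the first suggestion, here is a minimal sketch that collects every text node under each <p> and joins the pieces into one string. The helper names join_text_fragments and paragraph_texts are hypothetical (they are not part of Scrapy), and the sketch assumes response is a normal Scrapy response whose selectors support .getall(); on older Scrapy versions .extract() plays the same role.

```python
import re


def join_text_fragments(fragments):
    """Join text nodes into one string and collapse runs of whitespace.

    `fragments` is the list of strings a selector returns for a single <p>.
    """
    return re.sub(r"\s+", " ", "".join(fragments)).strip()


def paragraph_texts(response):
    """Yield the full text of each <p>, including text inside <a>, <strong>, etc.

    `response` is assumed to be a Scrapy response (or any object with a
    compatible .xpath() method).
    """
    for p in response.xpath("//p"):
        # './/text()' matches text nodes at any depth below this <p>,
        # so text inside nested tags is included.
        yield join_text_fragments(p.xpath(".//text()").getall())


# For the HTML in the question, the text nodes under the <p> would be:
fragments = [" According to ", " Japan's newspapers ", " it happened ... "]
print(join_text_fragments(fragments))
```

Inside a spider callback, something like `" ".join(paragraph_texts(response))` would then give the article body as one string. Joining per <p> rather than over the whole page keeps paragraph boundaries available if you need them later.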