I am new to Python and to Scrapy as well. Nevertheless, I spent a few days trying to scrape news articles from a news archive, SUCCESSFULLY.
The PROBLEM is that when I scrape the CONTENT of the article (<p>), that content is filled with additional tags like strong, a, etc. As a result, Scrapy won't pull that text out and I am left with a news article containing about 2/3 of the text. An example of the HTML is below:
<p> According to <a> Japan's newspapers </a> it happened ... </p>
I tried googling around and searching the forum here. There were some suggestions, but the ones I tried either did not work or broke my spider:
I have read about normalize-space and remove_tags, but they didn't work for me. Thank you in advance for any insights.
Please provide your selector for more detailed help.
Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS), which only returns the text directly inside the <p> and not the text inside its child elements.
You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all its descendants.
You could also just select the <p>, not its text, and you'll get its children as well. From there you can start cleaning up the tags. There are answered questions regarding how to do that.