python-3.x, scrapy, scrapy-shell

response selector is breaking the content into two different values


I'm trying to scrape the article titles from this page: https://onlinelibrary.wiley.com/doi/full/10.1111/pcmr.12547

In the Scrapy shell, when I run response.css("h2.article-section__title::text").extract(), I get the following output:

[' Efficacy of small MC1R‐selective ',
 '‐MSH analogs as sunless tanning agents that reduce UV‐induced DNA damage\n         ',
.....

This happens because the title's HTML contains an additional italics tag, which splits the text into separate text nodes:

<h2 class="article-section__title section__title section1" id="pcmr12547-sec-0002-title"> Efficacy of small MC1R‐selective <i>α </i>‐MSH analogs as sunless tanning agents that reduce UV‐induced DNA damage
         </h2>

I could fix this with Python code that joins the values until one ends with '\n'. But is there a way to fix it through Scrapy, or some other cleaner way?

Is there a way for Scrapy to return the value together with any HTML tags inside it or, better, to ignore the tags but still return the text within them?


Solution

  • You can extract the whole HTML element with:

    html_title = response.css(".article-section__title").get()
    

    Then you can turn the result into plain text with something like html-text:

    import html_text  # third-party package, installed as "html-text"
    title = html_text.extract_text(html_title)