python, web-scraping, scrapy

How can I get only the text with a Scrapy selector in Python?


I hope you are doing well.

<ul>
  <li>
    <s>Title:</s>
    De Aardappeleters
  </li>
  <li>
    <s>Dimensions:</s>
    82 x 114 cm
  </li>
  <li>
    <s>Media:</s>
    canvas
  </li>
  <li>
    <s>Style:</s>
    Realism
  </li>
  <li>
    <s>Date:</s>
    1885
  </li>
  <li>                     <!-- on this page, Genre is the last item -->
    <s>Genre:</s>
    Modern
  </li>
</ul> 

I have the HTML block above and want to extract the text of one of its li elements. Unfortunately, that li has no class or ID that I can select on. The block comes from a website.

  <li>
    <s>Genre:</s>
    Modern
  </li>

I want to select the Genre item and get only its text:

Modern

The main problem is that the position of this item differs from page to page. On another page the block looks like this:

<ul>
  <li>
    <s>Title:</s>
    De Aardappeleters
  </li>
  <li>
    <s>Dimensions:</s>
    82 x 114 cm
  </li>
  <li>
    <s>Media:</s>
    canvas
  </li>
  <li>                     <!-- on another page, Genre is in the middle -->
    <s>Genre:</s>
    Modern
  </li>
  <li>
    <s>Style:</s>
    Realism
  </li>
  <li>
    <s>Date:</s>
    1885
  </li>
</ul>

This is what I have tried so far:

OriginalTagFind = layout.css('article ul li s::text').getall()
TitleOriginal = [tag.strip() for tag in OriginalTagFind if tag.startswith('Genre:')]

My idea is to locate the s element that contains "Genre:" and then print the text of its parent li, for example through its next sibling. Is that possible?


Solution

  • With a CSS selector you can use the following (:contains() is a non-standard pseudo-class that Scrapy's selectors support):

    'li:has(s):contains("Genre:")::text'

    With an XPath selector you can use:

    "//li[s[contains(text(), 'Genre')]]/text()"

    Both are demonstrated with your example below:

    In [1]: html = """<ul>
       ...:   <li>
       ...:     <s>Title:</s>
       ...:     De Aardappeleters
       ...:   </li>
       ...:   <li>
       ...:     <s>Dimensions:</s>
       ...:     82 x 114 cm
       ...:   </li>
       ...:   <li>
       ...:     <s>Media:</s>
       ...:     canvas
       ...:   </li>
       ...:   <li>
       ...:     <s>Style:</s>
       ...:     Realism
       ...:   </li>
       ...:   <li>
       ...:     <s>Date:</s>
       ...:     1885
       ...:   </li>
       ...:   <li>
       ...:     <s>Genre:</s>
       ...:     Modern
       ...:   </li>
       ...: </ul> """
    
    In [2]: import scrapy
       ...: selector = scrapy.Selector(text=html)
    
    In [3]: ''.join(selector.xpath("//li[s[contains(text(), 'Genre')]]/text()").getall()).strip()
    Out[3]: 'Modern'
    
    In [4]: ''.join(selector.css('li:has(s):contains("Genre:")::text').getall()).strip()
    Out[4]: 'Modern'