python html web-scraping scrapy

Step up or down one div from located tag that contains a specific value using Scrapy


I need to retrieve the price text from inside the custom-control / label / font elements. The only way to identify which item a price belongs to is the data-number attribute, e.g. data-number="025.00286R". The letter suffix at the end is the only thing that differentiates the custom-control divs.

<div class="custom-control custom-checkbox">
   <input type="checkbox" class="custom-control-input" data-number="025.00286R" name="itemSelected[]" value="7684cd019b98489eb330010000039848" id="checkbox-7684cd019b98489eb330010000039848">
   <label class="custom-control-label" for="checkbox-7684cd019b98489eb330010000039848">
      <meta itemprop="price" content="676.0512">
      <font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
      €676.05
      </font></font>
   </label>
</div>

I use this code to collect all the data-number values on the page:

box_contents = response.css('div[class*="mad-article-list-box"]').re(r"[0-9]+\.\d+[0-9][A-Z]+")
box_contents = list(dict.fromkeys(box_contents))

This gives the box contents as a list (for each number in the list there is an identical custom-control block):

['025.00286GA', '025.00286GV', '025.00286NWA', '025.00286NW', '025.00286NWV', '025.00286R']

The problem now is that the <input type="checkbox"> element has no children, and I need the nested text of the element that follows it: <label class="custom-control-label">.

I can locate the <input with:

response.xpath('//input[contains(@data-number, "' + box_contents[0] + '")]')

However, after locating the <input type="checkbox">, I need to either step up one level in the XPath (to the parent div) or step down to the label below it. From there it is easy to extract the nested text and the value I am looking for, €676.05. How would I go about doing this? Is there a better way to accomplish this?


Solution

  • Rather than collecting all the data-number values first and then searching for each one, you could iterate over the div elements with the custom-control classes and extract the checkbox and the label from each one in turn. Each data-number is then already paired with its price, and because every lookup starts from the element that is the parent of both, the relative path to each value is more straightforward.

    For example:

    html = """
    <div class="custom-control custom-checkbox">
       <input type="checkbox" class="custom-control-input" data-number="025.00286R" name="itemSelected[]" value="7684cd019b98489eb330010000039848" id="checkbox-7684cd019b98489eb330010000039848">
       <label class="custom-control-label" for="checkbox-7684cd019b98489eb330010000039848">
          <meta itemprop="price" content="676.0512">
          <font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
          €676.05
          </font></font>
       </label>
    </div>
    """
    
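    import parsel

    selector = parsel.Selector(text=html)

    # Iterate over each container so the checkbox and its price stay paired.
    for control in selector.xpath("//div[@class='custom-control custom-checkbox']"):
        # Both lookups are relative to the current container (note the leading dot).
        data_number = control.xpath("./input/@data-number").get()
        price = control.xpath(".//meta/@content").get()
        print({"data_number": data_number, "price": price})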
    

    Output

    {'data_number': '025.00286R', 'price': '676.0512'}