Tags: xpath, web-scraping, scrapy, web-crawler, screen-scraping

Writing XPath for selecting the description


I want to extract the description from an HTML page.

My div contains the following data:

  <div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>
 <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
     <li> Ownership and oversight of full-cycle accounts payable responsibilities including but not limited to, invoice processing, maintaining vendor records, running payment reports according to payment schedules, reconciling vendor statements)</li>
     <li> Identify and implement process improvements and automation in appropriate areas throughout the AP cycle</li>
     <li> Provide excellent customer service to vendors and employees by researching and resolving inquiries in a timely manner</li>
     <li> Assist with month-end activities, accruals, reconciliation, preparing 1099s, and audit support</li>
   <li> Assist with ad-hoc requests</li>
  </ul>
 <p>
    <strong>Qualifications</strong>
 </p>
  <ul>
     <li> AA/AS degree or equivalent experience in accounting</li>
     <li> Three years or more of related experience</li>
     <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

Here I need only the data in the <p> tags. I don't want the Responsibilities and Qualifications sections, i.e.:

<p>Responsibilities</p><ul> ... </ul>
<p>Qualifications</p><ul> ... </ul>

These are not needed and should be excluded by the XPath.

I am using the code below:

sel.xpath(
        'description',
        '//div[@class="container page_op-detail"][not(descendant-or-self::p/strong[contains(text(), "Qualifications")]/../ul[1])]'
    ).extract()

This is not working. Please help me write an XPath that excludes these items. How should I write the XPath for this kind of query?

Expected output:

<div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>

  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

Solution

  • Assuming that the form and span tags are empty elements, you can try this XPath:

    //div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities'])
                                             and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
                                             and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
                                             and not(self::p[normalize-space(.)='Qualifications'])]
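
To check the expression outside a spider, here is a minimal sketch using lxml. It is not the asker's actual code: the PAGE string is a trimmed, hypothetical copy of the markup from the question, with the form and span written as closed, empty elements (the same assumption as above), and the names XPATH, kept_children and summary are made up for the illustration. The expression is anchored with // because in this sample the div is not the document root.

```python
from lxml import html

# Trimmed, hypothetical copy of the page from the question.
# Assumption: <form> and <span> are closed, empty elements.
PAGE = """
<html><body>
<div class="container page_op-detail">
  <form id="f"></form>
  <span id="ajax-view-state-page-container" style="display: none"></span>
  <p> Solving the world's hardest problems is no easy task.</p>
  <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
    <li> Assist with ad-hoc requests</li>
  </ul>
  <p>
    <strong>Qualifications</strong>
  </p>
  <ul>
    <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
    Data is more organised...!!!
  </p>
  <p class="type-centered apply-button"></p>
</div>
</body></html>
"""

# Select every direct child of the div except the two heading
# paragraphs and any <ul> that comes after one of them.
XPATH = (
    "//div[@class='container page_op-detail']/*["
    "not(self::p[normalize-space(.)='Responsibilities'])"
    " and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])"
    " and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])"
    " and not(self::p[normalize-space(.)='Qualifications'])]"
)


def kept_children(page):
    """Parse the page and return the child elements that survive the filter."""
    tree = html.fromstring(page)
    return tree.xpath(XPATH)


# Pair each kept element's tag with its whitespace-collapsed text.
summary = [
    (el.tag, " ".join(el.text_content().split()))
    for el in kept_children(PAGE)
]
for tag, text in summary:
    print(tag, repr(text))
```

With this sample the children that remain are the form, the span, the introductory p and the two type-centered p elements; both heading paragraphs and both lists are dropped. Note that the filter removes every ul that follows one of the headings, so a later, unrelated list inside the same div would be dropped as well. In a Scrapy callback the same string should work with response.xpath(XPATH).extract(), which returns the matched elements as HTML strings rather than the enclosing div.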