ruby, web-scraping, nokogiri

How to scrape the text of <li> and children


I am trying to scrape the text of <li> tags and of the elements nested within them.

The HTML looks like:

 <div class="insurancesAccepted">
   <h4>What insurance does he accept?*</h4>
   <ul class="noBottomMargin">
      <li class="first"><span>Aetna</span></li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">AvMed</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Open Access</li>
         </ul>
      </li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Blue Cross Blue Shield</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Blue Card PPO</li>
         </ul>
      </li>
      <li>
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Cigna</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="first">Cigna HMO</li>
            <li>Cigna PPO</li>
            <li class="last">Great West Healthcare-Cigna PPO</li>
         </ul>
      </li>
      <li class="last">
         <a title="See accepted plans" class="insurancePlanToggle arrowUp">Empire Blue Cross Blue Shield</a>
         <ul style="display: block;" class="insurancePlanList">
            <li class="last first">Empire Blue Cross Blue Shield HMO</li>
         </ul>
      </li>
   </ul>
  </div>

The main problem appears when I try to get the content with:

doc.css('.insurancesAccepted li').text.strip

This returns the text of every <li> at once, as a single string. I want each insurer scraped together with its plans (for example "AvMed" with "Open Access"), keeping the relationship between them so that I can insert the rows into my MySQL table with a reference.


Solution

  • The problem is that doc.css('.insurancesAccepted li') matches every <li> under the div, including the nested ones, not only the top-level items. To match only direct children, use the parent > child CSS combinator. Then iterate over the top-level items and assemble the result for each one:

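    require 'nokogiri'

    # html is assumed to hold the page source shown in the question.
    doc = Nokogiri::HTML(html)
    # Select only the top-level <li> elements of the outer list.
    doc.css('div.insurancesAccepted > ul > li').each do |li|
      chapter = li.css('span').text.strip  # insurer shown as a plain <span>, no plan list
      section = li.css('a').text.strip     # insurer shown as a toggle link
      subsections = li.css('ul > li').map { |plan| plan.text.strip }  # nested plan names

      puts "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]"
      puts '=' * 40
    end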
    

    This prints:

    # Aetna ⇒ [  ⇒ [  ] ]
    # ========================================
    #  ⇒ [ AvMed ⇒ [ Open Access ] ]
    # ========================================
    #  ⇒ [ Blue Cross Blue Shield ⇒ [ Blue Card PPO ] ]
    # ========================================
    #  ⇒ [ Cigna ⇒ [ Cigna HMO, Cigna PPO, Great West Healthcare-Cigna PPO ] ]
    # ========================================
    #  ⇒ [ Empire Blue Cross Blue Shield ⇒ [ Empire Blue Cross Blue Shield HMO ] ]
    # ========================================