Watir scraping sequential elements: so simple, but no


This should be simple. I want to scrape a web page like the following with Watir (a Ruby gem):

<div class="Time">time1</div> 
<div class="Locus">locus1</div>
<div class="Locus">locus2</div>
<div class="Time">time2</div>
<div class="Locus">locus3</div>
<div class="Time">time3</div>
<div class="Locus">locus4</div>
<div class="Locus">locus5</div>
<div class="Locus">locus6</div>
<div class="Time">time4</div>
etc.

The result should be an array like this:

time1 locus1
time1 locus2
time2 locus3
time3 locus4
time3 locus5
time3 locus6
time4 xxx

All the divs are at the same level (not nested). I could not find a solution using the Watir methods. Thanks for your help.


Solution

  • For each Locus element, you can retrieve the preceding Time element via the #preceding_sibling method:

    result = browser.divs(class: 'Locus').map do |div|
      time = div.preceding_sibling(class: 'Time').text
      locus = div.text
      "#{time} #{locus}"
    end
    p result
    #=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
    

    Note that if the list is long, you may want to retrieve the HTML via Watir but then do the parsing in Nokogiri. This would save a lot of execution time, but at the cost of readability.

    require 'nokogiri'

    doc = Nokogiri::HTML.parse(browser.html) # where `browser` is the usual Watir::Browser
    result = doc.css('.Locus').map do |div|
      # [1] picks the nearest preceding Time sibling; without it, the first Time in the document is returned
      time = div.at_xpath('./preceding-sibling::div[@class="Time"][1]').text
      locus = div.text
      "#{time} #{locus}"
    end
    p result
    #=> ["time1 locus1", "time1 locus2", "time2 locus3", "time3 locus4", "time3 locus5", "time3 locus6"]
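
Because the divs are flat siblings, the pairing can also be done in a single pass that remembers the most recent Time value. The sketch below is only an illustration of that idea, not part of the answer above: it assumes the class and text of each div have already been collected into `[css_class, text]` pairs in document order (with Watir or Nokogiri), and the names `ROWS` and `pair_times_with_loci` are hypothetical.

```ruby
# Hypothetical input: [css_class, text] pairs in document order,
# standing in for the divs collected from the page.
ROWS = [
  ['Time',  'time1'],
  ['Locus', 'locus1'],
  ['Locus', 'locus2'],
  ['Time',  'time2'],
  ['Locus', 'locus3'],
  ['Time',  'time3'],
  ['Locus', 'locus4'],
  ['Locus', 'locus5'],
  ['Locus', 'locus6'],
  ['Time',  'time4']
].freeze

# Walk the rows once, remembering the most recent Time text and
# emitting "time locus" for every Locus row that follows it.
def pair_times_with_loci(rows)
  current_time = nil
  rows.each_with_object([]) do |(css_class, text), result|
    case css_class
    when 'Time'  then current_time = text
    when 'Locus' then result << "#{current_time} #{text}"
    end
  end
end

pairs = pair_times_with_loci(ROWS)
p pairs
```

A Time with no Locus after it (such as `time4` here) produces no entry, and a Locus that appears before any Time would be paired with an empty time, so both cases may need explicit handling depending on the page.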