
How to populate an array with text from HTML web scraping in Ruby


I used the Nokogiri gem to scrape an HTML file for only the text in cells with the tableData class. The HTML is set up like this:

<div class="table-wrap">
   <table class="table">
     <tbody>
        <tr>
           <td class="tableData"> Jane Doe</td>
           <td class="tableData"> 01/01/2017</td>
           <td class="tableData">01/09/2017 </td>
           <td class="tableData">Vacation</td>
        </tr>
        <tr>
           <td class="tableData">John Doe</td>
           <td class="tableData"> 01/01/2017</td>
           <td class="tableData">01/09/2017 </td>
           <td class="tableData">Vacation</td>
        </tr>
     </tbody>
   </table>
</div>

and the code I used to webscrape looks like this:

vt = page.css("td[class='tableData']").text
puts vt

Which gives this output:

Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation

I want to build an array of arrays, where each inner array holds only the four text values for one person. It should look like this:

[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]

I am new to coding and am not sure how to write a for loop that iterates over either the HTML itself or the vt variable to produce an array of arrays. I know some push statements follow the for loop, but it's the structure of the loop itself that I am having trouble putting together. If you could explain in your answer how the for loop works in this situation, it would be much appreciated.


Solution

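  • This is the basic structure you need. The work is done by two nested map calls: the outer one visits each tr, and the inner one collects the text of every td in that row, so no explicit for loop or push is required.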

    html=%q(<div class="table-wrap">
       <table class="table">
         <tbody>
            <tr>
               <td class="tableData"> Jane Doe</td>
               <td class="tableData"> 01/01/2017</td>
               <td class="tableData">01/09/2017 </td>
               <td class="tableData">Vacation</td>
            </tr>
            <tr>
               <td class="tableData">John Doe</td>
               <td class="tableData"> 01/01/2017</td>
               <td class="tableData">01/09/2017 </td>
               <td class="tableData">Vacation</td>
            </tr>
         </tbody>
       </table>
    </div>)
    
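    require 'nokogiri'

    # Nokogiri::HTML uses the lenient HTML parser, which suits scraped pages.
    # Nokogiri::XML also works here because this snippet is well-formed.
    doc = Nokogiri::HTML(html)

    # Outer map: one entry per <tr>. Inner map: the text of each <td> in that row.
    array = doc.xpath('//tr').map do |tr|
      tr.xpath('td').map { |td| td.text }
    end

    p array
    # [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]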
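
Since the question asks how an explicit loop with push would look, here is a minimal sketch of that version next to the map version, with the stray whitespace stripped as well. It is not part of the original answer. The names CELLS, COLUMNS_PER_ROW, rows_with_loop and rows_with_map are made up for illustration. The cell texts are hard-coded so the sketch runs without the gem; in a real script they could come from something like page.css("td.tableData").map(&:text). It also assumes every person has exactly four cells.

```ruby
# Hypothetical stand-in for the scraped cell texts, in document order.
CELLS = [" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation",
         "John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"].freeze

# Assumption: each person occupies exactly four consecutive cells.
COLUMNS_PER_ROW = 4

# Explicit-loop version: build one inner array per person and push it
# onto the result with <<.
def rows_with_loop(cells, per_row)
  rows = []
  cells.each_slice(per_row) do |group|
    row = []
    group.each { |text| row << text.strip } # strip removes surrounding spaces
    rows << row
  end
  rows
end

# Equivalent map version: map returns the new array, so no push is needed.
def rows_with_map(cells, per_row)
  cells.each_slice(per_row).map { |group| group.map(&:strip) }
end

p rows_with_loop(CELLS, COLUMNS_PER_ROW)
# [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
```

Both methods return the same nested array. Grouping by tr, as in the answer above, is the safer choice when rows can have different numbers of cells; slicing by a fixed count only holds if the four-cell assumption does.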