I have used the nokogiri ruby gem to webscrape an html file for only the text under the tableData class. The html code is setup like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop to iterate over either the html code itself or the vt variable to produce an array of arrays. I know there are some push statements involved following the for loop but its the actual structure of the for loop that I am having trouble putting together. If you could provide some explanation in your answer for how the for loop works in this situation it would be much appreciated.
This is the basic structure you need. map
is needed :
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]