Search code examples
rubyscreen-scrapingnokogiriwatirpage-object-gem

Scraping Table with Nokogiri and need JSON output


So, I have a table with multiple rows and columns.

<table>
  <tr>
    <th>Employee Name</th>
    <th>Reg Hours</th>
    <th>OT Hours</th>
  </tr>
  <tr>
    <td>Employee 1</td>
    <td>10</td>
    <td>20</td>
  </tr>
  <tr>
    <td>Employee 2</td>
    <td>5</td>
    <td>10</td>
  </tr>
</table>

There is also another table:

<table>
  <tr>
    <th>Employee Name</th>
    <th>Revenue</th>
  </tr>
    <td>Employee 2</td>
    <td>$10</td>
  </tr>
  <tr>
    <td>Employee 1</td>
    <td>$50</td>
  </tr>
</table>

Notice that the employee order may be random between the tables.

How can I use nokogiri to create a json file that has each employee as an object, with their total hours and revenue?

Currently, I'm able to just get the individual table cells with some xpath. For example:

puts page.xpath(".//*[@id='UC255_tblSummary']/tbody/tr[2]/td[1]/text()").inner_text

Edit:

Using the page-object gem and the link from @Dave_McNulla, I tried this piece of code just to see what I get:

class MyPage
  include PageObject

  table(:report, :id => 'UC255_tblSummary')

  def get_some_information
    report_element[1][2].text
  end
end

puts get_some_information

Nothing's being returned, however.

Data: https://gist.github.com/anonymous/d8cc0524160d7d03d37b

There's a duplicate of the hours table. The first one is fine. The other table needed is the accessory revenue table. (I'll also need the activations table, but I'll try to merge that from the code that merges the hours and accessory revenue tables.


Solution

  • I think the general approach is:

    1. Create a hash for each table where the key is the employee
    2. Merge the results from both tables together
    3. Convert to JSON

    Create a hash for each table where the key is the employee

    This part you can do in Watir or Nokogiri. It only makes sense to use Nokogiri if Watir is giving poor performance due large tables.

    Watir:

    #I assume you would have a better way to identify the tables than by index
    hours_table = browser.table(:index, 0)
    wage_table = browser.table(:index, 1)
    
    #Turn the tables into a hash
    employee_hours = {}
    hours_table.trs.drop(1).each do |tr| 
        tds = tr.tds
        employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}     
    end
    #=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
    
    employee_wage = {}
    wage_table.trs.drop(1).each do |tr| 
        tds = tr.tds
        employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}   
    end
    #=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
    

    Nokogiri:

    page = Nokogiri::HTML.parse(browser.html)
    
    hours_table = page.search('table')[0]
    wage_table = page.search('table')[1]
    
    employee_hours = {}
    hours_table.search('tr').drop(1).each do |tr| 
        tds = tr.search('td')
        employee_hours[ tds[0].text ] = {"Reg Hours" => tds[1].text, "OT Hours" => tds[2].text}     
    end
    #=> {"Employee 1"=>{"Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Reg Hours"=>"5", "OT Hours"=>"10"}}
    
    employee_wage = {}
    wage_table.search('tr').drop(1).each do |tr| 
        tds = tr.search('td')
        employee_wage[ tds[0].text ] = {"Revenue" => tds[1].text}   
    end
    #=> {"Employee 2"=>{"Revenue"=>"$10"}, "Employee 1"=>{"Revenue"=>"$50"}}
    

    Merge the results from both tables together

    You want to merge the two hashes together so that for a specific employee, the hash will include their hours as well as their revenue.

    employee = employee_hours.merge(employee_wage){ |key, old, new| new.merge(old) }
    #=> {"Employee 1"=>{"Revenue"=>"$50", "Reg Hours"=>"10", "OT Hours"=>"20"}, "Employee 2"=>{"Revenue"=>"$10", "Reg Hours"=>"5", "OT Hours"=>"10"}}
    

    Convert to JSON

    Based on this previous question, you can then convert the hash to json.

    require 'json'
    employee.to_json