How do I scrape data through Mechanize and Nokogiri?


I am working on an application which gets the HTML from http://www.screener.in/.

I can enter a company name like "Atul Auto Ltd" and submit it and, from the next page, scrape the following details: "CMP/BV" and "CMP".

I am using this code:

require 'mechanize'
require 'rubygems'
require 'nokogiri'

Company_name='Atul Auto Ltd.'
agent = Mechanize.new
page = agent.get('http://www.screener.in/')
form = agent.page.forms[0]
print agent.page.forms[0].fields
agent.page.forms[0]["q"]=Company_name
button = agent.page.forms[0].button_with(:value => "Search Company")
pages=agent.submit(form, button)
puts pages.at('.//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]')
# not getting any output.

The code takes me to the right page, but I don't know how to query it to get the required data.

I tried different things but was unsuccessful.

If possible, can someone point me towards a good tutorial that explains how to scrape a particular class from an HTML page? The XPath of the first "CMP/BV" is:

//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]

but it is not giving any output.


Solution

  • Using Nokogiri, I would do it as follows:

    Using CSS Selectors

    require 'nokogiri'
    require 'open-uri'
    
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))
    
    doc.class
    # => Nokogiri::HTML::Document
    doc.css('.table.draggable.table-striped.table-hover tr.strong td').class
    # => Nokogiri::XML::NodeSet
    
    row_data = doc.css('.table.draggable.table-striped.table-hover tr.strong td').map do |tdata|
      tdata.text
    end
    
    # From the webpage, I took the values below from the table
    # "Peer Comparison: Top 7 companies in the same business".
    
    row_data
    # => ["6.",
    #     "Atul Auto Ltd.",
    #     "193.45",
    #     "8.36",
    #     "216.66",
    #     "3.04",
    #     "7.56",
    #     "81.73",
    #     "96.91",
    #     "17.24",
    #     "2.92"]
    

    Looking at the table on the webpage, I can see that CMP/BV and CMP are the eleventh and third columns respectively. Now I can get the data from the array row_data: CMP is at index 2 and CMP/BV is the last value of the array.

    row_data[2] # => "193.45" #CMP
    row_data.last # => "2.92" #CMP/BV
    

    Using XPath

    require 'nokogiri'
    require 'open-uri'
    
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))
    
    p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[3]").text
    p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[11]").text
    # >> "193.45" #CMP
    # >> "2.92"   #CMP/BV