How to use Mechanize to parse local file

I'm using Ruby and Mechanize to parse a local HTML file, but I can't get it to work. It does work if I use a URL, though:

require 'mechanize'

agent = Mechanize.new
#url = ''
#page = agent.get(url) #this seems to work just fine but the following below doesn't

file = File.read('/home/user/files/sample.htm') #this is a regular html file
page = Nokogiri::HTML(file)
pp page.body #errors here

page.search('/div[@class="product_name"]').each do |node|
  text = node.text
  puts "product name: " + text.to_s
end

The error is:

/home/user/code/myapp/app/models/program.rb:35:in `main': undefined method `body' for #<Nokogiri::HTML::Document:0x000000011552b0> (NoMethodError)

How do I get a page object so that I can search on it?


  • Mechanize uses URI strings to point to what it's supposed to parse. Normally we'd use an "http" or "https" scheme to point to a web server, and that's where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.

    I have a little HTML file on my Desktop called "test.html":

    <!DOCTYPE html>
    Hello World!

    Running this code:

    require 'mechanize'
    agent = Mechanize.new
    page = agent.get('file:/Users/ttm/Desktop/test.html')
    puts page.body

    Outputs:

    <!DOCTYPE html>
    Hello World!

    This tells me Mechanize loaded the file, parsed it, and then accessed the body.
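
    As a hedged sketch of the same idea, the snippet below writes a throwaway HTML file, loads it through a "file" URI, and searches the resulting page object. The temporary file, its contents, and the variable names are assumptions made for illustration. The URI is built by plain string concatenation, which assumes the path contains no characters that need escaping.

    ```ruby
    require 'mechanize'
    require 'tempfile'

    # Hypothetical sample file. The ".html" suffix is assumed to help
    # Mechanize recognise the file as HTML and return a Mechanize::Page.
    file = Tempfile.new(['sample', '.html'])
    file.write('<!DOCTYPE html><div class="product_name">Hello World!</div>')
    file.close

    agent = Mechanize.new
    page = agent.get("file://#{file.path}")

    # A Mechanize::Page can be searched with CSS or XPath selectors.
    names = page.search('div.product_name').map(&:text)
    puts names

    file.unlink
    ```

    Unlike the bare Nokogiri document in the question, the object returned by agent.get responds to both body and search.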

    However, unless you actually need to manipulate forms and/or navigate pages, Mechanize is probably NOT what you want to use. Nokogiri, which Mechanize uses under the hood, is a better choice for parsing, extracting data, or manipulating the markup, and it's agnostic as to what scheme was used or where the file is actually located:

    require 'nokogiri'
    doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
    puts doc.to_html

    which outputs the same file after parsing it.

    Back to your question: how to find the node using only Nokogiri.

    Changing test.html to:

    <!DOCTYPE html>
    <div class="product_name">Hello World!</div>

    and running:

    require 'nokogiri'
    doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
    doc.search('div.product_name').map(&:text)
    # => ["Hello World!"]

    shows that Nokogiri found the node and returned the text.
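
    A self-contained sketch of that workflow follows. It assumes a throwaway file created with Tempfile instead of the Desktop path, and the variable names are illustrative. It also shows that Nokogiri::HTML expects markup (a string or an open IO), not a file path, and that an XPath beginning with a single slash, as in the question, only matches at the document root.

    ```ruby
    require 'nokogiri'
    require 'tempfile'

    # Hypothetical sample file standing in for test.html.
    file = Tempfile.new(['test', '.html'])
    file.write('<!DOCTYPE html><div class="product_name">Hello World!</div>')
    file.close

    # Nokogiri::HTML takes markup, so read the file (or pass an open File).
    doc = Nokogiri::HTML(File.read(file.path))

    css_names = doc.search('div.product_name').map(&:text)
    xpath_names = doc.search('//div[@class="product_name"]').map(&:text)

    # "/div" is anchored at the root, where the only element is <html>,
    # so this selector finds nothing.
    rooted = doc.search('/div[@class="product_name"]')

    puts css_names.inspect
    puts xpath_names.inspect
    puts rooted.empty?

    file.unlink
    ```

    The CSS selector and the "//div" XPath return the same node. The rooted "/div" form returns an empty set without raising an error.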

    This code in your sample could be better:

    text = node.text  
    puts "product name: " + text.to_s

    node.text returns a string:

    doc = Nokogiri::HTML('<p>hello world!</p>')
    doc.search('p').text # => "hello world!"
    doc.search('p').text.class # => String

    So text.to_s is redundant. Simply use text.
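
    Put together, the loop from the question could be written as follows. This is only a sketch: the inline markup stands in for the real file, and string interpolation is one common way to avoid both the extra variable and the to_s call.

    ```ruby
    require 'nokogiri'

    # Inline markup used as a stand-in for the parsed file.
    doc = Nokogiri::HTML('<div class="product_name">Hello World!</div>')

    lines = doc.search('div.product_name').map do |node|
      # node.text is already a String, so no explicit to_s is needed.
      "product name: #{node.text}"
    end

    puts lines
    ```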