Tags: html, ruby, nokogiri, mechanize

Loop over all the <dd> tags and extract specific information via Mechanize/Nokogiri


I know the basics of accessing a website (I only started learning yesterday), but now I want to extract data from it. I went through many Mechanize/Nokogiri tutorials, but each one did things differently, which confused me. I want a direct, clear way to do this:

I have this website: http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea

and I want to extract certain things in a structured way. If I inspect the page and look at the body, I see many <dd>..</dd> elements under <dl class="dl-horizontal">. Each of them contains an <a> element with an href. I would like to extract this href and the bold parts of the text, e.g. <b>green tea</b>.

I created a simple structure:

info = Struct.new(:ObjectID, :SourceID)

For each of these <dd> elements, the bold text should go into ObjectID and the href into SourceID.

This is the start of the code I have; it only retrieves the page and does no extraction yet:

require 'mechanize'

agent = Mechanize.new { |agent| agent.user_agent_alias = "Windows Chrome" }
html = agent.get('http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea').body
html_doc = Nokogiri::HTML(html)

The other thing is that I am unsure whether to use Nokogiri directly or through Mechanize. Mechanize does not seem to have much documentation, so I was thinking of using Nokogiri separately.

For now I would like to know how to loop through these and extract the info.


Solution

  • Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:

    require 'nokogiri'
    require 'open-uri'
    
    url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open(url))
    
    doc.xpath('//dd/*/a').each do |a|
      text = a.xpath('.//b').map {|b| b.text.gsub(/\s+/, ' ').strip}
      href = a['href']
      puts "OK: text=#{text.inspect}, href=#{href.inspect}"
    end
    
    # OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
    # OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
    # OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"
    

    In a nutshell, this solution uses XPath in two places:

    1. First, to find every a element that is a grandchild of a dd element (the //dd/*/a expression).
    2. Then, to find each b element inside each of the a elements found in step 1 (the .//b expression).

    The final step is normalizing the whitespace in the text of the b elements so that it is presentable. Adjust that step if you want the text formatted differently.