I know the basics of fetching a web page (I only started learning yesterday), but now I want to extract data from it. I went through many Mechanize/Nokogiri tutorials, but each one does things differently, which left me confused. I would like a direct, straightforward way to do the following:
I have this website: http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea
and I want to extract certain things in a structured way. If I inspect the page and look at the body, I see many <dd>..</dd> elements under the <dl class="dl-horizontal">. Each of them contains an <a> with an href. I would like to extract this href and the bold parts of the text, e.g. <b>green tea</b>.
I created a simple structure:
info = Struct.new(:ObjectID, :SourceID)
so that, for each of these <dd> elements, the bold text goes into the object ID and the href goes into the source ID.
This is the code I have so far. It only retrieves the page and does not extract anything yet:
require 'mechanize'
agent = Mechanize.new { |agent| agent.user_agent_alias = "Windows Chrome" }
html = agent.get('http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea').body
html_doc = Nokogiri::HTML(html)
I am also unsure whether to use Nokogiri directly or through Mechanize. Mechanize does not seem to have much documentation, so I was thinking of using Nokogiri on its own.
For now I would like to know how to loop through these <dd> elements and extract the info.
Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:
require 'nokogiri'
require 'open-uri'
url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
doc.xpath('//dd/*/a').each do |a|
  text = a.xpath('.//b').map {|b| b.text.gsub(/\s+/, ' ').strip}
  href = a['href']
  puts "OK: text=#{text.inspect}, href=#{href.inspect}"
end
# OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"
In a nutshell, this solution uses XPath in two places:
1. To find every a element underneath each dd element.
2. To find every b element inside of the a elements from #1 above.
The final trick is cleaning up the text within the b elements into something presentable. Of course, you might want it to look different.
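To tie this back to the Struct from the question, here is a rough sketch of how the values pulled out in the loop could be collected instead of printed. It is plain Ruby and makes no network request: the rows array is a hard-coded stand-in for what the loop yields, with hrefs and texts copied from the sample output above and some messy whitespace that I made up to show the cleanup. The names Info, clean, rows and infos are my own placeholders, not anything from Nokogiri or Mechanize.

```ruby
# Struct modelled on the one in the question (the constant name Info is my choice).
Info = Struct.new(:ObjectID, :SourceID)

# Collapse any run of whitespace (including newlines) into a single space
# and trim the ends; same cleanup as in the loop above.
def clean(text)
  text.gsub(/\s+/, ' ').strip
end

# Hypothetical stand-in for the loop's data: [raw <b> texts, href] per anchor.
# The untidy whitespace in the first row is invented for illustration.
rows = [
  [["Green  tea", " many\n antioxidants "], "http://www.talbottteas.com/category_s/55.htm"],
  [["Green tea", "potent antioxidants"], "http://www.specialitybrand.com/news/view/207.html"]
]

# Build one Info per anchor: cleaned bold texts as ObjectID, href as SourceID.
infos = rows.map do |bold_texts, href|
  Info.new(bold_texts.map { |t| clean(t) }, href)
end

infos.each do |info|
  puts "#{info[:ObjectID].join(' | ')} -> #{info[:SourceID]}"
end
```

In the real script you would presumably build each Info inside the each block from text and href rather than from a fixed array. Whether ObjectID should hold all bold strings or just one of them depends on what you need, so treat the array here as one possible choice.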