I want to parse an HTML file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:
<DT>
<A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
ADD_DATE="1233132422"
PRIVATE="0"
TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
<A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html"
ADD_DATE="1226827542"
PRIVATE="0"
TAGS="irw_20">Minority Report Interface</A>
<DT>
<A HREF="http://www.windowshop.com/"
ADD_DATE="1225267658"
PRIVATE="0"
TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
As you can see, the link information is in the DT tag, and some links have a comment in a DD tag.
I do the following to get the link information:
doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end
My question is how do I get the link information AND the comment when a dd tag is present?
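Use:
//dt/a | //dt[a]/following-sibling::*[1][self::dd]
This selects all a elements that have a dt parent, and all dd elements that are the immediate following sibling of a dt element that has an a child. The element names are written in lowercase because XPath is case-sensitive and Nokogiri's HTML parser lowercases tag names.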
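Note: Using the // abbreviation is discouraged. It forces a search of the whole subtree, which is often inefficient, and its behavior can be surprising: for example, //a[1] selects every a that is the first a child of its parent, not the first a in the document. Whenever the structure of the document is known, write the explicit path instead of using the // abbreviation.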