
Best way to parse a file with links exported from Delicious.com using Nokogiri?


I want to parse an HTML file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:

<DT>
   <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
      ADD_DATE="1233132422"
      PRIVATE="0"
      TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
   <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" 
      ADD_DATE="1226827542" 
      PRIVATE="0" 
      TAGS="irw_20">Minority Report Interface</A>
<DT>
   <A HREF="http://www.windowshop.com/" 
      ADD_DATE="1225267658" 
      PRIVATE="0" 
      TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon

As you can see, the link information is in the DT tag, and some links have a comment in a DD tag that follows it.

I do the following to get the link information:

doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end

My question is how do I get the link information AND the comment when a dd tag is present?


Solution

  • My question is how do I get the link information AND the comment when a dd tag is present?

    Use:

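    //dt/a | //dt[a]/following-sibling::*[1][self::dd]
    

    This selects all a elements that have a dt parent, plus every dd element that is the immediately following sibling element of a dt element that has an a child. The element names are written in lowercase because Nokogiri's HTML parser lowercases tag names, which is also why the //dt//a expression in the question matches the uppercase DT and A tags in the file.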

    Note: The // abbreviation is discouraged because it makes the XPath engine search the entire document, which is often inefficient, and because it can match elements in places you did not expect.

    Whenever the structure of the document is known, spell out the path instead of using //.