How do I replace tags defining a node?

We're trying to move from a rather small bug tracking system to Redmine. There is no ready-made migration script for our old system, so we want to write one ourselves.

I suggested using Nokogiri to convert some of the formatting to the new format (Textile); however, I ran into problems.

This is from a field in our old system's DB:

    <ul>
        <li>list item 1</li>
        <li>list item 2</li>
    </ul>

This needs to be translated into Textile, where it would look like this:

    * list item 1
    * list item 2

I've started parsing with Nokogiri, and this is what I have so far:

    def self.handle_ul(page)
      uls = page.css("ul")
      uls.each { |ul|
        lis = ul.css("li")
        lis.each { |li|
          # Prefix the item text with the Textile bullet marker
          li.inner_html = "* #{li.text}\n"
        }
      }
    end

This works like a charm. However, I need to do two more replacements: the <li> and </li> tags need to be removed from the <li> object, and the <ul> and </ul> tags need to be removed from the <ul> object. However, I cannot seem to find the actual tags in the object representing the element. inner_html returns only the HTML between the tags I'm looking for. For the <ul> node, it results in:

    <li>list item 1</li>
    <li>list item 2</li>

Where can I find the tags I need to replace? I thought about using parent and reassociating the child <li> tags with parent.parent, but that would put them at the end of the grandparent.

Can I somehow access the whole HTML representation of an object, without stripping its defining tags out, so that I can replace them?


As requested, here is a mockup of an old DB entry and the style it should have in Textile.

Before transformation:

    Fixed for rev. 1.7.92.

    <ul>
    <li>fixed.</li>
    <li>fixed. New minimum 270x270</li>
    <li>fixed.</li>
    <li>fixed.</li>
    <li>fixed.</li>
    <li>fixed. Column types list is growing horizontally now.</li>
    </ul>

After transformation:

    Fixed for rev. 1.7.92.

    * fixed.
    * fixed. New minimum 270x270
    * fixed.
    * fixed.
    * fixed.
    * fixed. Column types list is growing horizontally now.


I tried to overwrite parts of the to_s output of the Nokogiri elements, but that doesn't seem to be a valid lvalue (there is no error; it just doesn't do anything).


  • Here's the basis for such a transform:

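    require 'nokogiri'

    doc = Nokogiri::HTML(<<EOT)
    <ul>
        <li>list item 1</li>
        <li>list item 2</li>
    </ul>
    EOT

    puts doc.to_html

    doc.css('ul').each do |ul|
      ul.css('li').each do |li|
        li.replace("* #{ li.text.strip }")
      end
      ul.replace(ul.children)  # unwrap: keep the children, drop the <ul> tags
    end

    puts doc.to_html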

    Running that outputs:

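    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><ul>
    <li>list item 1</li>
        <li>list item 2</li>
    </ul></body></html>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>* list item 1
        * list item 2
    </body></html>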

    I didn't intend, or attempt, to make the first "item" have a leading carriage-return or line-feed. That's left as an exercise for the reader. Nor did I try to handle the <h4> tags or similar substitutions. From the answer code you should be able to figure out how to do it.

    Also, I'm using Nokogiri::HTML to parse the HTML, which wraps it in a full HTML document with a DOCTYPE header and <html> and <body> tags. Nokogiri::HTML::DocumentFragment.parse could be used instead to avoid that wrapping, but it wouldn't really change how the transform works.