ruby nokogiri textile

How do I replace tags defining a node?


We're trying to move from a rather small bug tracking system to Redmine. There's no ready-made migration script for our old system, so we want to write one ourselves.

I suggested using Nokogiri to convert some of the formatting to the new format (Textile), but I ran into problems.

This is from a field in our old system's DB:

<ul>
    <li>list item 1</li>
    <li>list item 2</li>
</ul>

This needs to be translated into Textile, where it would look like this:

* list item 1
* list item 2

This is how far I've got parsing it with Nokogiri:

def self.handle_ul(page)
  page.css("ul").each do |ul|
    ul.css("li").each do |li|
      # Textile list items are "* " followed by the item text.
      li.inner_html = "* #{li.text}\n"
    end
  end
end

This works like a charm. However, I need to do two replacements:

<li>
</li>

tags need to be removed from the <li> object, and:

<ul>
</ul>

tags need to be removed from the <ul> object. However, I can't find the tags themselves in the node that represents them: inner_html returns only the HTML between the tags I'm looking for:

ul.inner_html

Results in:

<li>list item 1</li>
<li>list item 2</li>

Where can I find the tags I need to replace? I thought about using parent to reattach the child <li> nodes to parent.parent, but that would put them at the end of the grandparent instead of where the list was.

Can I somehow access the whole HTML representation of an object, without stripping its defining tags out, so that I can replace them?


EDIT:

As requested, here is a mockup of an old DB entry and the style it should have in Textile.

Before transformation:

Fixed for rev. 1.7.92.

<h4>Problems:</h4>
<ul>
<li>fixed.</li>
<li>fixed. New minimum 270x270</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed. Column types list is growing horizontally now.</li>
</ul>

After transformation:

Fixed for rev. 1.7.92.

h4. Problems:
* fixed.
* fixed. New minimum 270x270
* fixed.
* fixed.
* fixed.
* fixed. Column types list is growing horizontally now.

EDIT 2:

I tried to overwrite parts of the to_s method of the Nokogiri elements:

li.to_s["<li>"]=""

but that doesn't seem to be a valid lvalue (not that there is an error, it just doesn't do anything).


Solution

  • Here's the basis for such a transform:

    require 'nokogiri'
    
    doc = Nokogiri::HTML(<<EOT)
    <ul>
        <li>list item 1</li>
        <li>list item 2</li>
    </ul>
    EOT
    puts doc.to_html
    
    doc.search('ul').each do |ul|
      ul.search('li').each do |li|
        li.replace("* #{ li.text.strip }")
      end
      ul.replace(ul.text)
    end
    
    puts doc.to_html
    

    Running that outputs:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><ul>
    <li>list item 1</li>
        <li>list item 2</li>
    </ul></body></html>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>* list item 1
        * list item 2
    </body></html>
    

    I didn't try to put the first item on its own line with a leading carriage return or line feed; that's left as an exercise for the reader. Nor did I try to handle the <h4> tags or similar substitutions. You should be able to work those out from the code above.

    Also, I'm using Nokogiri::HTML to parse the HTML, which wraps it in a full HTML document with the appropriate DOCTYPE header and <html> and <body> tags. Parsing with Nokogiri::HTML::DocumentFragment.parse instead would leave that wrapper out; the transformed list text itself would be the same.