We're trying to move from a rather small bug tracking system to Redmine. There's no ready-made migration script for our old system, so we want to write one ourselves.
I suggested using Nokogiri to convert some of the formatting to the new format (Textile), but I ran into problems.
This is what a field in our old system's database contains:
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
This needs to be translated into Textile, and it would look like this:
* list item 1
* list item 2
Here is how far I've gotten parsing it with Nokogiri:
def self.handle_ul(page)
  page.css("ul").each do |ul|
    ul.css("li").each do |li|
      # Prefix each item with the Textile bullet marker ("* ")
      li.inner_html = "* #{li.text}\n"
    end
  end
end
This works like a charm. However, I need to do two replacements: the <li> and </li> tags need to be removed from the <li> object, and the <ul> and </ul> tags need to be removed from the <ul> object. I cannot seem to find the actual tags in the object representing the element, though. inner_html returns only the HTML between the tags I'm looking for:
ul.inner_html
Results in:
<li>list item 1</li>
<li>list item 2</li>
Where can I find the tags I need to replace? I thought about using parent and reassociating the child <li> tags with parent.parent, but that would put them at the end of the grandparent.
Can I somehow access the whole HTML representation of an object, without stripping its defining tags out, so that I can replace them?
EDIT:
As requested, here is a mock-up of an old DB entry and the form it should take in Textile.
Before transformation:
Fixed for rev. 1.7.92.
<h4>Problems:</h4>
<ul>
<li>fixed.</li>
<li>fixed. New minimum 270x270</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed. Column types list is growing horizontally now.</li>
</ul>
After transformation:
Fixed for rev. 1.7.92.
h4. Problems:
* fixed.
* fixed. New minimum 270x270
* fixed.
* fixed.
* fixed.
* fixed. Column types list is growing horizontally now.
EDIT 2:
I tried to overwrite parts of the result of the Nokogiri elements' to_s method:
li.to_s["<li>"] = ""
but that doesn't seem to be a valid lvalue (there is no error, it just doesn't do anything).
Here's the basis for such a transform:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
EOT
puts doc.to_html
doc.search('ul').each do |ul|
  ul.search('li').each do |li|
    li.replace("* #{ li.text.strip }")
  end
  ul.replace(ul.text)
end
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><ul>
<li>list item 1</li>
<li>list item 2</li>
</ul></body></html>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>* list item 1
* list item 2
</body></html>
I didn't intend, or attempt, to put a leading carriage return or line feed before the first item; that's left as an exercise for the reader. Nor did I try to handle the <h4> tags or similar substitutions, but from the code above you should be able to figure out how to do it.
Also, I'm using Nokogiri::HTML to parse the HTML, which wraps it in a full HTML document with the appropriate DOCTYPE header and <html> and <body> tags. Using Nokogiri::HTML::DocumentFragment.parse instead would avoid that wrapper, but wouldn't otherwise make a difference to the converted list.