
Replacing part of the text in a Nokogiri node while preserving markup in contents


I'm trying to replace instances of a unique string across a bunch of files by scanning the content of the nodes with Nokogiri and then performing a gsub. I'm keeping part of the string in place, and transforming it into an anchor tag. However, the majority of the nodes have various forms of markup in the contents, and aren't just straightforward strings. For example, let's say I have a file like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
    <head>
        <title>Title</title>
        <link href="style.css" rel="stylesheet" type="text/css" />
    </head>
    <body>
        <div>
            <p class="header">&lt;&lt;2&gt;&gt;Header</p>
            <p class="paragraph">
            <p class="text_style">Lorem ipsum blah blah blah. &lt;&lt;3&gt;&gt; Here is more content. <span class="style">Preserve this.</span> Blah blah extra text.</p>
        </div>
    </body>
</html>

There are numbers throughout the document, surrounded by &lt;&lt; and &gt;&gt;. I want to take the value of the number and transform it into a tag like this: <a id='[#]'/>, but I want to preserve the HTML markup of other elements within the same section, i.e. <span class="style">Preserve this.</span>.

Here's everything I've tried:

require 'nokogiri'

file = File.open("file.xhtml") { |f| Nokogiri::XML(f) }

file.xpath("//text()").each { |node|
    if node.text.match(/<<([^_]*)>>/)
        new_content = node.text.gsub(/<<([^_]*)>>/,"<a id=\"\\1\"/>")
        node.parent.inner_html = new_content
    end
}

The gsub works correctly, but because it uses the .text method, any markup is ignored and effectively wiped out. In this case, the <span class="style">Preserve this.</span> part is completely removed. (FYI, I use the .parent method because if I just do node.inner_html = new_content I get this error: add_child_node': cannot reparent Nokogiri::XML::Element there (ArgumentError).)

If I do this instead:

    new_content = node.text.gsub(/<<([^_]*)>>/,"<a id=\"\\1\"/>")
    node.content = new_content

the markup is escaped instead of being inserted as an element: the file ends up with &lt;a id="3"/&gt; instead of <a id="3"/>.

I tried using the CSS methods instead like so:

file.css("*").each { |node|
    if node.inner_html.match(/&lt;&lt;([^_]*)&gt;&gt;/)
        new_content = node.inner_html.gsub(/&lt;&lt;([^_]*)&gt;&gt;/,"<a id=\"\\1\"/>")
        node.inner_html = new_content
    end
}

The gsub works, the markup is preserved, and the new tags are written as real elements rather than escaped text. But the <head> and <body> tags are removed, which results in an invalid file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>Title</title>
        <link href="style.css" rel="stylesheet" type="text/css"/>
        <div>
            <p class="header"><a id="2"/>Header</p>
            <p class="paragraph">
            </p><p class="text_style">Lorem ipsum blah blah blah. <a id="3"/> Here is more content. <span class="style">Preserve this.</span> Blah blah extra text. </p>    
    </div>
</html>

I suspect it has something to do with the fact that I'm iterating over all the nodes (file.css("*")), which is also redundant, since a parent node is scanned in addition to its children.

I've scoured the web but can't find any solutions for this. I just want to be able to swap out unique text while maintaining markup and having it be correctly encoded. Is there something very obvious that I'm missing here?


Solution

  • It looks like this works pretty well:

    require 'nokogiri'
    
    doc = Nokogiri::XML(<<EOT)
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <html>
        <head>
            <title>Title</title>
            <link href="style.css" rel="stylesheet" type="text/css" />
        </head>
        <body>
            <div>
                <p class="header">&lt;&lt;2&gt;&gt;Header</p>
                <p class="paragraph">
                <p class="text_style">Lorem ipsum. &lt;&lt;3&gt;&gt; more content. <span class="style">Preserve this.</span> extra text.</p>
            </div>
        </body>
    </html>
    EOT
    
    doc.search("//text()[contains(.,'<<')]").each do |node|
      node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
    end
    

    Which results in:

    puts doc.to_html
    
    # >> <html>
    # >>     <head>
    # >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    # >>         <title>Title</title>
    # >>         <link href="style.css" rel="stylesheet" type="text/css">
    # >>     </head>
    # >>     <body>
    # >>         <div>
    # >>             <p class="header"><a id="[2]"></a>Header</p>
    # >>             <p class="paragraph">
    # >>             <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
    # >>         </p>
    # >>     </div>
    # >> </body>
    # >> </html>
    

    Nokogiri is adding the

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    

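    line, most likely because to_html writes the document out with the HTML serializer, which inserts a Content-Type meta tag. The same step explains why the XML declaration is gone and why the empty anchors appear as <a id="[2]"></a> instead of in self-closing form.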

    The selector "//text()[contains(.,'<<')]" only looks for text nodes containing '<<'. If other text in your files can contain '<<', tighten the selector to avoid false positives. See "XPath: using regex in contains function" for the syntax.

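    replace is what does the trick. You were trying to make a Nokogiri::XML::Text node contain an <a.../>, but a text node can only hold character data, so the < and > must be encoded. When replace is given a string, Nokogiri parses it as markup and swaps the text node for the resulting nodes, so <a id="[2]"/> becomes a Nokogiri::XML::Element and is stored as a real tag. Note that the replacement string keeps the square brackets from your example, so the ids come out as "[2]"; drop the brackets from the replacement if you want id="2".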