
How to find specific substrings in XML files that aren't inside comments?


In my XML, I have comment pairs such as <!--INS--><!--/INS--> and <!--DEL--><!--/DEL-->. When I search for specific substrings, I want to ignore any matching text between such a pair.

For example, my XML file has:

<p>XXXX YYYY ZZZZ 
<!--INS-->,,<!--/INS-->
<!--DEL-->..<!--/DEL-->
AAA BBB CCC DDD..
</p>

I want to find the p elements that contain a double dot, but ignore any double dot between the "INS" or "DEL" comment markers.

I have tried this XPath:

//p[contains(.,'..') and descendant::comment()[not(contains(.,'..'))]]

but it is not working. How can I do this in XPath?


Solution

  • Your ".." and ",," are not inside comment() nodes; they are in text() nodes between the comments. So if I understand correctly, you need this (wrong assumption, see EDIT):

    //p[ends-with(normalize-space(),'..') and not(comment()[contains('INSDEL',.) and following-sibling::node()[1][self::text()[.='..']]])]
    

    This will not match your example.

    Explanation of this part:

    following-sibling::node()[1][self::text()]
    

    It selects the text() node that directly follows that comment.
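To make the node layout concrete, here is a minimal Python sketch that lists the children of the p element from the question. It uses the standard library's xml.dom.minidom rather than XPath; the helper name describe_children and the SAMPLE constant are made up for illustration.

```python
from xml.dom import Node, minidom

# The sample from the question: the ",," and ".." sit BETWEEN comments,
# not inside them.
SAMPLE = """<p>XXXX YYYY ZZZZ
<!--INS-->,,<!--/INS-->
<!--DEL-->..<!--/DEL-->
AAA BBB CCC DDD..
</p>"""


def describe_children(xml_text):
    """Return (kind, content) pairs for each child node of the root element."""
    root = minidom.parseString(xml_text).documentElement
    kinds = {Node.TEXT_NODE: "text", Node.COMMENT_NODE: "comment"}
    return [
        (kinds.get(node.nodeType, "other"), getattr(node, "data", ""))
        for node in root.childNodes
    ]


if __name__ == "__main__":
    for kind, content in describe_children(SAMPLE):
        print(kind, repr(content))
```

Each comment (INS, /INS, DEL, /DEL) is its own node, and the ",," and ".." show up as separate text nodes directly after the opening comments, which is what following-sibling::node()[1] picks out.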

    If you want only the following not to match (".." after both comments) (also a wrong assumption, see EDIT):

    <p>XXXX YYYY ZZZZ 
    <!--INS-->..<!--/INS-->
    <!--DEL-->..<!--/DEL-->
    AAA BBB CCC DDD..
    </p>
    

    You need:

    //p[ends-with(normalize-space(),'..') and
        not(comment()[.='INS' and following-sibling::node()[1][self::text()[.='..']]]
            and comment()[.='DEL' and following-sibling::node()[1][self::text()[.='..']]])]
    

    EDIT:

    The following XPath:

    //p[text()[not(preceding-sibling::node()[1][self::comment()=('INS','DEL') ] ) and contains(.,'..')]]
    

    will match this example:

      <p>Save at least 15% on local breaks, longer trips, or anything in between.. Plan your next getaway for less.
        <!--INS-->Book between Mar.. 15 - 31<!--/INS-->
        <!--DEL-->Stay between May. 15-31<!--/DEL--> Getaway Deals. </p>
    

    because the double dot is in a text() node that is not between the comments.

    But it will not match:

      <p>Save at least 15% on local breaks, longer trips, or anything in between. Plan your next getaway for less.
        <!--INS-->Book between Mar.. 15 - 31<!--/INS-->
        <!--DEL-->Stay between May. 15-31<!--/DEL--> Getaway Deals. </p>
    

    because the only double dot is in a text() node between the comments.
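The same rule can be sketched outside XPath. The Python below uses only the standard library's xml.dom.minidom and assumes each snippet is a single p element; the function name has_unmarked_substring and the two sample constants are hypothetical. It mirrors the EDIT expression: a text child counts only if it contains the substring and its immediately preceding sibling is not an INS or DEL comment.

```python
from xml.dom import Node, minidom

MATCHING = """<p>Save at least 15% on local breaks, longer trips, or anything in between.. Plan your next getaway for less.
    <!--INS-->Book between Mar.. 15 - 31<!--/INS-->
    <!--DEL-->Stay between May. 15-31<!--/DEL--> Getaway Deals. </p>"""

NOT_MATCHING = """<p>Save at least 15% on local breaks, longer trips, or anything in between. Plan your next getaway for less.
    <!--INS-->Book between Mar.. 15 - 31<!--/INS-->
    <!--DEL-->Stay between May. 15-31<!--/DEL--> Getaway Deals. </p>"""


def has_unmarked_substring(xml_text, needle="..", markers=("INS", "DEL")):
    """True if a text child of the root contains `needle` and does not
    directly follow an opening marker comment such as <!--INS-->."""
    root = minidom.parseString(xml_text).documentElement
    for node in root.childNodes:
        if node.nodeType != Node.TEXT_NODE or needle not in node.data:
            continue
        prev = node.previousSibling
        follows_marker = (
            prev is not None
            and prev.nodeType == Node.COMMENT_NODE
            and prev.data in markers
        )
        if not follows_marker:
            return True
    return False


if __name__ == "__main__":
    print(has_unmarked_substring(MATCHING))
    print(has_unmarked_substring(NOT_MATCHING))
```

Note that the comparison self::comment()=('INS','DEL') and the ends-with() function used earlier are XPath 2.0 features, so a sketch like this may be a fallback where only an XPath 1.0 engine is available.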