
Java regex to match all html elements except one special case


I have a string with some markup which looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I'm trying to strip away everything except the anchor elements with "entry://id=" inside. Thus the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

The closest I've come to writing this match so far is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (apart from the "why don't you use a parser" :) would be greatly appreciated!


Solution

  • Using this pattern:

    ((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)
    

    and passing it to replaceAll with "$2" as the replacement works for your example. The test below demonstrates it:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;


    public class TestStack1305864 {

        @Test
        public void matcherWithCdataAndComments(){
            String s="The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
            // Only the markup is removed: the spaces on both sides of the CDATA section remain,
            // so the expected string has two spaces after "lazy".
            String r="The quick brown fox jumped over the lazy  <a href=\"entry://id=6000009\">dog</a> .";
            // Group 2 captures the anchors to keep; every other alternative matches markup to drop.
            String pattern="((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)";
            Pattern p = Pattern.compile(pattern);
            Matcher m = p.matcher(s);

            // "$2" is empty whenever group 2 did not take part in the match.
            String t = m.replaceAll("$2");
            assertEquals(r, t);
        }
    }
    

    Regex has no "everything except" operator (the ! in your attempt is matched as a literal character), so the idea is to capture the elements you want to keep in a specific group and insert them back during the replacement. Here that group is group 2, because group 1 wraps the whole alternation.
    For every element that is not one of the interesting ones, group 2 is empty and the element is replaced with "".
    For the interesting elements, group 2 is not empty and its content is written back to the result String.
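To make the "empty group versus non-empty group" behaviour explicit, here is a minimal sketch that spells out by hand roughly what replaceAll("$2") does, applied to the string from the question. The class name KeepEntryLinks and the method name strip are made up for this illustration, and the redundant outer group is dropped, so the kept anchor is group 1 here rather than group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class KeepEntryLinks {

    // Group 1 captures the anchors to keep; the other alternatives match markup to drop.
    private static final Pattern MARKUP = Pattern.compile(
            "(<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>");

    static String strip(String input) {
        Matcher m = MARKUP.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group(1) is null when one of the "drop" alternatives matched.
            String kept = m.group(1);
            // quoteReplacement keeps any '$' or '\' in the kept text literal.
            m.appendReplacement(out, kept == null ? "" : Matcher.quoteReplacement(kept));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                 + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

Note that only the markup is removed: the space that preceded the img element stays, so the result ends in "</a> ." rather than "</a>." as in the desired output. If that matters, the whitespace would need a separate clean-up step.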

    Edit: the pattern now handles < or > nested inside CDATA sections and comments.
    Edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern, designed to make regexes more readable and easier to maintain.
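As a rough sketch of how that composition idea could look for the pattern above, each alternative can be given its own named constant and the final expression assembled from them. The class and constant names are assumptions chosen for this illustration; they do not come from the linked article or the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical names throughout; a sketch of composing the regex from labelled parts.
final class ComposedMarkupRegex {

    // The only element we want to keep.
    static final String ENTRY_LINK = "<a href=\"entry://id=\\d+\">.*?</a>";
    // Constructs that may contain '<' or '>' and must be consumed whole.
    static final String CDATA = "<!\\[CDATA\\[.*?\\]\\]>";
    static final String COMMENT = "<!--.*?-->";
    // Any remaining tag.
    static final String ANY_TAG = "<.*?>";

    // Order matters: the kept link and the CDATA/comment forms must be tried before ANY_TAG.
    static final Pattern MARKUP = Pattern.compile(
            "(" + ENTRY_LINK + ")|" + CDATA + "|" + COMMENT + "|" + ANY_TAG);

    static String strip(String input) {
        // Group 1 is the kept link; it is empty for every other alternative.
        return MARKUP.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("a <!-- <b> --> c <a href=\"entry://id=1\">x</a>"));
    }
}
```

Naming the parts makes the ordering constraint visible: if ANY_TAG were listed first, it would match the opening tag of an entry link before ENTRY_LINK had a chance to capture the whole element.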