I want to find XML tags of type x in a text that contain only spaces, including tags with attributes.
Something like this:
<x> </x>
<x a="v"> </x>
I use the following regular expression in combination with the Matcher find function:
<x.*?> +</x>
I get matches that I don't expect. See the following test case:
@Test
public void sample() throws Exception
{
    String text = "Lorem <x>ipsum <x>dolor sit amet</x> </x>";
    String regex = "<x.*?> +</x>";
    Matcher matcher = Pattern.compile(regex).matcher(text);
    assertFalse(matcher.find());
}
The test fails. Instead, the following assertions pass:
assertTrue(matcher.find());
assertEquals("<x>ipsum <x>dolor sit amet</x> </x>", matcher.group());
Does the find function not support the non-greedy operator or what goes wrong here?
PS: I know that there is a plethora of different ways to process XML data, but that is not the point here.
The .*? quantifier means that it will consume as few characters as possible while still allowing the overall match to succeed. It doesn't mean that it will stop at the first > it finds. So in your example, the <x.*?> part will match all of:
<x>ipsum <x>dolor sit amet</x>
Here, all the characters between the first x and the final > are consumed by the .*? part, because that is the first > followed by spaces and </x>. To fix this, you can simply change your pattern to:
<x[^>]*> +</x>
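To see the difference, here is a minimal sketch that runs both patterns against the text from the question. The class name LazyQuantifierDemo and the helper firstMatch are made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyQuantifierDemo {

    // Hypothetical helper: returns the first substring of text matched by regex,
    // or null if find() reports no match.
    public static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String nested = "Lorem <x>ipsum <x>dolor sit amet</x> </x>";

        // Lazy dot: '.' may also match '>', so the engine keeps expanding .*?
        // until it reaches a '>' that is followed by spaces and "</x>".
        System.out.println(firstMatch("<x.*?> +</x>", nested));

        // Negated class: [^>]* cannot cross a '>', so the opening tag must end
        // at its own '>'. Neither <x> here is followed by only spaces.
        System.out.println(firstMatch("<x[^>]*> +</x>", nested));

        // A tag with an attribute and only a space inside is still found.
        System.out.println(firstMatch("<x[^>]*> +</x>", "Lorem <x a=\"v\"> </x> ipsum"));
    }
}
```

With the lazy dot, the first call returns the whole nested span reported in the question. With the negated class, the same text yields no match, and the attribute example yields <x a="v"> </x>.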
On a side note, it's been stated many times before, but you should not use regular expressions to parse XML/HTML/XHTML.