My goal is to analyze a postal code and identify its separate parts using a regular expression and the analyze-string function.
I use MarkLogic 10. When I use the regex only for matching, the example below validates correctly. However, when I use it to analyze the string, it fails to identify the various groups correctly:
(: analyze Dutch postal code :)
let $regex := "^[1-9]\d{3}([A-Z]{2}(\d+(\S+)?)?)?$"
return fn:analyze-string("1234AA11bis", $regex)
It returns the following:
<s:analyze-string-result xmlns:s="http://www.w3.org/2005/xpath-functions">
<s:match>1234<s:group nr="1">AA<s:group nr="2">1<s:group nr="3">1bis</s:group></s:group></s:group>
</s:match>
</s:analyze-string-result>
I expect it to return '11' as the value of group nr 2 and 'bis' as the value of group nr 3.
I tried some online regex analyzers, and they return the result I expect. Am I missing a flag, or is this just a bug in MarkLogic?
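I am not sure what the specs have to say about nested greedy patterns, but there is an easy fix. Because \S also matches digits, both \d+ and \S+ can claim the second '1' in your example. Excluding digits (and whitespace) from group 3 removes that ambiguity: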
let $regex := "^[1-9]\d{3}([A-Z]{2}(\d+([^\d\s]+)?)?)?$"
return fn:analyze-string("1234AA11bis", $regex)
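If you then want the separate parts as plain values, something along these lines might work. It is only a sketch: the wrapper element names (postal-code, digits, letters, number, suffix) are made up for illustration, and I am assuming the s prefix can be bound to the namespace shown in your output. It relies on each part being the direct text of its enclosing match or group element.

```xquery
xquery version "1.0-ml";
(: Assumption: same namespace as in the analyze-string result above :)
declare namespace s = "http://www.w3.org/2005/xpath-functions";

(: Same pattern as the fix: group 3 cannot start with a digit or whitespace :)
let $regex := "^[1-9]\d{3}([A-Z]{2}(\d+([^\d\s]+)?)?)?$"
let $match := fn:analyze-string("1234AA11bis", $regex)/s:match
return
  (: Hypothetical wrapper elements, for illustration only :)
  <postal-code>
    <digits>{ fn:string($match/text()[1]) }</digits>
    <letters>{ fn:string($match/s:group[@nr = "1"]/text()[1]) }</letters>
    <number>{ fn:string($match//s:group[@nr = "2"]/text()[1]) }</number>
    <suffix>{ fn:string($match//s:group[@nr = "3"]) }</suffix>
  </postal-code>
```

Taking text()[1] instead of the string value of a group matters here because the groups are nested: the string value of group 2 also includes whatever group 3 matched. Optional parts that are absent should simply come back as empty elements.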
HTH!