I have been trying to create a parser for Law texts.
I need to find a way to find "external links" like : art. 45 alin. (1) din Lege nr. 54/2000
But the problem is that my country law writing style is so, soooo lacking uniformity and that means sometimes the links might look like this : articolul 45 alineatul (1) din Legeea nr. 30/2000
The fact that my language has forms for words for days. (articol, articolului, articolelor....)
That means that i need to generalize that first thing... (art.) as to catch as many forms as possible and pray that the last thing is a law number & year (54/2000).
Now here comes the hard part... The problem is that every section that starts with Articol N starts the regex and it goes on and on until it finds a law number & year that have absolutely no relation between them.
This is how it looks \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question is there a way to limit the words between the two capturing groups?
Link to a Docs to determine what should pass and what not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in comments above, I used (?:\S+\s*){0,15}
. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}
.