python regex python-re roman-numerals

Recognize roman numeral followed by '.', space and then capital letter. (RegEx)


Can someone please help me with this?

I'm trying to match roman numerals with a "." at the end and then a space and a capital letter after the point. For example:

I. And here is a line.

II. And here is another line.

X. Here is again another line.

So, the regex should match the "I. A", "II. A" and "X. H".

I tried "^(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}){1,4}\.\s[A-Z]", but the problem is that this regex also matches ". A", which I don't want.

In summary, it should have at least one Roman numeral, followed by a "." and then a space and a capital letter.


Solution

  • You need a (?=[LXVI]) lookahead at the start that requires at least one Roman numeral letter at the start of the string:

    ^(?=[LXVI])(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.\s[A-Z]
    # ^^^^^^^^^
    

    See the regex demo. Not sure why you used {1,4}; I suggest removing it, since repeating the units group also lets through invalid sequences such as IXIX.

    Another workaround here would be to use a word boundary right after ^:

    ^\b(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.\s[A-Z]
    #^^
    

    This disallows a match where . appears at the start: \b at the start of the string only matches if the next char is a word char, and the rest of the pattern then requires that char to be a Roman numeral letter.

    Regarding \.\s[A-Z], you may enhance it if you add + or * after \s. If you ever need to check for it but exclude it from the match, turn it into a positive lookahead, (?=\.\s+[A-Z]) or (?=\.\s*[A-Z]).
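To see how the patterns above behave in Python, here is a minimal sketch using the standard re module. The constant names and the first_match helper are made up for this illustration, and the sample lines are the ones from the question plus a ". And" line for the unwanted case.

```python
import re

# Pattern from the question: every part before the dot can match the empty
# string, so a line starting with ". A" is accepted.
ORIGINAL = re.compile(r"^(XC|XL|L?X{0,3})(IX|IV|V?I{0,3}){1,4}\.\s[A-Z]")

# Fix 1: the lookahead requires a Roman numeral letter at the start.
WITH_LOOKAHEAD = re.compile(r"^(?=[LXVI])(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.\s[A-Z]")

# Fix 2: a word boundary right after ^ rejects a leading ".".
WITH_BOUNDARY = re.compile(r"^\b(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.\s[A-Z]")

# Variant: the ". X" tail is only checked by a lookahead (and allows one or
# more whitespace chars), so the match itself contains just the numeral.
NUMERAL_ONLY = re.compile(r"^(?=[LXVI])(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(?=\.\s+[A-Z])")


def first_match(pattern, line):
    """Return the matched text at the start of line, or None if no match."""
    m = pattern.match(line)
    return m.group(0) if m else None


if __name__ == "__main__":
    lines = [
        "I. And here is a line.",
        "II. And here is another line.",
        "X. Here is again another line.",
        ". And this line should be rejected.",
    ]
    for line in lines:
        print(repr(line))
        print("  original :", first_match(ORIGINAL, line))
        print("  lookahead:", first_match(WITH_LOOKAHEAD, line))
        print("  boundary :", first_match(WITH_BOUNDARY, line))
        print("  numeral  :", first_match(NUMERAL_ONLY, line))
```

For the last sample line, only the original pattern returns a match (". A"); the three fixed patterns return None. If the lines come from one multi-line string, passing re.MULTILINE and using finditer would be the usual adjustment, since ^ then matches at every line start.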