java regex scjp

Confusion over Regex greedy operator, and terminating character


I'm studying for the SCJP exam, and the following mock question caught me off guard. The explanation in the tool wasn't very good, so I'm hoping the knowledgeable people of SO can explain it.

With the regex of C.*L, identify the words it would capture from CooLooLCuuLooC

I selected CooL and CuuL. I believed it would look for a starting C, then take any character zero or more times until it found an L, and then stop.

However, the answer is actually CooLooLCuuL. I'm confused as to how the first two L's make it through.

Could anyone please clear this up for me?

Thanks


Solution

  • Just one more possibly useful explanation:

    The .* matches any character (except, by default, line terminators) zero or more times; you understood that, generally. However, .*? also meets that definition. The difference is greediness:

    • .* matches as many characters as it can, and gives characters back only if the rest of the pattern cannot otherwise match ('greedy')
    • .*? matches as few characters as it can, extending only until the following expression can be matched ('non-greedy' or 'reluctant')

    Thus, C.*L will find a capital C, then match ooLooLCuuLooC with .*. It then has to match a capital L, but at the end of the string that is not possible. So the engine backtracks, making .* give back one character at a time until the literal L can match: .* ends up holding ooLooLCuu, and the L of the pattern matches the last L in the input. Result: CooLooLCuuL

    If you were to use C.*?L, it will find C, then try to match L straight away. The next character is o, so that fails, and .*? expands to match o and tests the following o against L. That fails too, so it expands to oo and tests the next character, L, against L. This succeeds, and it returns CooL.
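
    The two walkthroughs above can be checked with a small sketch. The class name GreedyVsReluctant and the helper findAll are made up for illustration; only Pattern and Matcher come from java.util.regex. It repeatedly calls find() and collects every match of each pattern in the string from the question:

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GreedyVsReluctant {
        // Hypothetical helper: collects every successive match of regex in input, in order.
        static List<String> findAll(String regex, String input) {
            List<String> matches = new ArrayList<>();
            Matcher m = Pattern.compile(regex).matcher(input);
            while (m.find()) {
                matches.add(m.group());
            }
            return matches;
        }

        public static void main(String[] args) {
            String input = "CooLooLCuuLooC";
            // Greedy: .* grabs everything, then backtracks to the last L.
            System.out.println(findAll("C.*L", input));   // prints [CooLooLCuuL]
            // Reluctant: .*? stops at the first L it can reach, so find() succeeds twice.
            System.out.println(findAll("C.*?L", input));  // prints [CooL, CuuL]
        }
    }
    ```

    Note that the trailing ooC never yields a match with either pattern, since there is no L after the last C.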

    A third option for matching either CooL or CuuL (that is, any string that starts with C and runs up to the nearest L) would be C[^L]*L. This matches C, then any number of characters that are not a capital L, then a capital L.
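
    As a rough illustration of the negated character class, the sketch below runs C[^L]*L over the same input and records where each match starts. The class name NegatedClassDemo and the method describeMatches are hypothetical names chosen for this example:

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NegatedClassDemo {
        // Hypothetical helper: lists each match of C[^L]*L with its start index, e.g. "CooL@0".
        static String describeMatches(String input) {
            Matcher m = Pattern.compile("C[^L]*L").matcher(input);
            StringBuilder out = new StringBuilder();
            while (m.find()) {
                if (out.length() > 0) {
                    out.append(", ");
                }
                out.append(m.group()).append('@').append(m.start());
            }
            return out.toString();
        }

        public static void main(String[] args) {
            // [^L]* cannot cross an L, so each match ends at the first L after its C.
            System.out.println(describeMatches("CooLooLCuuLooC")); // prints CooL@0, CuuL@7
        }
    }
    ```

    Because [^L]* can never consume an L, no backtracking past an L is needed here, and the result is the same as with the reluctant C.*?L for this input.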