Tags: php, regex, string, backreference, camelcasing

PHP regex and adjacent capturing groups


I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.

I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:

HelloWorldThisIsATest => hello-world-this-is-a-test

My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:

mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));

The result:

hello-world-this-is-atest

This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.

What am I doing wrong?


Solution

  • The Reason your Regex will Not Work: Overlapping Matches

    • Your regex matches sA in IsATest, allowing you to insert a - between the s and the A
    • In order to insert a - between the A and the T, the regex would have to match AT.
    • This is impossible because the A has already been consumed as part of sA, and the engine resumes searching after it, at the T. Matches found in a single preg_replace pass cannot overlap.
    • Is all hope lost? No! This is a perfect situation for lookarounds.
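To see the problem concretely, here is a minimal sketch that lists what the original pattern actually matches. The variable names ($input, $pairs, $offsets) are only illustrative and are not part of the question's code.

```php
<?php
// The pattern from the question: any letter followed by an upper-case letter.
$input = "HelloWorldThisIsATest";

// PREG_OFFSET_CAPTURE makes each match a [text, offset] pair.
preg_match_all('/([A-Za-z])([A-Z])/', $input, $matches, PREG_OFFSET_CAPTURE);

$pairs   = array_column($matches[0], 0); // matched two-letter substrings
$offsets = array_column($matches[0], 1); // where each match starts

foreach ($matches[0] as [$text, $offset]) {
    echo "$offset: $text\n";
}
```

The list contains sA at offset 15 but no AT: once sA is consumed, the search continues at offset 17 with Te, which does not match.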

    Do it in Two Easy Lines

    Here's the easy way to do it with regex:

    $regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
    echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));
    

    See the output at the bottom of the PHP demo:

    Output: hello-world-this-is-a-test

    How It Works

    • The regex doesn't match any characters. Rather, it targets positions in the string: the positions just before each upper-case letter that follows another letter. To do so, it uses a lookbehind and a lookahead.
    • The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter.
    • The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
    • We just replace these positions with a -, and convert the lot to lowercase.
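The points above can be checked with a short sketch that prints the zero-width positions the regex finds and splits the string at them. The helper name camel_to_hyphenated is hypothetical, introduced here only to wrap the two lines from the answer. It assumes a purely alphabetic ASCII input, as stated in the question.

```php
<?php
// Zero-width pattern: a letter behind, an upper-case letter ahead.
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
$input = "HelloWorldThisIsATest";

// Every match is an empty string, so only the offsets are informative.
preg_match_all($regex, $input, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

// Splitting at the same positions yields the individual words.
$words = preg_split($regex, $input);

// Hypothetical helper: insert a hyphen at each position, then lowercase.
function camel_to_hyphenated(string $s): string
{
    return strtolower(preg_replace('~(?<=[a-zA-Z])(?=[A-Z])~', '-', $s));
}

echo implode(', ', $offsets), "\n";
echo implode(' | ', $words), "\n";
echo camel_to_hyphenated($input), "\n";
```

Because no characters are consumed, the positions at offsets 16 and 17 (before the A and before the T) can both match, which is exactly what the original pattern could not do.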

    If you look carefully at this regex101 screen, you can see lines between the words, where the regex matches.
