
Split string on "and" and a few symbols, but prevent split on hyphens if surrounded by digits


For some data processing I need to split a string into multiple items. An example of an input string is:

'one, two & three and four-five 123-456'

Now, I need to separate this string into items, where the possible delimiters are a comma (,), an ampersand (&), a space, the word "and", and a hyphen (-). But, and this is the point where I'm stuck, it should not split on a - when it is between two numbers.

I am using PHP and preg_split to do the actual splitting, but I need a regex pattern that matches the delimiters while excluding a - that sits between two numbers (single digits or longer numbers, as in 123-456). Whitespace around each item is removed with trim() in PHP.

I am using the following regex pattern:

/(and|,|\s|&)|\D(-)\D/

The output (after using preg_split, etc) is:

[0] => one
[1] => two
[2] => three
[3] => fou
[4] => ive
[5] => 123-456

The result is nearly correct, but the match for the - delimiter also takes the last letter before it and the first letter after it. The item 123-456 is correct, since the pattern should not match (and preg_split should not split) on a - that is immediately surrounded by digits.

Expected output is:

[0] => one
[1] => two
[2] => three
[3] => four
[4] => five
[5] => 123-456

Any help is appreciated. If any information is lacking, let me know and I'll update my question.


Solution

  • What you want to use is lookahead and lookbehind (more generally known as lookaround):

    /and|,|\s|&|(?<!\d)-(?!\d)/
    

    A lookaround does exactly what the name implies: it looks around the current position to check whether a pattern matches there, without making that text part of the match. In this case, the pattern matches a - only when there is no digit (\d) directly before it and no digit directly after it, and the match itself is only the -, so the neighbouring letters are kept.

    Here, (?<!\d) is a negative lookbehind: it looks at the character immediately before the current position and succeeds only if that character is not a digit. If it is a digit, the match fails at that position and the engine moves on. Likewise, (?!\d) is a negative lookahead, which performs the same check on the character immediately after. Because the - is sandwiched between them, the effect is "match a - only if neither neighbouring character is a digit", so four-five is split while 123-456 is left intact.
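
    To see the pattern in context, here is a minimal sketch of the whole split. The function name split_items is hypothetical and not part of the original post, and the PREG_SPLIT_NO_EMPTY flag is an assumption about how the empty pieces between adjacent delimiters (such as ", ") are meant to be discarded.

    ```php
    <?php
    // Hypothetical helper for illustration; the name is not from the original post.
    function split_items(string $input): array
    {
        // Delimiters: "and", comma, whitespace, ampersand, or a hyphen with
        // no digit directly before it and no digit directly after it.
        $pattern = '/and|,|\s|&|(?<!\d)-(?!\d)/';

        // PREG_SPLIT_NO_EMPTY drops the empty strings produced where two
        // delimiters are adjacent (for example ", " or " & ").
        $items = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);

        // Trimming mirrors the trim() step described in the question.
        return array_map('trim', $items);
    }

    print_r(split_items('one, two & three and four-five 123-456'));
    // Items, in order: one, two, three, four, five, 123-456
    ```

    Note that with this pattern a hyphen that has a digit on only one side, as in a-1, is not treated as a delimiter either. If such hyphens should split, one possible variant of the hyphen part is (?<!\d)-|-(?!\d), which rejects a - only when both neighbours are digits.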