Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)


I need to normalize some texts (product descriptions) with regard to the correct usage of the . , : symbols (no space before and exactly one space after).

The regex I've come up with is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

The problem is that this matches four cases it shouldn't touch:

  • Any decimal number, like 5.5
  • Any thousand separator, like 4,500
  • A "fixed" phrase in Greek, ό,τι
  • The ellipsis symbol, ... - The ellipsis is a special case that should perhaps be handled in a separate preg_replace. The three dots should be treated as one unit: "some text ..." should be matched and converted to "some text...", but not to "some text. . ."
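The four unwanted cases listed above can be reproduced with a short script. This is only a sketch: the function name normalizeNaive and the sample strings are made up for illustration, and the pattern is the one quoted in the question.

```php
<?php
// Hypothetical helper (not from the original code) wrapping the question's pattern.
function normalizeNaive(string $text): string
{
    // Drop whitespace before . , : and put one space after,
    // unless the literal "<br />" follows directly.
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text) ?? $text;
}

// Illustrative inputs, one per problem case.
$samples = [
    'Costs 5.5 euros',     // decimal number
    'About 4,500 units',   // thousands separator
    'ό,τι',                // fixed Greek phrase
    'some text ... more',  // ellipsis
];

foreach ($samples as $sample) {
    echo normalizeNaive($sample), PHP_EOL;
}
// Costs 5. 5 euros
// About 4, 500 units
// ό, τι
// some text. . . more
```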

For the numeric exceptions in particular, I know this can be achieved with negative lookahead/lookbehind, but I haven't managed to combine them with my current pattern.

This is a fiddle for you to check (the cases that shouldn't be matched are in lines 2, 3, 4).

EDIT: Both of the solutions posted below work fine, but they end up adding a space after the last full stop of the description. This is not much of a problem: earlier in my code I was already stripping the <br />s and spaces at the beginning and end of the description, so I moved this preg_replace before that one.

So, the final code I ended up using is this:

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $variation['DESCRIPTION']);
$variation['DESCRIPTION'] = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $variation['DESCRIPTION']);
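To see how the two calls interact, here is a small sketch that chains them in the same order. The wrapper name cleanDescription and the sample string are assumptions for illustration; the two patterns are copied from the lines above.

```php
<?php
// Hypothetical wrapper (not from the original code) around the two calls above.
function cleanDescription(string $text): string
{
    // Step 1: normalize spacing around . , : while skipping digit.digit,
    // digit,digit and the phrase ό,τι. This may leave a trailing space.
    $text = preg_replace(
        '#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui',
        '$1 ',
        $text
    ) ?? $text;

    // Step 2: strip leading/trailing whitespace and <br /> tags,
    // which also removes the trailing space left by step 1.
    return preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $text) ?? $text;
}

echo cleanDescription('<br /> Weight :4,500 g , height 5.5 cm .<br />'), PHP_EOL;
// Weight: 4,500 g, height 5.5 cm.
```

Running the normalization first matters here: the space it appends after the final full stop is removed by the trimming step that follows.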

So the only thing that's left to achieve is alter this code so that it treats the ellipsis the way I describe above.

Any help with this last requirement will be very much appreciated! TIA


Solution

  • You can add two lookaheads containing lookbehinds:

    \s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*
    

    See the regex demo. Note that I also added \s* to the last lookahead and moved it before the consuming \s*, so the match fails if <br/> follows the : , or . after zero or more whitespace characters.

    Details

    • \s* - zero or more whitespaces
    • (\.{2,}|[:,.]...) - Group 1: two or more dots, or a single : , or . that also passes the two lookaheads below (they sit inside the group, right after [:,.])
    • (?!(?<=ό,)τι) - fail the match if the next two chars are τι preceded with ό,
    • (?!(?<=\d.)\d) - fail the match if the next char is a digit and the two chars before it are a digit plus any one char (a plain . is enough in the lookbehind because [:,.] has already matched the separator; here we only need to "jump" over that matched char)
    • (?!\s*<br\s*/>) - a negative lookahead that fails the match if there are zero or more whitespaces, <br, zero or more whitespaces, /> immediately to the right of the current location.
    • \s* - zero or more whitespaces.
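As a quick check of the breakdown above, the pattern can be dropped into preg_replace with the same '$1 ' replacement used in the question. This is a minimal sketch: the function name normalizePunctuation and the sample strings are hypothetical, and the u modifier is added on the assumption that the input is valid UTF-8.

```php
<?php
// Hypothetical helper (name chosen for illustration) using the answer's pattern.
function normalizePunctuation(string $text): string
{
    $pattern = '#\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*#u';

    // preg_replace returns null on failure (for example invalid UTF-8 with /u),
    // so fall back to the untouched input in that case.
    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

$samples = [
    'some text ... more',       // ellipsis kept as one unit
    'Costs 5.5 euros',          // decimal number left alone
    'About 4,500 units',        // thousands separator left alone
    'ό,τι',                     // Greek phrase left alone
    'Hello ,world .Next:item',  // ordinary punctuation normalized
    'Line one.<br />Line two',  // punctuation before <br /> left alone
];

foreach ($samples as $sample) {
    echo normalizePunctuation($sample), PHP_EOL;
}
// some text... more
// Costs 5.5 euros
// About 4,500 units
// ό,τι
// Hello, world. Next: item
// Line one.<br />Line two
```

A final full stop at the very end of the string still gets a space appended ('End.' becomes 'End. '), which matches the behaviour described in the question's edit.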