I need to normalize some texts (product descriptions) with regard to the correct usage of the `.`, `,` and `:` symbols (no space before and one space after).
The regex I've come up with is this:
$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);
The problem is that this matches four cases it shouldn't touch:
ό,τι
... - Basically, the ellipsis is a totally special case that I think should perhaps be handled in a separate preg_replace. The three dots should be treated as one unit, meaning that `some text ...` should indeed be matched and converted to `some text...`, but not to `some text. . .`
Especially for the numeric exception, I know it can be achieved with some negative lookahead/lookbehind but unfortunately I can't combine them in my current pattern.
This is a fiddle for you to check (the cases that shouldn't be matched are in lines 2, 3, 4).
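To make the problem concrete, here is a small sketch that runs the original pattern over a few sample strings. The inputs are made up for illustration (they are not taken from the fiddle), and `$samples` and `$results` are just local names for the demo.

```php
<?php
// Original pattern from the question, applied to illustrative sample strings.
$pattern = '#\s*([:,.])\s*(?!<br />)#';

$samples = [
    'Weight :2 kg',   // the intended fix: becomes 'Weight: 2 kg'
    '2.5 kg',         // decimal number: wrongly becomes '2. 5 kg'
    'ό,τι',           // Greek word: wrongly becomes 'ό, τι'
    'some text ...',  // ellipsis: wrongly becomes 'some text. . . '
];

$results = [];
foreach ($samples as $s) {
    $results[$s] = preg_replace($pattern, '$1 ', $s);
    echo "'{$s}' => '{$results[$s]}'\n";
}
```

Each dot of the ellipsis is matched separately, which is why it ends up spaced out.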
EDIT: Both of the solutions posted below work fine, but they end up adding a space after the last full stop of the description. This is not much of a problem: earlier in my code I was already stripping the `<br />`s and spaces at the beginning and end of the description, so I moved this preg_replace before that one.
So, the final code I ended up using is this:
$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $variation['DESCRIPTION']);
$variation['DESCRIPTION'] = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $variation['DESCRIPTION']);
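As a rough illustration of why the order matters, the sketch below runs the two calls above on a made-up description (the sample string and the plain `$description` variable are assumptions for the demo, standing in for `$variation['DESCRIPTION']`).

```php
<?php
// Made-up description with stray <br /> tags and misplaced punctuation.
$description = '<br />Size :10,5 cm .Nice.<br />';

// Step 1: normalize punctuation. This leaves a space after the final full stop.
$description = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $description);

// Step 2: strip leading/trailing <br /> tags and whitespace,
// which also removes the space added in step 1.
$description = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $description);

echo $description, "\n"; // Size: 10,5 cm. Nice.
```

Running the trim step first would leave the trailing space from the punctuation step in place.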
So the only thing left to achieve is to alter this code so that it treats the ellipsis the way I describe above.
Any help with this last requirement will be very much appreciated! TIA
You can add two lookaheads containing lookbehinds:
\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*
See the regex demo. Note that I also added `\s*` to the last lookahead and swapped it with the consuming `\s*` to fail the match if there is `<br/>` after any zero or more whitespaces after the `:`, `,` or `.`.
Details
`\s*` - zero or more whitespaces
`(\.{2,}|[:,.])` - Group 1: two or more dots, or a `:`, `,` or `.`
`(?!(?<=ό,)τι)` - fail the match if the next two chars are `τι` preceded with `ό,`
`(?!(?<=\d.)\d)` - fail the match if the next char is a digit preceded with a digit and any char (note that a `.` is enough since the `[:,.]` already matches the char allowed/required; here, we just need to "jump" over that matched char)
`(?!\s*<br\s*/>)` - a negative lookahead that fails the match if there are zero or more whitespaces, `<br`, zero or more whitespaces, `/>` immediately to the right of the current location
`\s*` - zero or more whitespaces.
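A minimal sketch of how the pattern above might be used from PHP follows. The helper name `normalizePunctuation` is hypothetical, the `u` modifier is an assumption carried over from the question's final code, and the sample strings are made up.

```php
<?php
// Hypothetical helper wrapping the pattern above; the name is illustrative only.
function normalizePunctuation(string $text): string
{
    // The 'u' modifier (an assumption, mirroring the question's final code)
    // makes PCRE treat both pattern and subject as UTF-8.
    $pattern = '#\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*#u';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '$1 ', $text) ?? $text;
}

echo normalizePunctuation('some text ... more'), "\n"; // some text... more
echo normalizePunctuation('Price :2.5 kg'), "\n";      // Price: 2.5 kg
echo normalizePunctuation('ό,τι'), "\n";               // ό,τι (unchanged)
echo normalizePunctuation('End.<br />Next'), "\n";     // End.<br />Next (unchanged)
```

Note that a full stop at the very end of the string still gets a space appended, so any trimming of the description should run after this step.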