I have multiple strings and I need to remove repeated chars. For example, the string "here abbbbbc x" should become "here abc x", and the string "test jjka" should become "test jka".
After studying, I came up with the code below which works fine (it uses PHP but you can use any language):
echo preg_replace("/([a-z])\\1+/","$1","test ajjjo new");
The code above outputs "test ajo new", which is great!
My problem now is that I need to replace the repeated chars only if they are inside a word or at the beginning or end of a word, not when they make up the whole word. For example, I need the string "here bbb cca" to become "here bbb ca", and the string "test hjjjja ppp" to become "test hja ppp". I tried negating the space character, ^ and $, but it all becomes a mess pretty fast.
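To make the problem concrete, here is a minimal sketch that runs the original pattern on the two strings above. The helper name collapseAllRepeats is hypothetical, introduced only for illustration. It shows that whole-word runs such as "bbb" and "ppp" get collapsed too, which is the unwanted behaviour:

```php
<?php
// Hypothetical helper wrapping the original pattern from the question.
function collapseAllRepeats(string $text): string
{
    // ([a-z])\1+ matches any run of two or more identical lowercase letters,
    // regardless of where the run sits in the word.
    // preg_replace returns null on a PCRE error, so fall back to the input.
    return preg_replace('/([a-z])\1+/', '$1', $text) ?? $text;
}

echo collapseAllRepeats('here bbb cca'), "\n";    // here b ca   (wanted: here bbb ca)
echo collapseAllRepeats('test hjjjja ppp'), "\n"; // test hja p  (wanted: test hja ppp)
```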
What would you recommend?
A simpler solution, as I thought there ought to be, making use of the "best regex trick ever" (https://www.rexegg.com/regex-best-trick.html):
\b(?<whole_word>[a-z])\k{whole_word}++\b(*SKIP)(*FAIL)|(?<not_whole_word>[a-z])\k{not_whole_word}++
which is exactly the same as (but less compact than) what @Wiktor Stribiżew commented:
\b([a-z])\1+\b(*SKIP)(*F)|([a-z])\2+
and replace with:
$not_whole_word
See: https://regex101.com/r/pa0GjG/1
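As a rough sketch of how this could be used from PHP (the language of the question), the compact pattern can be passed to preg_replace. The function name collapseRepeats is hypothetical. Note that PHP's preg_replace only accepts numbered references in the replacement string, so the sketch uses $2 rather than $not_whole_word:

```php
<?php
// Hypothetical helper: collapse runs of a repeated lowercase letter,
// except when the run makes up a whole word.
function collapseRepeats(string $text): string
{
    // Left branch: a whole word made of one repeated letter is matched and
    // then thrown away with (*SKIP)(*FAIL), so the right branch never sees it.
    // Right branch: any other run of a repeated letter, captured as group 2.
    $pattern = '/\b([a-z])\1+\b(*SKIP)(*FAIL)|([a-z])\2+/';

    // preg_replace returns null on a PCRE error, so fall back to the input.
    return preg_replace($pattern, '$2', $text) ?? $text;
}

echo collapseRepeats('here bbb cca'), "\n";    // here bbb ca
echo collapseRepeats('test hjjjja ppp'), "\n"; // test hja ppp
echo collapseRepeats('test jjka'), "\n";       // test jka
```

Only the right-hand branch can ever produce a match, so group 2 is always set when a replacement happens.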
Explanation:
\b
if you find a whole word, i.e. a word boundary
(?<whole_word>[a-z])\k{whole_word}++
followed by a character that makes up the whole word until the
\b
end of the word
(*SKIP)(*FAIL)
then do not match
|
in every other case
(?<not_whole_word>[a-z])
match a character that is
\k{not_whole_word}++
repeated
OLD IDEA
You could use:
(?:(\b)|\B)(?!\k{char})(?<anything>.)(?<char>[a-z])\k{char}++(?(1)\B)
and replace with
$anything$char
See: https://regex101.com/r/yCNKY1/1
I guess there is a more obvious answer, but this should also work.
(?:(\b)|\B)
check whether you are at the beginning of a word or not. If so, group 1 will be set.
(?!\k{char})
check that the character of interest is not preceded by itself
(?<anything>.)
i.e. it must be preceded by anything else
(?<char>[a-z])
match the character
\k{char}++
match any number of repetitions and do not give them up
(?(1)\B)
ensure that, if the start of the match was the start of a word, you are now not at the end, so you cannot match a complete word.
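For completeness, a minimal PHP sketch of this older pattern. The function name collapseRepeatsOld is hypothetical, and the numbered replacement $2$3 assumes the group order in the pattern: 1 is (\b), 2 is anything, 3 is char. It is only checked here against the example strings from the question:

```php
<?php
// Hypothetical helper wrapping the older pattern.
function collapseRepeatsOld(string $text): string
{
    // Group numbering: 1 = (\b), 2 = "anything", 3 = "char".
    $pattern = '/(?:(\b)|\B)(?!\k{char})(?<anything>.)(?<char>[a-z])\k{char}++(?(1)\B)/';

    // Keep the preceding character and a single copy of the repeated letter.
    // preg_replace returns null on a PCRE error, so fall back to the input.
    return preg_replace($pattern, '$2$3', $text) ?? $text;
}

echo collapseRepeatsOld('here bbb cca'), "\n";    // here bbb ca
echo collapseRepeatsOld('test hjjjja ppp'), "\n"; // test hja ppp
```

Because the pattern needs one preceding character for the "anything" group, a run that begins at the very start of the string is presumably not covered by this variant.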