
How to use `preg_replace` to remove repeated chars with spaces around


I have multiple strings and I need to remove repeated characters. For example, the string `here abbbbbc x` should become `here abc x`, and the string `test jjka` should become `test jka`.

After some study, I came up with the code below, which works fine (it is PHP, but any language will do):

echo preg_replace("/([a-z])\\1+/","$1","test ajjjo new");

The code above outputs `test ajo new`, which is great!

My problem now is that I need to replace the repeated characters only if they are inside a word, or at the beginning or end of a word, but not when they make up the whole word. For example, I need the string `here bbb cca` to become `here bbb ca`, and the string `test hjjjja ppp` to become `test hja ppp`. I tried negating the space, `^` and `$`, but it all becomes a mess pretty fast.
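To make the problem concrete, here is a small sketch that runs the pattern from above on the two inputs just mentioned. The variable name `$naive` is only illustrative.

```php
<?php
// The original pattern: any lowercase letter followed by itself one or more times.
$naive = '/([a-z])\1+/';

// Words that consist only of one repeated letter get collapsed too,
// which is exactly what should NOT happen.
echo preg_replace($naive, '$1', 'here bbb cca'), PHP_EOL;    // here b ca
echo preg_replace($naive, '$1', 'test hjjjja ppp'), PHP_EOL; // test hja p
```

The desired results are `here bbb ca` and `test hja ppp` instead.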

What would you recommend?


Solution

  • A simpler solution, as I thought there ought to be, making use of the "best regex trick ever" (https://www.rexegg.com/regex-best-trick.html):

    \b(?<whole_word>[a-z])\k{whole_word}++\b(*SKIP)(*FAIL)|(?<not_whole_word>[a-z])\k{not_whole_word}++
    

    which is equivalent to, but less compact than, what @Wiktor Stribiżew commented:

    \b([a-z])\1+\b(*SKIP)(*F)|([a-z])\2+
    

    and replace with:

    $not_whole_word
    

    See: https://regex101.com/r/pa0GjG/1
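As a minimal sketch of how this could be used from PHP: the function name `collapse_repeats` is hypothetical, and the compact numbered-group pattern is used. Note that `preg_replace` only accepts numbered references in the replacement string, so `$2` stands in for `$not_whole_word` here.

```php
<?php
// Hypothetical helper name; the pattern is the compact one quoted above.
function collapse_repeats(string $text): string
{
    // Left branch: a word made up of one repeated letter is matched, then
    // (*SKIP)(*F) rejects it and stops the engine from retrying inside it.
    // Right branch: any other run of a repeated letter, captured as group 2.
    $pattern = '/\b([a-z])\1+\b(*SKIP)(*F)|([a-z])\2+/';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '$2', $text) ?? $text;
}

echo collapse_repeats('here bbb cca'), PHP_EOL;    // here bbb ca
echo collapse_repeats('test hjjjja ppp'), PHP_EOL; // test hja ppp
```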


    Explanation:

    • \b starting at a word boundary,
    • (?<whole_word>[a-z])\k{whole_word}++ match a letter followed by one or more repetitions of itself, which must run all the way to
    • \b the end of the word;
    • (*SKIP)(*FAIL) if that succeeded, reject the match and do not retry inside it
    • | in every other case
    • (?<not_whole_word>[a-z]) match a letter that is
    • \k{not_whole_word}++ repeated one or more times
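The same "match what you want to skip, capture what you want to keep" idea can also be sketched without backtracking verbs, by deciding in a callback which branch matched. This is a variation, not the pattern above: the function name `collapse_repeats_cb` is hypothetical, and the arrow function assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper; same alternation as above, but without (*SKIP)(*F).
function collapse_repeats_cb(string $text): string
{
    $pattern = '/\b([a-z])\1+\b|([a-z])\2+/';

    $result = preg_replace_callback(
        $pattern,
        // Group 2 only exists in $m when the right-hand branch matched.
        // Otherwise a whole word matched, and it is returned unchanged.
        fn(array $m): string => $m[2] ?? $m[0],
        $text
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $text;
}

echo collapse_repeats_cb('here bbb cca'), PHP_EOL;    // here bbb ca
echo collapse_repeats_cb('test hjjjja ppp'), PHP_EOL; // test hja ppp
```

This version trades the verbs for a little PHP logic, which may be easier to port to engines that do not support `(*SKIP)(*FAIL)`.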

    OLD IDEA


    You could use:

    (?:(\b)|\B)(?!\k{char})(?<anything>.)(?<char>[a-z])\k{char}++(?(1)\B)
    
    

    and replace with

    $anything$char
    

    See: https://regex101.com/r/yCNKY1/1

    There is probably a more obvious answer, but this should work as well.


    • (?:(\b)|\B) checks whether you are at the beginning of a word or not. If so, group 1 is set.
    • (?!\k{char}) checks that the character of interest is not preceded by itself,
    • (?<anything>.) i.e. it must be preceded by some other character.
    • (?<char>[a-z]) matches the character.
    • \k{char}++ matches all of its repetitions and does not give them back.
    • (?(1)\B) ensures that, if the start of the match was the start of a word, you are now not at the end of a word, so a complete word cannot match.