php regex unicode zalgo

How to prevent zalgo text using php


I have some problems with Zalgo on my imageboard.

Texts like below mess up my imageboard. Is there a way to prevent these characters and "fix" or clean up the texts?

Example text Source:

ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

I tried to use this solution:

$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);

Taken from here: Remove special characters that mess with formatting. But it only works for Latin characters. Can anyone help me?


Solution

  • This regular expression removes every combining mark (Unicode category M) from the $text variable:

    $text = preg_replace("~[\p{M}]~uis","", $text);
    

    If $text contains a character with a combining mark, for example กิ, this regex removes the mark and the resulting $text contains just ก.

    I then improved the regex so that it filters only the second level of marks, that is, marks stacked on top of another mark:

    $text = preg_replace("~(?:[\p{M}]{1})([\p{M}])+?~uis","", $text);
    

    This regex only matches runs of two or more consecutive marks, so a character that carries a single mark is left alone. Use it if you need to keep German or other languages that legitimately use diacritics. It transforms this word:

    ͐̈ͩ̎Zͮ͌ͦ͆ͦͤÃ̉͛̄ͭ̈̚LͫG̉̋͂̉Oͨ͌̋͗!

    into this: ZÄLͫGO!

    I hope the second regex helps you.
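As a variation on the second regex, here is a minimal sketch that keeps the first mark of each run and drops the rest, instead of matching the marks in pairs. The function name limitCombiningMarks and the $maxMarks parameter are made up for illustration and are not part of the answer above. It assumes PHP 7 or later and valid UTF-8 input.

```php
<?php
// Hypothetical helper (not from the original answer): keep at most $maxMarks
// combining marks after each base character and drop the rest of the run.
function limitCombiningMarks(string $text, int $maxMarks = 1): string
{
    if ($maxMarks < 1) {
        // Same effect as the first regex: remove every combining mark.
        return preg_replace('~\p{M}+~u', '', $text) ?? $text;
    }

    // Capture the first $maxMarks marks of a run, then match (and discard)
    // whatever marks follow them. Runs no longer than $maxMarks never match.
    $pattern = '~(\p{M}{' . $maxMarks . '})\p{M}+~u';

    // preg_replace() returns null on failure (for example invalid UTF-8);
    // this sketch simply falls back to the unmodified input in that case.
    return preg_replace($pattern, '$1', $text) ?? $text;
}

// "Z" with three stacked marks, "A" with two, "L" with one.
$zalgo = "Z\u{0351}\u{036E}\u{034C}A\u{0308}\u{0309}L\u{036B}GO!";
echo limitCombiningMarks($zalgo), PHP_EOL;
```

With the default of one mark, text such as กิ or a decomposed ä passes through unchanged, because a run of a single mark never matches. Precomposed letters such as Ä (U+00C4) are not in category M at all, so they are never touched. Languages that legitimately stack more than one mark per character may need a higher $maxMarks.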