Search code examples
phpregexnon-latin

Remove all special chars, but not non-Latin characters


I'm using this PHP function for SEO urls. It's working fine with Latin words, but my urls are on Cyrillic. This regex - /[^a-z0-9_\s-]/ is not working with Cyrillic chars, please help me to make it works with non-Latin chars.

function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}

Solution

  • You need to use a Unicode script for Cyrillic alphabet that fortunately PHP PCRE supports it using \p{Cyrillic}. Besides you have to set u (unicode) flag to predict engine behavior. You may also need i flag for enabling case-insensitivity like A-Z:

    ~[^\p{Cyrillic}a-z0-9_\s-]~ui
    

    You don't need to double escape \s.

    PHP code:

    preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);