php, regex, preg-replace, preg-match

Regular expression - preg_match Latin and Greek characters


I am trying to create a regular expression for any given string.

Goal: remove ALL characters that are not Latin letters, lowercase Greek letters, or digits.

What I have done so far: [^a-z0-9]
This works perfectly for Latin characters.

When I try this: [^a-z0-9α-ω], no luck. It works, BUT it leaves out any other symbol like !!#$%@%#$@,

My knowledge is limited when it comes to regexp. Any help would be much appreciated!

EDIT:
Below is the function that matches the specified characters and creates a slug from the string, using a dash as the separator:

        $q_separator = preg_quote('-');
        $trans = array(
            '&.+?;'                 => '',
            '[^a-z0-9 -]'           => '',
            '\s+'                   => $separator,
            '('.$q_separator.')+'   => $separator
        );

        $str = strip_tags($str);

        foreach ($trans as $key => $val){
            $str = preg_replace("#".$key."#i", $val, $str);
        }

        if ($lowercase === TRUE){
            $str = strtolower($str);
        }

        return trim($str, '-');  

So if the string is: OnCE upon a tIME !#% @$$ in MEXIco
Using the function the output will be: once-upon-a-time-in-mexico

This works fine, but I want the preg_match to also exclude Greek characters.


Solution

  • Ok, can this replace your function?

    $subject = 'OnCEΨΩ é-+@àupon</span> aαθ tIME !#%@$ in MEXIco in the year 1874 <or 1875';
    
    function format($str, $excludeRE = '/[^a-z0-9]+/u', $separator = '-') {
        $str = strip_tags($str);
        $str = strtolower($str);
        $str = preg_replace($excludeRE, $separator, $str);
        $str = trim($str, $separator);
        return $str;
    }
    echo format($subject);
    

    Note that you will lose all characters after a < (because of strip_tags) until you meet a >.
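
    The strip_tags caveat can be seen in isolation with a minimal sketch. The two sample strings are fragments of the subject above; the variable names are only for illustration.

    ```php
    <?php
    // A complete tag is removed and the text around it is kept.
    $closed = strip_tags('upon</span> a time');

    // A "<" that starts something tag-like and is never closed by ">"
    // makes strip_tags() drop what follows it.
    $unclosed = strip_tags('in the year 1874 <or 1875');

    var_dump($closed);
    var_dump($unclosed);
    ```

    If such input must survive, one option is to escape stray < characters before calling strip_tags, but that is a design choice outside the original function.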


    // Old answer, from when I thought you wanted to preserve Greek characters

    It's possible to build a character range such as α-ω, or one from any other characters you want! The reason your pattern doesn't work is that you haven't told the regex engine that you are dealing with a Unicode string. To do that, you must add the u modifier at the end of the pattern, which makes the engine treat both pattern and subject as UTF-8. Like this:

    /[^a-z0-9α-ω]+/u
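
    A small sketch of what the u modifier changes, assuming the source file and the input are both UTF-8. The sample string and variable names are made up for illustration.

    ```php
    <?php
    $str = 'one π two!';

    // With /u the class works on code points: π lies inside α-ω and is kept,
    // while each run of other characters becomes a single dash.
    $withU = preg_replace('/[^a-z0-9α-ω]+/u', '-', $str);

    // Without /u the same class is built from individual bytes, so a
    // multi-byte letter can be cut in half and the result may no longer
    // be valid UTF-8.
    $withoutU = preg_replace('/[^a-z0-9α-ω]+/', '-', $str);

    var_dump($withU);
    var_dump($withU === $withoutU);
    ```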
    

    You can also use the characters' hexadecimal code points:

    /[^a-z0-9\x{3B1}-\x{3C9}]+/u 
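
    As a quick check, the escaped form should behave like the literal range, since U+03B1 is α and U+03C9 is ω. The sketch below uses a hypothetical sample string; the escaped pattern has the side benefit of being plain ASCII, so it does not depend on how the source file is encoded.

    ```php
    <?php
    // Hypothetical sample: Latin, lowercase Greek, uppercase Greek, digits.
    $str = 'abc αβγ ω ΑΒΓ 123';

    $literal = preg_replace('/[^a-z0-9α-ω]+/u', '-', $str);
    $escaped = preg_replace('/[^a-z0-9\x{3B1}-\x{3C9}]+/u', '-', $str);

    // Both patterns describe the same range, so the results should match;
    // the uppercase Greek letters fall outside it and are replaced.
    var_dump($literal === $escaped);
    var_dump($literal);
    ```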
    

    Note that if you are sure your string contains no uppercase Greek characters, or if you want to preserve them as well, you can use the Unicode script property \p{Greek}, which matches any character of the Greek script regardless of case:

    /[^a-z0-9\p{Greek}]+/u
    

    (It's a little longer but more explicit)
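
    To make the difference between the two approaches concrete, here is a sketch of a Greek-preserving variant of the function above. The name greek_slug is hypothetical, and mb_strtolower is my substitution for strtolower (it needs the mbstring extension), so that Greek capitals are lowercased rather than just kept. The second part shows that the α-ω range does not cover accented lowercase letters such as έ (U+03AD), while \p{Greek} does.

    ```php
    <?php
    // Hypothetical helper, not from the original post.
    function greek_slug(string $str, string $separator = '-'): string
    {
        $str = strip_tags($str);
        // Assumption: mbstring is available; strtolower() would leave
        // Greek capitals untouched.
        $str = mb_strtolower($str, 'UTF-8');
        // Keep ASCII letters, digits and anything in the Greek script.
        $str = preg_replace('/[^a-z0-9\p{Greek}]+/u', $separator, $str);
        return trim($str, $separator);
    }

    echo greek_slug('OnCE upon ΑΘΗΝΑ !#% in MEXIco'), "\n";

    // έ is written as an escape so it is certainly the precomposed letter.
    $word = "καλημ\u{03AD}ρα";
    $byRange  = preg_replace('/[^a-z0-9α-ω]+/u', '-', $word);        // έ is outside α-ω
    $byScript = preg_replace('/[^a-z0-9\p{Greek}]+/u', '-', $word);  // έ is Greek script
    var_dump($byRange, $byScript);
    ```

    Which pattern fits depends on the input: the range is stricter, while \p{Greek} is more forgiving with accented text.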