I am trying to create a regular expression for any given string.
Goal: remove ALL characters that are not "latin", "lowercase greek", or "numbers".
What I have done so far: [^a-z0-9]
This works perfectly for Latin characters.
When I try this: [^a-z0-9α-ω]
No luck: it works, BUT it leaves out any other symbol like !!#$%@%#$@,`
My knowledge is limited when it comes to regexp. Any help would be much appreciated!
EDIT:
Posted below is the function that matches the specified characters and creates a slug out of the string, with a dash as the separator character:
$q_separator = preg_quote('-');
$trans = array(
'&.+?;' => '',
'[^a-z0-9 -]' => '',
'\s+' => $separator,
'('.$q_separator.')+' => $separator
);
$str = strip_tags($str);
foreach ($trans as $key => $val){
$str = preg_replace("#".$key."#i", $val, $str);
}
if ($lowercase === TRUE){
$str = strtolower($str);
}
return trim($str, '-');
So if the string is: OnCE upon a tIME !#% @$$ in MEXIco
Using the function the output will be: once-upon-a-time-in-mexico
This works fine, but I want the preg_replace to also exclude Greek characters.
Ok, can this replace your function?
$subject = 'OnCEΨΩ é-+@àupon</span> aαθ tIME !#%@$ in MEXIco in the year 1874 <or 1875';
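function format($str, $excludeRE = '/[^a-z0-9]+/u', $separator = '-') {
    // Drop HTML tags first
    $str = strip_tags($str);
    // Lowercase before matching, so the pattern only needs a-z
    $str = strtolower($str);
    // Replace each run of excluded characters with a single separator
    $str = preg_replace($excludeRE, $separator, $str);
    // Remove separators left at either end
    $str = trim($str, $separator);
    return $str;
}
echo format($subject); // once-upon-a-time-in-mexico-in-the-year-1874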
Note that you will lose all characters after a < (because of strip_tags) until you meet a >.
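To make the strip_tags caveat concrete, here is a minimal sketch. The input string is a hypothetical one, cut down from the sample subject, with a < that is never closed:

```php
<?php
// Hypothetical input: the "<" is never closed by a ">".
$input = 'in the year 1874 <or 1875';

// strip_tags() treats "<or 1875" as the start of a tag and discards it.
$stripped = strip_tags($input);

var_dump($stripped); // string(17) "in the year 1874 "
```

If that text matters to you, one option is to escape or replace stray < characters before calling strip_tags.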
// Old answer, from when I thought you wanted to preserve Greek characters
It's possible to build a character range such as α-ω, or one with any unusual characters you want! The reason your pattern doesn't work is that you don't tell the regex engine that you are dealing with a Unicode string. To do that, you must add the u modifier at the end of the pattern, like this:
/[^a-z0-9α-ω]+/u
You can also write the characters as hexadecimal code points:
/[^a-z0-9\x{3B1}-\x{3C9}]+/u
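As a quick check that the two spellings behave the same, here is a small sketch. The sample string is made up for illustration, and it assumes the source file is saved as UTF-8:

```php
<?php
// Hypothetical sample mixing Latin, Greek, digits and punctuation.
$sample = 'abc αβγ 123 !#%';

// With /u the pattern is read as UTF-8, so α-ω is a range of code points.
$literal = preg_replace('/[^a-z0-9α-ω]+/u', '-', $sample);

// Same class written with hexadecimal code points (U+03B1 to U+03C9).
$escaped = preg_replace('/[^a-z0-9\x{3B1}-\x{3C9}]+/u', '-', $sample);

var_dump($literal === $escaped); // bool(true)
echo $literal, PHP_EOL;          // abc-αβγ-123-
```

The hexadecimal form can be handy when you cannot be sure of the encoding of the file that holds the pattern.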
Note that if you are sure your string contains no uppercase Greek characters, or if you want to preserve them as well, you can use the Unicode script property \p{Greek}, like this:
/[^a-z0-9\p{Greek}]+/u
(It's a little longer but more explicit)
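Putting this together, a variant of the slug function that keeps Greek letters could look roughly like the sketch below. The name format_keep_greek is hypothetical (it is not part of the code above), and the sketch assumes UTF-8 input and that the mbstring extension is available:

```php
<?php
// Hypothetical helper: like format(), but keeps Greek letters.
// Assumes UTF-8 input and the mbstring extension.
function format_keep_greek($str, $separator = '-') {
    $str = strip_tags($str);
    // mb_strtolower() lowercases Greek as well as Latin letters.
    $str = mb_strtolower($str, 'UTF-8');
    // Collapse every run of other characters into one separator.
    $str = preg_replace('/[^a-z0-9\p{Greek}]+/u', $separator, $str);
    return trim($str, $separator);
}

echo format_keep_greek('OnCE ΨΩ upon aαθ tIME !#% in MEXIco'), PHP_EOL;
// once-ψω-upon-aαθ-time-in-mexico
```

Because the string is lowercased first, uppercase Greek is folded to lowercase rather than removed. If uppercase Greek should be dropped instead, you could lowercase only the Latin letters and use the α-ω range. Also note that \p{Greek} matches accented letters such as ά, which the α-ω range does not cover.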