How can I modify my regex for string mutations so that it also works for accented letters? For example, the string mutation for "amor" should be the same as the one for "āmōr". I tried simply adding the accented letters, as in (?<=[aeiouāēīōūăĕĭŏŭ]), but that did not work.
My code:
$hyphenation = '~
(?<=[aeiou]) # each syllable contains a vowel
(?:
# Muta cum liquida
( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
|
[bcdfghlmnp-tx]
(?:
# ct goes together
[cp] \K (?=t)
|
# two or more consonants are split up
\K (?= [bcdfghlmnp-tx]+ [aeiou])
)
|
# a consonant and a vowel go together
(?:
\K (?= [bcdfghlmnp-t] [aeiou])
|
# "x" goes to the preceding vowel
x \K (?= [a-z] | (*SKIP)(*F) )
)
|
# two vowels are split up except ae oe...
\K (?= [aeiou] (?<! ae | oe | au | que | qua | quo | qui ) )
)
~xi';
// hyphenation
$result = preg_replace($hyphenation, '-$1', $input);
An accented letter can be represented in several ways in Unicode. For example, ā can be the single code point U+0101 (LATIN SMALL LETTER A WITH MACRON), but it can also be the combination of U+0061 (LATIN SMALL LETTER A) and U+0304 (COMBINING MACRON).
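A minimal sketch of the two representations, assuming PHP 7 or later (for the "\u{...}" escape syntax). The variable names are only illustrative.

```php
<?php
// Two encodings of the same visible letter "ā".
$precomposed = "\u{0101}";   // U+0101 LATIN SMALL LETTER A WITH MACRON
$decomposed  = "a\u{0304}";  // U+0061 followed by U+0304 COMBINING MACRON

// They render the same but are different byte sequences in UTF-8.
var_dump($precomposed === $decomposed); // bool(false)
echo bin2hex($precomposed), "\n";       // c481
echo bin2hex($decomposed), "\n";        // 61cc84
```

A pattern written with one form will not match a subject written with the other unless one of them is converted first.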
Consequently, writing (?<=[aeiouāēīōūăĕĭŏŭ]) is correct if:
you use the u modifier to tell the PCRE regex engine that your string and your pattern must be read as UTF-8. Otherwise multi-byte characters are treated as separate bytes and not as something atomic. This can produce weird results, in particular when multi-byte characters are inside a character class. For example, without u, [eā]+ will match the first byte of "ē", because ā (0xC4 0x81) and ē (0xC4 0x93) share the same leading byte.
you are sure that the target string and the pattern use the same form for each letter. If the pattern uses U+0101 and the string uses U+0061 with U+0304 for "ā", it will not work. To prevent this problem, you can apply $str = Normalizer::normalize($str); to the subject string. By default this converts the string to the composed form (NFC), which matches a pattern written with precomposed letters. This method comes from the intl extension.
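The two conditions above can be sketched together as follows. This is only an illustration under some assumptions: the source file is saved as UTF-8, PHP 7 or later is used, and the pattern is a deliberately simplified stand-in for the full hyphenation pattern (it only breaks between a vowel and a consonant followed by a vowel). The helper to_nfc() is a hypothetical name, not part of any library.

```php
<?php
// Hypothetical helper: convert the subject to NFC so that it uses the same
// precomposed code points as the pattern. Needs the intl extension.
function to_nfc(string $s): string
{
    if (!class_exists('Normalizer')) {
        return $s; // intl not loaded: leave the string unchanged
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    return $normalized === false ? $s : $normalized;
}

// Simplified stand-in pattern; note the "u" modifier next to "i".
$pattern = '~(?<=[aeiouāēīōūăĕĭŏŭ])(?=[bcdfghlmnp-t][aeiouāēīōūăĕĭŏŭ])~ui';

// "āmōr" written with combining macrons (decomposed form).
$input = "a\u{0304}mo\u{0304}r";

// With intl loaded, the normalized subject is split after the first vowel.
echo preg_replace($pattern, '-', to_nfc($input)), "\n";

// Without "u", the class [eā] is just the bytes "e", 0xC4, 0x81,
// so it matches the first byte of "ē" (0xC4 0x93).
preg_match('~[eā]+~', 'ē', $m);
echo bin2hex($m[0]), "\n"; // c4
```

In the original pattern the same idea would apply: add u to the modifiers (~xiu), extend each [aeiou] class with the accented vowels, and normalize $input before calling preg_replace.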
You can find more information at these links:
https://en.wikipedia.org/wiki/Unicode_equivalence
http://utf8-chartable.de/
http://php.net/manual/en/normalizer.normalize.php
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://pcre.org/original/pcre.txt