Tags: php, regex, string, special-characters, mutation

Using accented letters in a regex for string mutations


How can I modify my regex for string mutations so that it also works with accented letters? For example, the mutation for "amor" should be the same as the one for "āmōr". I tried simply including the accented letters, as in (?<=[aeiouāēīōūăĕĭŏŭ]), but that did not work.

My code:

$hyphenation = '~
(?<=[aeiou]) # each syllable contains a vowel
(?:
    # Muta cum liquida
    ( (?:[bcdfgpt]r | [bcfgp] l | ph [lr] | [cpt] h | qu ) [aeiou] x )
  |
    [bcdfghlmnp-tx]
    (?:
        # ct goes together

        [cp] \K (?=t)
      |
        # two or more consonants are split up
        \K (?= [bcdfghlmnp-tx]+ [aeiou]) 
    )   
  |
    # a consonant and a vowel go together
    (?:
        \K (?= [bcdfghlmnp-t] [aeiou])
      | 
        #  "x" goes to the preceding vowel
        x \K (?= [a-z] | (*SKIP)(*F) ) 
    )
  |
    # two vowels are split up, except ae, oe, ...
    \K (?= [aeiou] (?<! ae | oe | au | que | qua | quo | qui ) ) 
)
~xi';


// hyphenation
$result = preg_replace($hyphenation, '-$1', $input);

Solution

  • An accented letter can be represented in several ways in Unicode. For example, ā can be the single code point U+0101 (LATIN SMALL LETTER A WITH MACRON), but it can also be the combination of U+0061 (LATIN SMALL LETTER A) and U+0304 (COMBINING MACRON).

    Consequently, writing (?<=[aeiouāēīōūăĕĭŏŭ]) is correct only if:

    • you use the u modifier to tell the PCRE regex engine that your string and your pattern must be read as UTF-8. Otherwise, multibyte characters are treated as separate bytes rather than as single units. This can produce odd results, particularly when multibyte characters appear inside a character class: for example, without u, [eā]+ matches the first byte of "ē", because ā (bytes C4 81) and ē (bytes C4 93) share the lead byte C4.

    • you are sure that the target string and the pattern use the same form for each letter. If the pattern uses U+0101 and the string uses U+0061 followed by U+0304 for "ā", the match will fail. To prevent this problem, you can apply $str = Normalizer::normalize($str); to the subject string (the default form is NFC, which composes such pairs into single code points). This method comes from the intl extension.
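
    The two conditions above can be checked with a small sketch. It is only an illustration, not the hyphenation pattern itself: the variable names and the test pattern (an "m" preceded by a vowel) are made up for the demonstration, the accented letters are written as "\u{...}" escapes so the result does not depend on how the file was saved (this assumes PHP 7.0 or later), and the normalization step runs only if the intl extension is loaded.

```php
<?php
// Hypothetical demonstration values, not part of the original pattern.
$aMacron = "\u{0101}";            // ā, UTF-8 bytes C4 81
$eMacron = "\u{0113}";            // ē, UTF-8 bytes C4 93

// 1) Effect of the u modifier on a character class.
// Without u, the class [eā] is the byte set {65, C4, 81}.
$half = [];
$byteMode = preg_match("~[e{$aMacron}]+~", $eMacron, $half);
// $byteMode is 1 and $half[0] is the lone byte "\xC4" (half of ē).

// With u, the class holds the code points e and ā, so ē does not match.
$utf8Mode = preg_match("~[e{$aMacron}]+~u", $eMacron);
// $utf8Mode is 0.

// 2) Effect of the Unicode form of the subject string.
$precomposed = "\u{0101}mor";     // "āmor" using U+0101
$decomposed  = "a\u{0304}mor";    // "āmor" using U+0061 + U+0304

// aeiou plus ā ē ī ō ū (breve vowels left out to keep the sketch short).
$vowels  = "aeiou\u{0101}\u{0113}\u{012B}\u{014D}\u{016B}";
$pattern = "~(?<=[{$vowels}])m~u";

$hitPre = preg_match($pattern, $precomposed);   // 1
$hitDec = preg_match($pattern, $decomposed);    // 0: U+0304 precedes "m"

// Normalizing to NFC (the default form) composes a + U+0304 into U+0101.
$hitNorm = null;
if (class_exists('Normalizer')) {               // requires the intl extension
    $normalized = Normalizer::normalize($decomposed);
    $hitNorm = preg_match($pattern, $normalized);   // 1
}

var_dump($byteMode, bin2hex($half[0] ?? ''), $utf8Mode, $hitPre, $hitDec, $hitNorm);
```

    In the original pattern this would presumably mean extending each vowel class, changing the closing modifiers from ~xi to ~xiu, and normalizing $input before calling preg_replace.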

    You can find more information at these links:

    https://en.wikipedia.org/wiki/Unicode_equivalence
    http://utf8-chartable.de/
    http://php.net/manual/en/normalizer.normalize.php
    http://php.net/manual/en/reference.pcre.pattern.modifiers.php
    http://pcre.org/original/pcre.txt