
How to check if a letter is uppercase or lowercase in PHP?


I have UTF-8 texts that also contain characters with diacritics, and I would like to check whether the first letter of a text is uppercase or lowercase. How can I do this?


Solution

  • In my opinion, a single preg_ call is more direct, concise, and reliable than the other solutions posted here.

    echo preg_match('~^\p{Lu}~u', $string) ? 'upper' : 'lower';
    

    My pattern breakdown:

    ~       # starting pattern delimiter
    ^       # match from the start of the input string
    \p{Lu}  # match exactly one uppercase letter (Unicode-safe)
    ~       # ending pattern delimiter
    u       # enable Unicode matching
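
    As a small sketch, the same breakdown can live inside the pattern itself by adding the x (extended) modifier, which makes PCRE ignore whitespace and # comments. The constant and function names below are hypothetical, chosen only for illustration.

```php
<?php
// Same pattern as in the breakdown, written in extended mode (x modifier)
// so that the explanatory comments are part of the pattern string.
const UPPER_FIRST_PATTERN = '~
    ^        # match from the start of the input string
    \p{Lu}   # match exactly one uppercase letter
~xu';        // x: ignore whitespace and # comments, u: UTF-8 / Unicode mode

// Hypothetical helper: true only when the first character is in category Lu.
function starts_with_uppercase(string $string): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example, when the subject is not valid UTF-8).
    return preg_match(UPPER_FIRST_PATTERN, $string) === 1;
}

var_dump(starts_with_uppercase('Éé'));
var_dump(starts_with_uppercase('âa'));
```

    Comparing against 1 with === means an empty string or an invalid UTF-8 subject is simply reported as "not uppercase" rather than being mistaken for a match.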
    

    Notice where ctype_upper() and the < 'a' comparison fail in the following battery of tests.

    Code:

    $tests = ['âa', 'Bbbbb', 'Éé', 'iou', 'Δδ'];
    
    foreach ($tests as $test) {
        $chr = mb_substr($test, 0, 1, 'UTF-8');  // first character, not first byte
    
        echo "\n{$test}:";
        echo "\n\tPREG:  " , preg_match('~^\p{Lu}~u', $test)          ? 'upper' : 'lower';
        echo "\n\tCTYPE: " , ctype_upper($chr)                         ? 'upper' : 'lower';  // byte-based, locale-dependent
        echo "\n\t< a:   " , $chr < 'a'                                ? 'upper' : 'lower';  // byte-wise string comparison
        echo "\n\tMB:    " , mb_strtoupper($chr, 'UTF-8') === $chr     ? 'upper' : 'lower';
    }
    

    Output:

    âa:
        PREG:  lower
        CTYPE: lower
        < a:   lower
        MB:    lower
    Bbbbb:
        PREG:  upper
        CTYPE: upper
        < a:   upper
        MB:    upper
    Éé:               <-- trouble
        PREG:  upper
        CTYPE: lower  <-- uh oh
        < a:   lower  <-- uh oh
        MB:    upper
    iou:
        PREG:  lower
        CTYPE: lower
        < a:   lower
        MB:    lower
    Δδ:               <-- extended beyond question scope
        PREG:  upper  <-- still holding up
        CTYPE: lower
        < a:   lower
        MB:    upper  <-- still holding up
    

    If anyone needs to differentiate between uppercase letters, lowercase letters, and non-letters see this post.
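
    For illustration only (this is a minimal sketch and not necessarily the approach taken in the linked post), a three-way check can be built from the same kind of pattern by testing \p{Lu} and then \p{Ll}. The function name is hypothetical.

```php
<?php
// Hypothetical helper: classifies the first character of a UTF-8 string
// as 'upper', 'lower', or 'other'.
function first_char_case(string $string): string
{
    if (preg_match('~^\p{Lu}~u', $string) === 1) {
        return 'upper';
    }
    if (preg_match('~^\p{Ll}~u', $string) === 1) {
        return 'lower';
    }
    // Digits, punctuation, letters without case (e.g. CJK), titlecase
    // letters, empty strings, and invalid UTF-8 all end up here.
    return 'other';
}

foreach (['Éé', 'âa', '1st', '中文'] as $test) {
    echo "{$test}: ", first_char_case($test), "\n";
}
```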


    It may be extending the scope of this question too far, but if your input characters are especially squirrelly (they might not belong to a category that Lu covers), you may want to check whether the first character has case variants at all:

    \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
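
    A minimal sketch of such a check, assuming the short form \p{L&} (the long name \p{Cased_Letter} may not be recognized by every PCRE build, so it is avoided here). The function name is hypothetical.

```php
<?php
// Hypothetical helper: true when the first character is a cased letter,
// i.e. it belongs to Lu, Ll, or Lt.
function first_char_is_cased(string $string): bool
{
    return preg_match('~^\p{L&}~u', $string) === 1;
}

// U+01C5 is a titlecase letter (Lt): it is cased, but \p{Lu} does not match it.
$titlecase = "\u{01C5}";

var_dump(first_char_is_cased('Éé'));                 // cased (Lu)
var_dump(first_char_is_cased($titlecase));           // cased (Lt)
var_dump(preg_match('~^\p{Lu}~u', $titlecase) === 1); // Lu alone misses Lt
var_dump(first_char_is_cased('中文'));                // letter, but no case
```

    Running this check first makes it possible to report "no case" separately, instead of folding every non-Lu character into "lower".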

    To also count the capital Roman numerals as uppercase (they belong to the Nl "Letter Number" category rather than Lu, and each one has a SMALL variant), add their range, U+2160 to U+216F, to the pattern if necessary.

    https://www.fileformat.info/info/unicode/category/Nl/list.htm

    Code:

    echo preg_match('~^[\p{Lu}\x{2160}-\x{216F}]~u', $test) ? 'upper' : 'not upper';
    

    Update: mb_ucfirst() and mb_lcfirst() were added to the language in PHP 8.4, so on that version or later there may be a reliable non-regex approach.

    echo mb_ucfirst($test) === $test ? 'upper (or first character is not a letter)' : 'lower';
    

    Or

    echo mb_lcfirst($test) !== $test ? 'upper' : 'lower (or first character is not a letter)';
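
    On PHP versions without mb_ucfirst() and mb_lcfirst(), a similar non-regex check can be sketched with mb_strtoupper() and mb_strtolower(). This is an assumption-laden illustration rather than an equivalent of the functions above: it requires the mbstring extension, the function name is hypothetical, and a lowercase letter that has no uppercase mapping is reported as 'other'.

```php
<?php
// Hypothetical helper: classifies the first character by comparing it with
// its own uppercase and lowercase mappings (no regex involved).
function first_char_case_mb(string $string): string
{
    $first = mb_substr($string, 0, 1, 'UTF-8');
    $upper = mb_strtoupper($first, 'UTF-8');
    $lower = mb_strtolower($first, 'UTF-8');

    if ($upper === $lower) {
        return 'other';  // no case variants: digit, punctuation, empty string, ...
    }
    if ($first === $upper) {
        return 'upper';
    }
    if ($first === $lower) {
        return 'lower';
    }
    return 'other';      // e.g. a titlecase letter, which equals neither mapping
}

foreach (['Éé', 'âa', 'Δδ', '1st'] as $test) {
    echo "{$test}: ", first_char_case_mb($test), "\n";
}
```

    Unlike a bare mb_strtoupper($chr) === $chr comparison, this does not report digits or punctuation as uppercase.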
    

    PHP RFC: Multibyte for ucfirst, lcfirst functions, mb_ucfirst mb_lcfirst

    I'll need to check whether there are locale-specific caveats, as with Vietnamese.