php string unicode utf-8

How would you create a string of all UTF-8 characters?


There are many ways to represent each of the more than 1 million Unicode code points. Take the Latin capital letter "A" with macron (Ā). This is Unicode code point U+0100; encoded as UTF-8 it is the two bytes 0xc4 0x80 in hex, 196 128 in decimal, and 11000100 10000000 in binary.

I would like to create a collection of the first 65,536 Unicode characters, encoded as UTF-8, for use in testing applications. These are all the code points from U+0000 through U+FFFF, each of which needs at most three bytes in UTF-8.

Is it possible to do something like a for($x=0) loop and then convert the resulting decimal to another base (like hex) which would allow the creation of the matching unicode character?

I can create the value Ā using something like this:

$char = "\xc4\x80";
// or
$char = chr(196).chr(128);

However, I am not sure how to turn this into an automated process.

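// fail! With $a = 196 and $b = 128 this produces the literal text \xc4\x80
// (eight characters), because "\x" is only an escape when hex digits follow it
// inside the string literal itself.
$char = "\x". dechex($a). "\x". dechex($b);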

Solution

  • You can leverage iconv (or a few other functions) to convert a code point number to a UTF-8 string:

    function unichr($i)
    {
        return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
    }
    
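    // Collect every code point from U+0000 through U+FFFF, skipping the surrogates.
    $codeunits = array();
    for ($i = 0; $i < 0xD800; $i++)
        $codeunits[] = unichr($i);
    // Use <= so that U+FFFF itself is included.
    for ($i = 0xE000; $i <= 0xFFFF; $i++)
        $codeunits[] = unichr($i);
    $all = implode($codeunits);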
    

    (I avoided the surrogate range 0xD800–0xDFFF as they aren't valid to put in UTF-8 themselves; that would be “CESU-8”.)
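
If iconv is not available, or you want to see where the bytes 0xc4 0x80 for U+0100 come from, here is a minimal sketch that does the UTF-8 encoding by hand. The names codepoint_to_utf8() and build_bmp_string() are hypothetical helpers made up for this illustration, not built-in PHP functions, and the sketch only handles code points up to U+FFFF (one to three bytes each).

```php
<?php
// Hypothetical helper: encode one code point (U+0000..U+FFFF, no surrogates)
// as a UTF-8 byte string using plain bit arithmetic.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0xFFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Unsupported code point: $cp");
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: concatenate every code point from U+0000 through
// U+FFFF, skipping the surrogate range U+D800..U+DFFF.
function build_bmp_string(): string
{
    $out = '';
    for ($cp = 0; $cp <= 0xFFFF; $cp++) {
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue;
        }
        $out .= codepoint_to_utf8($cp);
    }
    return $out;
}

$all = build_bmp_string();
```

With this sketch, bin2hex(codepoint_to_utf8(0x0100)) gives c480, matching the bytes in the question, and strlen($all) comes out at 188288 bytes for 63,488 code points. If the mbstring extension is installed, mb_chr($i, 'UTF-8') (PHP 7.2 and later) should be usable in place of the hand-written encoder.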