There are many ways to represent the more than one million Unicode characters. Take the Latin capital letter "A" with macron (Ā). This is Unicode code point U+0100, which UTF-8 encodes as the two bytes 0xc4 0x80 in hex, 196 128 in decimal, and 11000100 10000000 in binary.
I would like to create a collection of the first 65,535 Unicode characters, encoded as UTF-8, for use in testing applications. These are all the Unicode characters up to code point U+FFFF (at most three bytes each in UTF-8).
Is it possible to do something like a for($x=0) loop and then convert the resulting decimal to another base (like hex), which would allow the creation of the matching Unicode character?
I can create the value Ā using something like this:
$char = "\xc4\x80";
// or
$char = chr(196).chr(128);
However, I am not sure how to turn this into an automated process.
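// fail! "\x" is only an escape inside a string literal, so this yields a literal backslash, "x", and hex digits, not a byte
$char = "\x" . dechex($a) . "\x" . dechex($b);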
You can use iconv (or a few other functions) to convert a code point number to a UTF-8 string:
function unichr($i)
{
    // Pack the code point as a 32-bit little-endian integer (UCS-4LE), then convert it to UTF-8
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
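$codeunits = array();
// U+0000 to U+D7FF: everything below the surrogate range
for ($i = 0; $i < 0xD800; $i++) {
    $codeunits[] = unichr($i);
}
// U+E000 to U+FFFE; change the bound to <= 0xFFFF to include U+FFFF as well
for ($i = 0xE000; $i < 0xFFFF; $i++) {
    $codeunits[] = unichr($i);
}
// Join everything into a single UTF-8 string
$all = implode($codeunits);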
(I avoided the surrogate range 0xD800–0xDFFF as they aren't valid to put in UTF-8 themselves; that would be “CESU-8”.)
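If iconv is not available, the same conversion can be done by hand, which also ties back to the chr(196).chr(128) form in the question. The following is only a minimal sketch: the function name codepoint_to_utf8 is hypothetical and not part of PHP or of the answer above, and it assumes the input is an integer code point. It splits the code point into 6-bit groups and adds the UTF-8 lead and continuation markers.

```php
<?php
// Hypothetical helper (not from the original answer): encode one code point
// as UTF-8 by building the bytes with chr(), as in chr(196).chr(128).
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the surrogate range 0xD800-0xDFFF.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value: $cp");
    }
    if ($cp < 0x80) {            // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// U+0100 should give the two bytes from the question.
var_dump(codepoint_to_utf8(0x0100) === "\xc4\x80"); // bool(true)
```

A helper like this could stand in for unichr() in the loops above. Only the first three branches are reached for code points up to U+FFFF; the four-byte branch is there for completeness.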