
Is it possible to write my own Punycode converter in PHP without the intl extension?


I don't have enough control over the remote server to install extensions, and it runs PHP 5.3.8. But I've noticed that it is possible to split a UTF-8 string with PCRE.

So for example: preg_split('@@u','bücher',-1,PREG_SPLIT_NO_EMPTY);

gives: Array ( [0] => b, [1] => ├╝, [2] => c, [3] => h, [4] => e, [5] => r )

or, for the Chinese string 中国/中华, it gives: Array ( [0] => ńŞş, [1] => ňŤŻ, [2] => /, [3] => ńŞş, [4] => ňŹÄ )

(The results above were copied from a non-Unicode display.) Still, it is clear that a UTF-8 string can be split without the internationalization extension, and then (I think) it should be possible to get the character codes and do calculations with them to create an ASCII URL.
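To make the "get character codes" step concrete, here is a minimal sketch that turns each piece returned by preg_split into its Unicode code point using only ord() and bit operations. The function name utf8_char_to_codepoint is made up for illustration, and the sketch assumes each piece is a single well-formed UTF-8 character (PCRE's u modifier already rejects invalid UTF-8). It does not implement Punycode itself, only the code point extraction that a Punycode encoder would need as input.

```php
<?php
// Hypothetical helper (not a built-in): decode ONE UTF-8 encoded character
// to its integer code point. Returns false for input it cannot decode.
function utf8_char_to_codepoint($ch)
{
    if ($ch === '') {
        return false;
    }
    $b0 = ord($ch[0]);
    if ($b0 < 0x80) {                    // 0xxxxxxx: single byte (ASCII)
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {         // 110xxxxx: two-byte sequence
        $cp = $b0 & 0x1F;
        $extra = 1;
    } elseif (($b0 & 0xF0) === 0xE0) {   // 1110xxxx: three-byte sequence
        $cp = $b0 & 0x0F;
        $extra = 2;
    } elseif (($b0 & 0xF8) === 0xF0) {   // 11110xxx: four-byte sequence
        $cp = $b0 & 0x07;
        $extra = 3;
    } else {
        return false;                    // stray continuation or invalid lead byte
    }
    if (strlen($ch) !== $extra + 1) {
        return false;                    // wrong number of bytes for this lead byte
    }
    for ($i = 1; $i <= $extra; $i++) {
        $b = ord($ch[$i]);
        if (($b & 0xC0) !== 0x80) {      // continuation bytes must be 10xxxxxx
            return false;
        }
        $cp = ($cp << 6) | ($b & 0x3F);  // append the 6 payload bits
    }
    return $cp;
}

// "\xC3\xBC" is the UTF-8 encoding of "ü", written as escapes so the
// example does not depend on the encoding of the source file.
$chars = preg_split('@@u', "b\xC3\xBCcher", -1, PREG_SPLIT_NO_EMPTY);
$codepoints = array_map('utf8_char_to_codepoint', $chars);
```

For "bücher" this yields the code points 98, 252, 99, 104, 101, 114; the non-ASCII value 252 (U+00FC) is what a Punycode encoder would then have to encode.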


Solution

  • The only thing you need to know is the bitmasks that signal two-, three-, and four-byte code points:

    Table from http://en.wikipedia.org/wiki/UTF-8

    Bits  Last Code Point  Octet 1  Octet 2  Octet 3  Octet 4
    
     7    U+007F           0xxxxxxx    -/-      -/-      -/-
    11    U+07FF           110xxxxx 10xxxxxx    -/-      -/-
    16    U+FFFF           1110xxxx 10xxxxxx 10xxxxxx    -/-
    21    U+10FFFF         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    

    I don't speak PHP, but I'm quite sure existing code can be found that uses the bitmasks shown above to scan a UTF-8 character sequence without actually interpreting it.
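As a rough illustration of such a scan, the sketch below walks a byte string and uses only the lead-byte masks from the table to find where each code point starts and ends, without decoding anything. Both function names are hypothetical, and the sketch assumes plain byte-oriented strlen/substr (that is, no mbstring function overloading). It checks lead bytes and truncation only; it does not verify that the following bytes really are 10xxxxxx continuation bytes.

```php
<?php
// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with the given lead byte, or 0 if the byte cannot start a sequence.
function utf8_sequence_length($byte)
{
    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
}

// Hypothetical helper: split a UTF-8 byte string into per-character
// substrings by looking at lead bytes only. Returns false on malformed
// or truncated input.
function utf8_scan($s)
{
    $out = array();
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $len = utf8_sequence_length(ord($s[$i]));
        if ($len === 0 || $i + $len > $n) {
            return false;
        }
        $out[] = substr($s, $i, $len); // copy the bytes without interpreting them
        $i += $len;
    }
    return $out;
}

// Same input as in the question, with "ü" written as its UTF-8 bytes.
$parts = utf8_scan("b\xC3\xBCcher");
```

This gives the same six pieces as the preg_split call in the question, so it could stand in for PCRE if the u modifier were unavailable.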