
PHP : issue with idn_to_utf8(). Certain domains are not converted


In a PHP project I use the idn_to_utf8 function to convert domain names from Punycode to Unicode strings.

But sometimes this function returns the Punycode input instead of the Unicode string.

Example :

echo idn_to_utf8('xn--fiq57vn0d561bf5ukfonh1o');
// Returns: xn--fiq57vn0d561bf5ukfonh1o
// It should return: 中島第2駐輪場
echo idn_to_utf8('xn--fiqu6mnndw87c3ucbt0a1ea684a');
// Returns: 中味鋺自転車置場 (as expected)

There are libraries that convert this Punycode correctly (http://idnaconv.phlymail.de/index.php?encoded=xn--fiq57vn0d561bf5ukfonh1o&decode=%3C%3C+Decode&lang=de), but I would prefer to use a built-in PHP function rather than a library.

Do you have any idea what causes this problem?

Edit / Solution and Explanation: To summarize, this code shows the problem:

echo idn_to_ascii('吉津第2自転車置場');
?><br /><?php
echo idn_to_utf8(idn_to_ascii('吉津第2自転車置場'));
?> Should be : 吉津第2自転車置場 <br /><?php

This code displays the following :

xn--2-958a11kws1a96p50fgxenr6afga

吉津第2自転車置場 (Should be) : 吉津第2自転車置場

To be clearer: when we ask for the Punycode of 吉津第2自転車置場, PHP first converts the string to 吉津第2自転車置場 before encoding it. The two "2" characters are different code points, most likely a full-width digit in the input and the ASCII digit in the result. So the idn_to_ascii function cannot preserve every Unicode character, because PHP maps certain Unicode characters to others (in this example, one form of "2" to another), and the round trip through idn_to_utf8 does not give back the original string.
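The mapping described above can be reproduced outside the IDN functions. The following is a minimal sketch, assuming that the differing character is the full-width digit U+FF12 and that the mapping comes from NFKC normalization; `nfkc()` is a hypothetical helper name, not something from the original code.

```php
<?php
// Requires the intl extension (the same one that provides idn_to_utf8()).

// Hypothetical helper: apply Unicode NFKC normalization to a label.
function nfkc(string $label): string
{
    $normalized = Normalizer::normalize($label, Normalizer::FORM_KC);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $normalized;
}

$fullWidthTwo = "\u{FF12}"; // FULLWIDTH DIGIT TWO (assumed to be the character in question)
$asciiTwo     = "2";        // DIGIT TWO (U+0032)

var_dump($fullWidthTwo === $asciiTwo);   // bool(false): different code points
echo bin2hex($fullWidthTwo), "\n";       // efbc92
echo bin2hex(nfkc($fullWidthTwo)), "\n"; // 32
```

If this assumption holds, the two strings look alike on screen but only the normalized one can survive a round trip through idn_to_ascii and idn_to_utf8.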


Solution

  • This works fine. I think the full-width forms of the characters [A-Z0-9] cannot be used.

    echo idn_to_utf8('xn--2-kq6aw43af1e4y9boczagup'); // 中島第2駐輪場
    

    In fact, Chrome automatically converts 中島第2駐輪場.com into 中島第2駐輪場.com (with the normalized "2") before accessing it.

    UPDATED:
    A normalization rule named NAMEPREP seems to be provided: https://www.nic.ad.jp/ja/dom/idn.html

    UPDATED:
    That seems to be invalid... Validation Result
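To see why a particular label is rejected, the optional fourth argument of idn_to_utf8 can be inspected. This is only a sketch: `decode_label()` is a hypothetical helper, and the exact result depends on the PHP and ICU versions installed, so no output is shown here.

```php
<?php
// Requires the intl extension.

// Hypothetical helper: decode a Punycode label and report ICU's error bits.
function decode_label(string $ascii): array
{
    $info = [];
    $utf8 = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46, $info);

    return [
        'decoded' => $utf8,                   // string on success, false on failure
        'errors'  => $info['errors'] ?? null, // bitmask of IDNA_ERROR_* constants
    ];
}

var_dump(decode_label('xn--2-kq6aw43af1e4y9boczagup'));
var_dump(decode_label('xn--fiq57vn0d561bf5ukfonh1o'));
```

A non-zero 'errors' value for the second label would be consistent with the explanation above: the label decodes to a character that normalization would have replaced, so it is not treated as a valid IDN label.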