node.js express middleware idn punycode

Replace hostname in Node.js + Express.js to decode punycode domain


I'm developing a web app on a Cyrillic domain. Currently, the domain hosts a parked page saying the site is under construction. If I open it in Chrome, I see Punycode in the address bar; Safari decodes it, though. For development purposes, I have modified my /etc/hosts file so that I can access localhost via a test Cyrillic domain. However, both Chrome and Safari fail to decode that hostname.

I have looked into this issue but could not find a sensible solution. There is a Node.js module called punycode. Now, if my req.url contains Cyrillic characters, it arrives percent-encoded, so I've written a middleware to decode it:

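app.use(function(req, res, next) {
    var url = req.url,
        decoded;

    // decodeURIComponent throws a URIError on malformed percent-encoding
    try {
        decoded = decodeURIComponent(url);
    } catch (err) {
        return next(err);
    }

    if (url !== decoded) req.url = decoded;
    next();
});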

It works fine, and I can use Cyrillic routes now. But when I try to apply the same logic to the hostname, it doesn't work:

app.use(function(req, res, next) {
    var hostname = req.hostname,
        decoded = punycode.toUnicode(hostname);

    if (hostname !== decoded) req.hostname = decoded;
    // I have also tried return res.redirect('https://' + decoded + ':' + ...);
    next();
});

Any help is very much appreciated. Thanks!


Solution

  • OK, so after some research, I figured out that it's pretty much impossible. How a hostname is displayed in the address bar is decided by the browser, and for IDNs these display policies exist to prevent phishing. Safari decoded my domain from Punycode to Unicode; Chrome did not.

    Such phishing attacks become possible with domains that contain non-ASCII characters. Consider the ASCII letters "o, e, a" and the Cyrillic letters "о, е, а". They look practically the same and are therefore indistinguishable to the user. So an attacker may register a domain that looks just like a well-known one ("paypal.com" with ASCII "a" versus "pаypаl.com" with Cyrillic "а").

    To prevent such attacks, Chrome displays these non-ASCII hostnames as Punycode ("pаypаl.com" with Cyrillic "а" appears as "xn--pypl-53dc.com" in the address bar, which warns the user that it's not the original website).

    Sigh, seems like IDNs are not the best solution so far.
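
For what it's worth, the server can still decode the hostname for its own use (logging, templates, routing by tenant), even though it cannot change what the browser displays. Below is a minimal sketch using Node's built-in url.domainToASCII and url.domainToUnicode. The names decodeHostname and unicodeHostname are made up for this example and are not part of Express. As far as I can tell, Express exposes req.hostname through a getter only, which would explain why assigning to it in the question has no effect, so the sketch writes to a separate property instead.

```javascript
const url = require('url');

// The look-alike domain from the answer: Latin "p", "y", "l" plus
// Cyrillic "а" (U+0430), written with escapes so the difference is visible.
const lookalike = 'p\u0430yp\u0430l.com';

// Unicode -> Punycode: the form Chrome shows for this domain.
const ascii = url.domainToASCII(lookalike);
console.log(ascii); // xn--pypl-53dc.com

// Punycode -> Unicode: the reverse conversion, done on the server side.
console.log(url.domainToUnicode(ascii) === lookalike); // true

// Express-style middleware (hypothetical name). It leaves req.hostname
// untouched and stores the decoded form on a separate, made-up property.
function decodeHostname(req, res, next) {
  req.unicodeHostname = url.domainToUnicode(req.hostname || '');
  next();
}
```

Assuming an Express app, it would be registered with app.use(decodeHostname), and later handlers could read req.unicodeHostname. This only affects what the server sees; the address bar still follows the browser's own IDN display policy.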