Search code examples
.htaccessunicodeutf-8cyrillic

Cyrillic in htaccess to block Analytics spam


I have a lot of spam on Google Analytics coming from different domains, and one of them is cyrillic encoded, so I'm in trouble to add it to my .htaccess file.

I want to add с.новым.годом.рф to the .htaccess file to block it, but I don't know how to do that, because the file when saved don't preserve the cyrilic characters.

RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)?с.новым.годом.рф.*$ [NC]

is converted to

RewriteCond %{HTTP_REFERER} ^http(s)?://(www\.)??.?????.?????.??.*$ [NC]

I have searched a way to convert cyrillic to unicode but I had no success. Any suggestion?

Thanks


Solution

  • HTTP headers can't include arbitrary raw Unicode characters, so the Referer header contains an ASCII URI rather than Cyrillic characters in an IRI.

    So you need to use the URI-form in the rule to match. To convert an IRI to a URI you use URL-UTF-8 encoding on path parts, and the IDN algorithm on the hostname.

    eg using Python:

    >>> u'с.новым.годом.рф'.encode('idna')
    'xn--q1a.xn--b1aube0e.xn--c1acygb.xn--p1ai'
    

    So:

    RewriteCond %{HTTP_REFERER} ^https?://(www\.)?xn--q1a\.xn--b1aube0e\.xn--c1acygb\.xn--p1ai.*$ [NC]
    

    It would still be a good idea to find a text editor for your .htaccess files that doesn't destroy perfectly good Unicode characters though.