
Perl: Homograph attacks. Is it possible to compare visually similar ASCII and non-ASCII strings?


I ran into the so-called "homograph attack" and I want to reject domains whose decoded punycode visually appears to be alphanumeric only. For example, www.xn--80ak6aa92e.com is displayed as www.apple.com in the browser (Firefox). The domains look the same, but the character set is different. Chrome has already patched this, and the browser displays the punycode instead.

I have an example below.

#!/usr/bin/perl

use strict;
use warnings;

use utf8;
use open ':std', ':encoding(UTF-8)';   # print decoded IDNs without "Wide character" warnings
use Net::IDN::Encode ':all';


my $testdomain = "www.xn--80ak6aa92e.com";
my $IDN = domain_to_unicode($testdomain);
my $visual_result_ascii = "www.apple.com";

print "S1: $IDN\n";
print "S2: $visual_result_ascii\n";
print "MATCH\n" if $IDN eq $visual_result_ascii;

Visually they are the same, but they won't match. Is it possible to compare a Unicode string ($IDN) against a visually identical alphanumeric string?


Solution

  • After some research, and thanks to your comments, I have reached a conclusion. Most of the problems come from Cyrillic. This script contains many characters that are visually similar to Latin letters, and they can be combined in many ways.

    I have identified some scam IDN domains that include these names:

    "аррӏе" "сһаѕе" "сіѕсо"
    

    Maybe here, with this font, you can see a difference, but in the browser there is absolutely no visual difference.

    Consulting https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode I was able to create a table of 12 visually similar characters.

    Update: I found 4 more Latin-like characters in the Cyrillic set, 16 in total now.

    These can be combined in many ways to create IDNs that look 100% identical to legitimate domains. Each row below gives the code point, the Latin letter it resembles, and the Unicode name.

    0430 a CYRILLIC SMALL LETTER A
    0441 c CYRILLIC SMALL LETTER ES
    0501 d CYRILLIC SMALL LETTER KOMI DE
    0435 e CYRILLIC SMALL LETTER IE
    04bb h CYRILLIC SMALL LETTER SHHA 
    0456 i CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I 
    0458 j CYRILLIC SMALL LETTER JE
    043a k CYRILLIC SMALL LETTER KA
    04cf l CYRILLIC SMALL LETTER PALOCHKA 
    043e o CYRILLIC SMALL LETTER O
    0440 p CYRILLIC SMALL LETTER ER
    051b q CYRILLIC SMALL LETTER QA 
    0455 s CYRILLIC SMALL LETTER DZE
    051d w CYRILLIC SMALL LETTER WE 
    0445 x CYRILLIC SMALL LETTER HA
    0443 y CYRILLIC SMALL LETTER U
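
    To answer the original question more directly, here is a minimal sketch of one way to compare a decoded string against its ASCII look-alike: transliterate the 16 letters from the table to the Latin letters they resemble, then compare with eq. The helper name to_latin_skeleton is made up for this example, and the mapping covers only the letters listed above, so it is a starting point rather than a complete confusables check.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: replace each look-alike Cyrillic letter from the
    # table with the Latin letter it resembles. Other characters are untouched.
    sub to_latin_skeleton {
        my ($text) = @_;
        (my $skeleton = $text) =~
            tr/\x{0430}\x{0441}\x{0501}\x{0435}\x{04bb}\x{0456}\x{0458}\x{043a}\x{04cf}\x{043e}\x{0440}\x{051b}\x{0455}\x{051d}\x{0445}\x{0443}/acdehijklopqswxy/;
        return $skeleton;
    }

    # Already-decoded form of www.xn--80ak6aa92e.com, written with escapes.
    my $decoded = "www.\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}.com";
    print "MATCH\n" if to_latin_skeleton($decoded) eq "www.apple.com";
    ```

    In a real check the input would come from domain_to_unicode, as in the question, instead of a hard-coded string.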
    

    The problem occurs in the second-level domain (SLD). Extensions (TLDs) can also be IDNs, but they are verified, cannot be spoofed, and are not subject to this issue. The domain registrar checks that all letters come from the same set, and an IDN is not accepted if it mixes Latin and non-Latin characters, so extra validation for mixed scripts is pointless.

    My idea is simple. We split the domain and decode only the SLD part, then match its characters against the list of visually similar Cyrillic letters. If every letter is visually similar to a Latin one, the domain is almost certainly a scam.
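
    Purely as an illustration of what that same-set rule means, here is a small sketch using Perl's Unicode script properties. The function name is_mixed_script is hypothetical, and the check covers only Latin versus Cyrillic; it is not a description of how any registrar actually validates names.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: true when a decoded label contains at least one
    # Latin letter and at least one Cyrillic letter.
    sub is_mixed_script {
        my ($label) = @_;
        return ( $label =~ /\p{Latin}/ && $label =~ /\p{Cyrillic}/ ) ? 1 : 0;
    }

    # Cyrillic "a" followed by Latin "pple" mixes two scripts.
    print is_mixed_script("\x{0430}pple") ? "mixed\n" : "single script\n";
    # Plain ASCII uses Latin only.
    print is_mixed_script("apple") ? "mixed\n" : "single script\n";
    ```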

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use Net::IDN::Encode ':all';
    use Array::Utils qw(:all);
    
    my @latinlike_cyrillics = qw (0430 0441 0501 0435 04bb 0456 0458 043a 04cf 043e 0440 051b 0455 051d 0445 0443);
    
    # maybe you can find better examples
    my $domain1 = "www.xn--80ak6aa92e.com";
    my $domain2 = "www.xn--d1acpjx3f.xn--p1ai";
    
    test_domain ($domain1);
    test_domain ($domain2);
    
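    sub test_domain {
        my $testdomain = shift;

        # Take the label just left of the TLD as the SLD, so that domains
        # with or without a "www." prefix are both handled.
        my @labels = split /\./, $testdomain;
        my $sLD    = @labels >= 2 ? $labels[-2] : $labels[0];
        my $IDN    = domain_to_unicode($sLD);

        # Code point of each decoded character, as 4-digit lowercase hex.
        my @decoded = map { sprintf("%04x", ord($_)) } split(//, $IDN);

        # Characters that are NOT in the look-alike list.
        my @checker = array_minus( @decoded, @latinlike_cyrillics );
        if (@checker) { print "$testdomain [$IDN] seems to be ok\n" }
        else          { print "$testdomain [$IDN] is possibly scam\n" }
    }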
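
    If you would rather not depend on Array::Utils, the same test can be sketched with a regex character class built from the 16 code points. This is only a variant under stated assumptions: suspicious_labels is a made-up name, the input is expected to be already decoded (for example with domain_to_unicode), and every label except the last one is checked instead of a single SLD.

    ```perl
    use strict;
    use warnings;

    # The 16 Latin-like Cyrillic letters from the table, as a character class.
    my $LOOKALIKE = qr/[\x{0430}\x{0441}\x{0501}\x{0435}\x{04bb}\x{0456}\x{0458}\x{043a}\x{04cf}\x{043e}\x{0440}\x{051b}\x{0455}\x{051d}\x{0445}\x{0443}]/;

    # Hypothetical helper: return the labels of an already-decoded domain
    # that consist entirely of look-alike letters. The last label (the TLD)
    # is skipped, following the reasoning above.
    sub suspicious_labels {
        my ($unicode_domain) = @_;
        my @labels = split /\./, $unicode_domain;
        pop @labels;
        return grep { /\A$LOOKALIKE+\z/ } @labels;
    }

    my @hits = suspicious_labels("www.\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}.com");
    print scalar(@hits), " suspicious label(s)\n";
    ```

    A label that mixes look-alike letters with anything else (digits, hyphens, other Cyrillic letters) is not reported, which matches the behaviour of the array_minus version.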