I ran into the so-called "homograph attack", and I want to reject domains whose decoded Punycode looks like plain alphanumeric text. For example, www.xn--80ak6aa92e.com is displayed as www.apple.com in the browser (Firefox). The two domains look identical, but the underlying characters are different. Chrome has already patched this and displays the Punycode instead.
Here is an example:
#!/usr/bin/perl
use strict;
use warnings;
use Net::IDN::Encode ':all';
use utf8;
use open ':std', ':encoding(UTF-8)'; # avoids "Wide character" warnings when printing $IDN
my $testdomain = "www.xn--80ak6aa92e.com";
my $IDN = domain_to_unicode($testdomain);
my $visual_result_ascii = "www.apple.com";
print "S1: $IDN\n";
print "S2: $visual_result_ascii\n";
print "MATCH\n" if ($IDN eq $visual_result_ascii);
They look the same, but they don't match. Is it possible to compare a Unicode string ($IDN) against an alphanumeric string that looks the same?
After some research, and thanks to your comments, I have reached a conclusion. Most of the problems come from Cyrillic: it contains many characters that look like Latin letters, and they can be combined in many ways.
I have identified some scammy IDN domains including these names:
"аррӏе" "сһаѕе" "сіѕсо"
With this font you may be able to see a difference here, but in the browser there is absolutely no visual difference.
Consulting https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode I was able to create a table with 12 visually similar characters.
Update: I found 4 more Latin-like characters in Cyrillic charset, 16 in total now.
These characters can be combined in many ways to create IDNs that look exactly like legitimate domains.
0430 a CYRILLIC SMALL LETTER A
0441 c CYRILLIC SMALL LETTER ES
0501 d CYRILLIC SMALL LETTER KOMI DE
0435 e CYRILLIC SMALL LETTER IE
04bb h CYRILLIC SMALL LETTER SHHA
0456 i CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
0458 j CYRILLIC SMALL LETTER JE
043a k CYRILLIC SMALL LETTER KA
04cf l CYRILLIC SMALL LETTER PALOCHKA
043e o CYRILLIC SMALL LETTER O
0440 p CYRILLIC SMALL LETTER ER
051b q CYRILLIC SMALL LETTER QA
0455 s CYRILLIC SMALL LETTER DZE
051d w CYRILLIC SMALL LETTER WE
0445 x CYRILLIC SMALL LETTER HA
0443 y CYRILLIC SMALL LETTER U
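As a rough illustration of how this table could answer the original question (comparing a decoded IDN against a plain ASCII name), here is a minimal sketch that transliterates each of the 16 Cyrillic characters to the Latin letter it resembles. The helper name latin_skeleton is made up for this example, and the input is assumed to be an already-decoded label (for instance the output of domain_to_unicode).

```perl
use strict;
use warnings;

# Hypothetical helper: replace every Latin-like Cyrillic letter from the
# table with the Latin letter it resembles. Other characters are left as-is.
sub latin_skeleton {
    my ($label) = @_;
    (my $skeleton = $label) =~
        tr/\x{0430}\x{0441}\x{0501}\x{0435}\x{04bb}\x{0456}\x{0458}\x{043a}\x{04cf}\x{043e}\x{0440}\x{051b}\x{0455}\x{051d}\x{0445}\x{0443}/acdehijklopqswxy/;
    return $skeleton;
}

# U+0430 U+0440 U+0440 U+04CF U+0435 is the Cyrillic look-alike of "apple".
my $decoded = "\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}";
print "MATCH\n" if latin_skeleton($decoded) eq 'apple';
```

After the transliteration, an ordinary eq comparison against the ASCII name works, which the raw decoded string cannot do.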
The problem occurs in the second-level domain (SLD). Extensions (TLDs) can also be IDNs, but they are verified and cannot be spoofed, so they are not affected by this issue. The domain registrar checks that all letters come from the same character set, and an IDN is not accepted if it mixes Latin and non-Latin characters, so extra validation for mixed labels is pointless.
My idea is simple: split the domain, decode only the SLD, and match it against the list of Latin-like Cyrillic characters. If every letter is visually similar to a Latin letter, the domain is almost certainly a scam.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
use Net::IDN::Encode ':all';
use Array::Utils qw(:all);
my @latinlike_cyrillics = qw (0430 0441 0501 0435 04bb 0456 0458 043a 04cf 043e 0440 051b 0455 051d 0445 0443);
# maybe you can find better examples
my $domain1 = "www.xn--80ak6aa92e.com";
my $domain2 = "www.xn--d1acpjx3f.xn--p1ai";
test_domain ($domain1);
test_domain ($domain2);
sub test_domain {
    my $testdomain = shift;
    # Assumes exactly three labels: subdomain, second-level domain, TLD
    my ($subdomain, $sLD, $topLD) = split(/\./, $testdomain);
    my $IDN = domain_to_unicode($sLD);
    # Hex code point of every character in the decoded SLD
    my @decoded = map { sprintf("%04x", ord) } split(//, $IDN);
    # Code points that are NOT in the Latin-like Cyrillic list
    my @checker = array_minus( @decoded, @latinlike_cyrillics );
    if (@checker) { print "$testdomain [$IDN] seems to be ok\n" }
    else          { print "$testdomain [$IDN] is possibly scam\n" }
}
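The script above assumes the domain always has exactly three labels. A possible variant, sketched here under the assumption that the domain has already been decoded (for example with domain_to_unicode), checks every label with a regex character class built from the same 16 code points. The name suspicious_labels is invented for this sketch.

```perl
use strict;
use warnings;

# Character class built from the 16 code points in the table above.
my $LATINLIKE = qr/[\x{0430}\x{0441}\x{0501}\x{0435}\x{04bb}\x{0456}\x{0458}\x{043a}\x{04cf}\x{043e}\x{0440}\x{051b}\x{0455}\x{051d}\x{0445}\x{0443}]/;

# Hypothetical helper: return the labels of an already-decoded domain that
# consist only of Latin-like Cyrillic letters.
sub suspicious_labels {
    my ($unicode_domain) = @_;
    return grep { /\A$LATINLIKE+\z/ } split /\./, $unicode_domain;
}

# Works for any number of labels, e.g. a.b.example.com
my @hits = suspicious_labels("www.\x{0430}\x{0440}\x{0440}\x{04cf}\x{0435}.com");
print scalar(@hits), " suspicious label(s)\n";
```

A label that contains even one Cyrillic letter outside the list is not reported, which matches the "all letters are visually similar" rule described earlier.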