perl favicon www-mechanize

Find Favicons in HTML using Perl


I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now-important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.

I thought I had found the solution in WWW::Mechanize, since it can list all of the links on a given page, including <link> tags in the document head. However, I cannot find a clean way to use the "find_link" method to search on the rel attribute.

For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."

my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;

I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.

The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:

my $results = $mech->find_all_links(  );

foreach my $result (@{ $results }) {
    # attrs() returns a hash reference of the tag's attributes
    my $attrs = $result->attrs() || {};
    my $rel   = $attrs->{'rel'} // '';

    # Test rel once per link rather than once per attribute
    if ($rel =~ /^apple-touch-icon/) {
        say STDERR "I found it: " . $result->url();
    }

    # Add tests for other types of icons here.
    # E.g. "mask-icon" and "shortcut icon."
}

That works, but it seems messy. Is there a better way?


Solution

  • Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:

    link[rel*=icon i][href]
    

    This CSS selector looks for link elements that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
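
    To see what the selector picks up without touching the network, here is a minimal sketch that runs it against an inline HTML fragment. The markup and the variable names are invented for illustration; only Mojo::DOM's find, map, each, and size are assumed.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical page head, made up for this example.
my $html = <<'HTML';
<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="icon">
</head><body></body></html>
HTML

my $dom = Mojo::DOM->new($html);

# rel must contain "icon" in any case, and href must be present
my @with_href = $dom->find('link[rel*=icon i][href]')
    ->map( attr => 'href' )
    ->each;

# Without [href], the <link rel="icon"> that has no href matches too
my $count_without_href = $dom->find('link[rel*=icon i]')->size;

say for @with_href;
say "matches without [href]: $count_without_href";
```

    The stylesheet link is skipped because its rel value doesn't contain "icon", and the bare rel="icon" link only shows up once [href] is dropped.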

    Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):

    use v5.10;
    
    use Mojo::UserAgent;
    my $ua = Mojo::UserAgent->new->max_redirects(3);
    
    my $results = $ua->get( shift )
        ->result
        ->dom
        ->find( 'link[rel*=icon i][href]' )
        ->map( attr => 'href' )
        ->to_array
        ;
    
    say join "\n", @$results;
    

    That works pretty well so far:

    $ perl mojo.pl https://www.perl.org
    https://cdn.perl.org/perlweb/favicon.ico
    
    $ perl mojo.pl https://www.microsoft.com
    https://c.s-microsoft.com/favicon.ico?v2
    
    $ perl mojo.pl https://leanpub.com/mojo_web_clients
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
    https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
    

    Now, the problem comes when you find more interesting cases that a single selector doesn't cover easily. Suppose not all of the rel values have "icon" in them, or you'd rather not rely on the experimental case-insensitivity flag. You can get a little more fancy by specifying multiple selectors separated by commas. For example, to match two capitalizations without the flag:

    link[rel*=icon][href], link[rel*=ICON][href]
    

    or different values in rel:

    link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
    

    Line up as many of those as you like.
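
    If the list of rel values grows, you might build the comma-separated selector from an array instead of typing it out. This is only a sketch: the @rels list and the HTML fragment are assumptions made up for the example.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical list of exact rel values to look for.
my @rels = ( 'shortcut icon', 'apple-touch-icon-precomposed' );

# One selector per rel value, joined with commas into a selector group
my $selector = join ', ', map { qq{link[rel="$_"][href]} } @rels;

# Hypothetical fragment, made up for this example.
my $dom = Mojo::DOM->new(<<'HTML');
<head>
<link rel="shortcut icon" href="/favicon.ico">
<link rel="apple-touch-icon-precomposed" href="/touch-precomposed.png">
<link rel="mask-icon" href="/mask.svg">
<link rel="stylesheet" href="/site.css">
</head>
HTML

my $hrefs = $dom->find($selector)->map( attr => 'href' )->to_array;

say $selector;
say for @$hrefs;
```

    Because these selectors use = rather than *=, the mask-icon link is not matched here; add its rel value to the list if you want it.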

    But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:

    my %Interesting = ...;
    my $results = $ua->get( shift )
        ->result
        ->dom
        ->find( '...' )
        ->grep( sub { exists $Interesting{ $_->attr('rel') } } )
        ->map( attr => 'href' )
        ->to_array
        ;
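
    As a concrete but hypothetical filling-in of that pattern, here is a sketch that runs against an inline fragment. The keys in %Interesting, the 'link[rel][href]' selector, and the lowercasing of rel before the lookup are my assumptions for illustration, not something the pattern above prescribes.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical lookup table; keys are lowercased rel values.
my %Interesting = map { $_ => 1 } (
    'icon', 'shortcut icon', 'apple-touch-icon', 'mask-icon',
);

# Hypothetical fragment, made up for this example.
my $dom = Mojo::DOM->new(<<'HTML');
<head>
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="mask-icon" href="/mask.svg">
<link rel="alternate" href="/feed.xml">
<link rel="stylesheet" href="/site.css">
</head>
HTML

my $results = $dom
    ->find('link[rel][href]')    # every link that has both attributes
    ->grep( sub { exists $Interesting{ lc( $_->attr('rel') ) } } )
    ->map( attr => 'href' )
    ->to_array;

say for @$results;
```

    Selecting on [rel] first means attr('rel') is always defined inside the grep, so the hash lookup never sees undef.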
    

    I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.