Tags: perl, html, crc, lwp

How to detect a changed webpage?


In my application, I fetch webpages periodically using LWP. Is there any way to check whether a webpage has changed between two consecutive fetches, other than explicitly comparing the contents? Is there a signature (say, a CRC) generated at a lower protocol layer that can be extracted and compared against older signatures to detect changes?


Solution

  • There are two possible approaches. One is to use a digest of the page, e.g.

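    use strict;
    use warnings;
    
    use Digest::MD5 'md5_hex';
    use LWP::UserAgent;
    
    my $url          = 'http://example.com/';   # placeholder URL
    my $saved_digest = '';                      # digest kept from the previous fetch
    
    # fetch the page
    my $response = LWP::UserAgent->new->get($url);
    die $response->status_line unless $response->is_success;
    
    # Digest the bytes: undo any Content-Encoding (e.g. gzip) but skip charset
    # decoding, because md5_hex croaks on wide characters.
    my $digest = md5_hex( $response->decoded_content( charset => 'none' ) );
    
    if ( $digest ne $saved_digest ) {
        # the page has changed.
    }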
    

    Another option is to use an HTTP ETag, if the server provides one for the requested resource. Store it, then include it in an If-None-Match header on subsequent requests. If the server's ETag is unchanged, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page along with its new ETag. See "Entity Tags" in RFC 2616 (since superseded by RFC 7232 and RFC 9110).

    Of course, the server could be lying, and sending the same ETag even though the content has changed. There's no way to know unless you look.
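
    The ETag approach can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the helper name fetch_if_changed, its return convention, and the example URL are assumptions made for this sketch, and it presumes the server actually sends an ETag header.

```perl
use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper (not part of LWP): conditional GET with a stored ETag.
# Returns a list: (changed flag, ETag to store for next time, body or undef).
sub fetch_if_changed {
    my ( $ua, $url, $saved_etag ) = @_;

    # Send If-None-Match only when an ETag from an earlier fetch exists.
    my @headers = defined $saved_etag ? ( 'If-None-Match' => $saved_etag ) : ();
    my $response = $ua->get( $url, @headers );

    # 304 Not Modified: the server reports that the stored copy is current.
    return ( 0, $saved_etag, undef ) if $response->code == 304;

    die 'Fetch failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    # New content. scalar() keeps the list aligned when no ETag was sent,
    # in which case the second element is undef.
    return ( 1, scalar $response->header('ETag'), $response->decoded_content );
}

# Example usage (the URL is a placeholder; $saved_etag comes from your own storage):
#   my $ua = LWP::UserAgent->new;
#   my ( $changed, $etag, $body )
#       = fetch_if_changed( $ua, 'http://example.com/', $saved_etag );
```

    If the server sends no ETag at all, the returned ETag is undef and every request is a full fetch, so the digest comparison remains the fallback.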