Tags: image-processing, web-scraping, file-get-contents

How to block images from being scraped by file_get_contents or wget, and how to counter it?


My client writes blogs on Sina Blog and is only comfortable with its editor. So after she submits a post, I use a small snippet to scrape the images and text over to her own blog website. Its core is:

$url = 'http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690';
$img_data = @file_get_contents($url);        // image bytes, or false on failure
$bytes = file_put_contents('1.jpg', $img_data); // number of bytes written, or false

As weird as it sounds, it worked very well and saved us both tons of time. But recently the images all came back blank with watermarks. I guess Sina finally detected our little dirty trick and blocked the images from being scraped. I'm curious how the blocking is done and, more importantly, whether there is any way to work around it. I've also tried wget 'http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690' (note that the URL must be quoted, or the shell treats the & as a background operator), and it too only gets the blank image.


Solution

  • Just a suggestion - the easiest (and most likely) way a site detects a scraper is by looking at the request headers, most commonly "Accept", "Referer", and "User-Agent". You could try copying the values your "real" browser sends and plugging them into the wget call, like so:
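    A sketch of that wget call follows. The header values below are placeholders; copy the real ones from your own browser's developer tools (Network tab), since the server may check for the exact strings your browser sends:

    ```shell
    # Fetch the image while impersonating a normal browser request.
    # --header adds arbitrary request headers; --user-agent sets User-Agent;
    # -O names the output file. The URL is quoted so '&' is not
    # treated as a shell background operator.
    wget --header='Referer: http://blog.sina.com.cn/' \
         --header='Accept: image/webp,image/*,*/*;q=0.8' \
         --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)' \
         -O 1.jpg \
         'http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690'
    ```

    If the server is checking the Referer header (a common anti-hotlinking measure), sending a Referer from the blog's own domain is usually what makes the difference.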

    Hope that helps!