
PHP prevent image scrape


I have a project that serves many images. The project also has an API that serves, among other things, the image links.

I would like a way to prevent scraping of my images. I don't mind users downloading individual images, but I don't want anyone to scrape all of them at once, because of the high bandwidth usage that would cause.

I thought of using .htaccess to deny direct access to the image folders. I also thought of serving images on the website through a dynamic PHP link (for example loadimage.php?id=XXXXX) so that users don't know the full image path.

How could I prevent scraping in the API (and on the website)? I thought of something like a token, where each request generates a new "image id", but either I'm missing something or I can't figure out how to make it work.

I know no method will be 100% effective, but any suggestions on how to make scraping more difficult would be appreciated.

Thanks.


Solution

  • You're looking for a rate limit policy. It involves tracking how many times the images are being requested (or the number of bytes being exchanged), and issuing a (typically) 429 Too Many Requests response when a threshold is exceeded.

    Nginx has some pretty good built-in tools for rate limiting. You mention .htaccess, which implies Apache; it also has a rate limiting module (mod_ratelimit).

    You could do this with or without PHP. You could identify a URL pattern that you want rate limited, and apply the rate limit policy to that URL pattern (could be a PHP script or just a directory somewhere).
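
    As a rough illustration of the Nginx side, a request-rate limit could look something like the fragment below. The zone name `images`, the `/images/` path, and all the numbers are placeholders rather than values from the question. `limit_req_zone` goes in the `http` block and `limit_req` in the matching `location`.

    ```nginx
    # In the http {} block: track clients by IP address in a 10 MB shared zone,
    # allowing an average of 5 requests per second each (assumed figures).
    limit_req_zone $binary_remote_addr zone=images:10m rate=5r/s;

    server {
        # Hypothetical URL prefix for the images.
        location /images/ {
            # Permit short bursts of up to 20 extra requests, reject the rest.
            limit_req zone=images burst=20 nodelay;
            # Respond with 429 instead of the default 503.
            limit_req_status 429;
        }
    }
    ```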

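    For Apache, using mod_ratelimit, which throttles the bandwidth of each response (it does not count requests or return 429 itself):

    <Location ".../path/to/script.php">
        # Send responses through the bandwidth-limiting filter.
        SetOutputFilter RATE_LIMIT
        # Maximum speed in KiB/s.
        SetEnv rate-limit 400
        # KiB sent at full speed before throttling starts.
        SetEnv rate-initial-burst 512
    </Location>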
    

    Or, you could write PHP code that records each access in a database and enforces a limit on the number of accesses within a given time window.

    I would not generally recommend writing your own when such good tools are available in the web server itself. One exception would be if you use several web servers in a cluster, which cannot easily synchronize rate limiting thresholds and counts across the servers.
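
If you do go the database route described above, a minimal sketch of a fixed-window counter might look like the following. It assumes PDO with a database that supports `REPLACE INTO` (SQLite or MySQL, for instance). The table name `image_hits`, the function `allowRequest`, and both constants are hypothetical names chosen for illustration, and it is not hardened against concurrent requests.

```php
<?php
declare(strict_types=1);

// Hypothetical limits; tune them to your own traffic.
const MAX_REQUESTS   = 60; // allowed image requests per client per window
const WINDOW_SECONDS = 60; // length of the counting window in seconds

/**
 * Records one access for $clientKey and reports whether that client is
 * still within MAX_REQUESTS for its current fixed window.
 */
function allowRequest(PDO $db, string $clientKey, int $now): bool
{
    // One row per client: when its window started and how many hits so far.
    $db->exec(
        'CREATE TABLE IF NOT EXISTS image_hits (
            client_key   VARCHAR(64) PRIMARY KEY,
            window_start INTEGER NOT NULL,
            hits         INTEGER NOT NULL
        )'
    );

    $db->beginTransaction();
    $stmt = $db->prepare('SELECT window_start, hits FROM image_hits WHERE client_key = ?');
    $stmt->execute([$clientKey]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    if ($row === false || $now - (int) $row['window_start'] >= WINDOW_SECONDS) {
        // First request, or the previous window has expired: start a new one.
        $windowStart = $now;
        $hits = 1;
    } else {
        // Still inside the current window: count this access.
        $windowStart = (int) $row['window_start'];
        $hits = (int) $row['hits'] + 1;
    }

    $save = $db->prepare('REPLACE INTO image_hits (client_key, window_start, hits) VALUES (?, ?, ?)');
    $save->execute([$clientKey, $windowStart, $hits]);
    $db->commit();

    return $hits <= MAX_REQUESTS;
}
```

In a script such as the `loadimage.php` from the question, you would call `allowRequest($db, $_SERVER['REMOTE_ADDR'], time())` before streaming the image, and send `http_response_code(429)` and stop when it returns false. Keying on the IP address is only one option; an API key or session ID would work the same way. Pointing every web server at the same database is also what makes this approach usable in the cluster case mentioned above.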