php performance web-crawler querypath

What is the known or expected impact of using a PHP/QueryPath crawler on a target web server, and how can it be kept to a minimum?


I'm building a php+querypath crawler to prototype an idea. I'm worried that once I run it, the target site might be affected in some way, since it has a large number of relevant pages I want to scrape -- 1361 pages at the moment.

What is recommended to keep the impact on the target site to a minimum?


Solution

  • Since you are building a crawler, the main impact you can have on the target website is using up its bandwidth and adding some load to its server.

    To keep the impact to a minimum, you can do the following:
    1. While building your crawler, download a sample page from the target site to your computer and test your script against that local copy.
    2. Make sure the loop that scrapes the 1361 pages works correctly and downloads each page only once.
    3. Make sure your script downloads only one page at a time, and optionally add an interval between fetches so that the target server sees less load.
    4. Depending on how heavy each page is, you can spread the download of all 1361 pages over hours, days, or even months.
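Steps 2 and 3 above could be sketched roughly as follows. This is only a minimal illustration, not a complete crawler: the function name `crawlPolitely`, its parameters, and the 2-second default delay are all assumptions made for the example, and the actual download and QueryPath parsing are left to whatever callable you pass in as `$fetch`.

```php
<?php
// Hypothetical helper (not part of QueryPath): fetches each unique URL once,
// strictly one at a time, and pauses between consecutive requests.
function crawlPolitely(
    array $urls,
    callable $fetch,
    float $delaySeconds = 2.0,   // assumed default; tune it for the target site
    ?callable $pause = null
): array {
    // The pause is injectable so the loop can be tested without really waiting.
    $pause = $pause ?? function (float $seconds): void {
        usleep((int) round($seconds * 1000000));
    };

    // Step 2: drop duplicate URLs so each page is downloaded only once.
    $queue = array_values(array_unique($urls));

    $pages = [];
    foreach ($queue as $i => $url) {
        if ($i > 0) {
            // Step 3: wait between fetches to keep the load on the server low.
            $pause($delaySeconds);
        }
        // Step 3: sequential fetching, so only one request is in flight.
        $pages[$url] = $fetch($url);
    }

    return $pages;
}
```

For instance, `crawlPolitely($urls, 'file_get_contents', 5.0)` would fetch the list sequentially with a 5-second gap, assuming `allow_url_fopen` is enabled. At that rate, 1361 pages need 1360 pauses, or about 1 hour 53 minutes of waiting in total, which gives a feel for how step 4 trades run time against load.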