amazon-web-services  web-scraping  aws-lambda  ip-address

AWS Lambda for IP Rotation?


I'm crawling a website whose pages follow the format root/page_number.html, where page_number is a contiguous integer. The website blocks me if I crawl too quickly, so I thought it would be a good idea to crawl using AWS Lambda, so that the IP address would rotate each time a new function call is made.

I then wrote the function so that each invocation crawls only 100 pages, to make sure many parallel tasks are spawned, hopefully on different machines with different IP addresses. It worked fine at the beginning, but I was still blocked by the website after crawling about 100K pages. This makes me wonder:

  1. Is each machine guaranteed to have an IP address different from another machine in the same region?
  2. If I have ~1000 concurrent tasks running, are they most likely to run on the same machine or different machines?
  3. Is it possible to ensure that a newly launched task will not run on the same machine that is already running another similar task?
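
For reference, the batching described above (100 pages per function call) could look roughly like the sketch below. The helper names `page_batches` and `page_urls` and the example root URL are hypothetical and only illustrate the idea; the actual crawler code is not shown in the question.

```python
def page_batches(first_page, last_page, batch_size=100):
    """Yield inclusive (start, end) page ranges of at most batch_size pages."""
    for start in range(first_page, last_page + 1, batch_size):
        yield start, min(start + batch_size - 1, last_page)


def page_urls(root, start, end):
    """Build the root/page_number.html URLs for one batch (hypothetical helper)."""
    return [f"{root}/{n}.html" for n in range(start, end + 1)]


if __name__ == "__main__":
    # Each (start, end) pair would be the payload of one Lambda invocation.
    for start, end in page_batches(1, 250):
        print(start, end, len(page_urls("https://example.com", start, end)))
```

Each batch would then be sent as the event payload of a separate invocation, so the number of concurrent tasks is roughly the total page count divided by 100.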

Solution

  • Most machines don't generally have their own IP addresses on the internet anymore.

    Your Lambdas will be communicating with the internet through a NAT gateway. The NAT gateway will have its own public IP address, or it will talk to the internet through some kind of egress gateway that has its own public IP.

    The website you are talking to will see all your calls coming from the public IP of the gateway that connects to it. If you have 1000 concurrent connections, they will all come from (roughly) the same IP, but from different ports.
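
One way to check this for your own setup is to have each invocation report the public IP it appears to come from, then count the distinct values. The following is a minimal sketch, not a definitive implementation. It assumes that `https://checkip.amazonaws.com` returns the caller's public IP as plain text; the `fetch` parameter and the `summarize_ips` helper are hypothetical names added for illustration.

```python
import urllib.request
from collections import Counter

# Assumption: this endpoint replies with the caller's public IP as plain text.
CHECK_IP_URL = "https://checkip.amazonaws.com"


def fetch_public_ip(url=CHECK_IP_URL, timeout=5):
    """Return the public IP that the remote side sees for this process."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def lambda_handler(event, context, fetch=fetch_public_ip):
    """Lambda entry point: report the egress IP of this invocation.

    The extra `fetch` argument is a hypothetical hook so the lookup can be
    replaced in local tests; Lambda itself only passes event and context.
    """
    return {"public_ip": fetch()}


def summarize_ips(results):
    """Count how many invocations reported each public IP."""
    return Counter(r["public_ip"] for r in results)
```

If you invoke this function many times concurrently and pass the collected responses to `summarize_ips`, a small number of distinct keys compared to the number of invocations would be consistent with the shared-gateway behaviour described above.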