I'm crawling a website whose pages follow the format root/page_number.html, where page_number ranges over contiguous integers. The website blocks me if I crawl too quickly, so I thought it would be a good idea to crawl using AWS Lambda, so that the IP address would be rotated each time a new function call is made.
I then wrote the function so that each invocation crawls only 100 pages, to ensure that many parallel tasks are spawned, hopefully on different machines with different IP addresses. It worked fine at the beginning, but I was still blocked by the website after crawling about 100K pages. This makes me wonder: don't Lambda invocations actually get different IP addresses?
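For context, the per-invocation function looks roughly like this minimal sketch. The base URL, the `handler`/`batch_urls` names, and the event shape are all placeholders, not the actual code:

```python
import urllib.request

BASE_URL = "https://example.com/root"  # placeholder for the real site root

def batch_urls(start, count=100):
    """Build the URLs for one batch of contiguous page numbers."""
    return [f"{BASE_URL}/{n}.html" for n in range(start, start + count)]

def handler(event, context):
    """Lambda entry point: event carries the first page number of the batch."""
    fetched = 0
    for url in batch_urls(event["start"], event.get("count", 100)):
        with urllib.request.urlopen(url) as resp:
            resp.read()  # store/process the page here
            fetched += 1
    return {"fetched": fetched}
```

Each invocation is then fired with a different `start`, so 100K pages means about 1000 parallel calls.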
Machines generally don't have their own public IP addresses on the internet anymore.
Your lambdas will be communicating with the internet through a NAT gateway. The NAT gateway will either have its own public IP address or will talk to the internet through some kind of egress gateway that has its own public IP.
The website you are talking to will see all your calls coming from the public IP of the gateway that connects to it. If you have 1000 concurrent connections, they will all come from (roughly) the same IP, but from different ports.
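You can see the same-IP/different-port pattern locally with plain sockets; this sketch stands in for many lambdas behind one NAT, using a throwaway local server in place of the remote website:

```python
import socket

# A throwaway local server standing in for the remote website.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
server.listen()
host, port = server.getsockname()

# Two concurrent client connections, like two lambdas behind one gateway.
c1 = socket.create_connection((host, port))
c2 = socket.create_connection((host, port))

ip1, port1 = c1.getsockname()
ip2, port2 = c2.getsockname()
# Both connections originate from the same source IP,
# but the OS assigns each one a distinct ephemeral port.
print(ip1, port1)
print(ip2, port2)
```

From the server's point of view, that (IP, port) pair is the only thing distinguishing the two clients, which is exactly what the target website sees from your Lambda fleet.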