Tags: html, amazon-web-services, aws-lambda, jsoup

Cannot reach some websites from within an AWS Lambda function


I am creating an AWS Lambda function to do some basic web scraping using JSoup. I've set up the necessary VPC and corresponding requirements (I think).

When I execute the Lambda function through the AWS testing interface, I can successfully connect to basic websites such as Google and CNN (https://www.google.com/ and https://www.cnn.com/).

However, when I try to scrape the website I am interested in

https://www.wordplays.com/crossword-solver/egyptian-snake/

I get an IO exception:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403.

However, when I run the same code locally (on my computer) with that URL, it connects and reads the website completely fine. This makes me think my VPC is set up incorrectly, but I don't know why I would be able to reach www.google.com and not www.wordplays.com.

This is how I am invoking jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// The variable html holds the target URL string.
Document document = Jsoup.connect(html)
     .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
     .get();

I am not sure how to move forward since I cannot figure out why I can successfully connect to some websites but not others.


Solution

  • My guess is that the website is blocking you. Many websites block the Amazon AWS IP address range to protect their data from web crawlers; in fact, the AWS range is likely the most widely blocked range out there. The exact behavior depends on the implementation, but quite often the website returns a 4xx error or lets the request time out. You can confirm this by inspecting the blocked response itself, as shown in the first sketch below.

    You can try to use a proxy server that is outside the AWS range; see the second sketch below.

    In the case of a larger website, getting past the protection may be more complicated and you may need a full browser to do so. My colleague wrote an article on this topic: https://help.apify.com/en/articles/1961361-several-tips-how-to-bypass-website-anti-scraping-protections . But in 99% of cases, a proxy server will solve the issue.
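
    To check whether this is an IP-based block rather than a VPC misconfiguration, you can ask jsoup to hand back the 403 response instead of throwing. A minimal sketch, reusing the URL and user agent from the question; ignoreHttpErrors(true) and execute() are standard jsoup connection methods:

        import org.jsoup.Connection;
        import org.jsoup.Jsoup;

        public class BlockCheck {
            public static void main(String[] args) throws Exception {
                // ignoreHttpErrors(true) makes jsoup return 4xx/5xx responses
                // instead of throwing HttpStatusException.
                Connection.Response res = Jsoup
                        .connect("https://www.wordplays.com/crossword-solver/egyptian-snake/")
                        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
                        .ignoreHttpErrors(true)
                        .execute();
                System.out.println(res.statusCode() + " " + res.statusMessage());
                // The start of the body usually reveals who is doing the blocking,
                // e.g. a CDN or anti-bot challenge page.
                System.out.println(res.body().substring(0, Math.min(500, res.body().length())));
            }
        }

    If this prints 403 from Lambda while the same code returns 200 locally, the block is tied to the source IP address, not to your VPC setup.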
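
    A minimal sketch of the proxy approach; proxy.example.com:8080 is a hypothetical placeholder, so substitute a proxy you actually have access to that sits outside the AWS range. Connection.proxy(host, port) is a standard jsoup method:

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public class ProxyFetch {
            public static void main(String[] args) throws Exception {
                Document document = Jsoup
                        .connect("https://www.wordplays.com/crossword-solver/egyptian-snake/")
                        // Hypothetical proxy host/port; use your own non-AWS proxy here.
                        .proxy("proxy.example.com", 8080)
                        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
                        .get();
                System.out.println(document.title());
            }
        }

    Because the request now leaves from the proxy's IP address, the target site no longer sees an AWS source address.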