java, web-scraping, jsoup

Get the Amazon pricing data tables with Jsoup


I'm trying to use Jsoup to get the data tables from this website: http://aws.amazon.com/ec2/pricing/

I need the data from all the tables and I'm starting with the first one, but the page only loads the table after some delay.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// url holds the page address, e.g. "http://aws.amazon.com/ec2/pricing/"
Document doc = Jsoup.connect(url).get();
Elements tableElements = doc.select("table");
Elements tableHeaderEles = tableElements.select("thead tr th");
// "tbody tr" skips the header rows; Jsoup's parser adds tbody when the markup omits it
Elements tableRowElements = tableElements.select("tbody tr");
for (Element row : tableRowElements) {
    System.out.println("row");
    // print the text of every cell in this row
    for (Element cell : row.select("td")) {
        System.out.println(cell.text());
    }
    System.out.println();
}

Solution

  • Jsoup:

    • Add a userAgent and a timeout to your connection.
    • Make sure the HTML you receive is the page you expect: print the fetched document and check that the table is actually in it.
    • Try out your CSS Selector query on http://try.jsoup.org/.
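
The Jsoup checks above can be sketched roughly as follows. This is a minimal example, not a definitive implementation: the class name `PricingFetch`, the helper names, the user-agent string and the 10-second timeout are all illustrative assumptions.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PricingFetch {
    // Fetches the page with an explicit user agent and timeout.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // placeholder value, use a real browser string
                .timeout(10_000)          // milliseconds
                .get();
    }

    // Counts the elements matched by a CSS query, to check whether
    // the table rows are present in the HTML that was received.
    static int countMatches(Document doc, String cssQuery) {
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://aws.amazon.com/ec2/pricing/");
        System.out.println("rows found: " + countMatches(doc, "table tbody tr"));
    }
}
```

If the count is zero even though the table is visible in a browser, the rows are most likely being added by JavaScript after the page loads, which Jsoup alone cannot see.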

    PhantomJSDriver:

    If the problem is caused by JavaScript (Jsoup does not execute JavaScript), then I suggest Selenium + PhantomJSDriver (GhostDriver), which is used for GUI-less browser automation. With this you can easily navigate through pages, select elements, submit forms and do some scraping, with JavaScript fully supported.

    You can go through the Selenium documentation here. You will also have to download the phantomjs.exe file.

    A good tutorial for PhantomJSDriver is given here.

    Configuration of PhantomJSDriver (from the tutorial):

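    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriverService;
    import org.openqa.selenium.remote.DesiredCapabilities;

    DesiredCapabilities caps = new DesiredCapabilities();
    caps.setJavascriptEnabled(true); // not really needed: JS enabled by default
    caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "C:/phantomjs.exe"); // path to the downloaded binary
    caps.setCapability("takesScreenshot", true);
    WebDriver driver = new PhantomJSDriver(caps);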
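
Once the driver is configured, one possible way to use it is to let the browser render the page and then hand the resulting HTML to Jsoup. The sketch below is only an illustration under some assumptions: the class and method names are made up, the `table tbody tr` selector and the 15-second wait are guesses about the page, and the `WebDriverWait` constructor shown is the Selenium 3.x form that takes seconds (newer Selenium versions take a `Duration` instead).

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

class RenderedTableScraper {
    // Loads the page in the headless browser and returns the HTML after scripts have run.
    static String renderedHtml(WebDriver driver, String url) {
        driver.get(url);
        // Assumption: Selenium 3.x, where the timeout is given in seconds.
        new WebDriverWait(driver, 15)
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("table tbody tr")));
        return driver.getPageSource();
    }

    // Parses HTML with Jsoup and returns each body row as a list of cell texts.
    static List<List<String>> tableRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        for (Element row : Jsoup.parse(html).select("table tbody tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

A typical call would be `tableRows(renderedHtml(driver, "http://aws.amazon.com/ec2/pricing/"))`, followed by `driver.quit()` in a `finally` block so the PhantomJS process does not stay alive.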