java web-crawler jsoup data-collection

URL is not returning the correct HTML for a webpage (in my Java crawler)


I want to download some images from a webpage, so I am writing a crawler. I tested a couple of crawlers on this page, but none worked as I wanted.

In the first step, I collected the links of 770+ camera models (parent_url); next, I planned to collect the images from each link (child_urls). However, the page is organized in such a way that the child_urls return the same HTML as the parent_url.

Here is my code to collect camera links:

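import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.select.Selector.SelectorParseException;

// Fetches the page at url and returns attribute atr of every element matching the CSS selector exp.
public List<String> html_compiler(String url, String exp, String atr){
    List<String> outs = new ArrayList<String>();
    try {
        Document doc = Jsoup.connect(url).get();

        Elements links = doc.select(exp);
        for (Element link : links) {
            outs.add(link.attr(atr));
            System.out.println("\nlink : " + link.attr(atr));
        }
    } catch (IOException | SelectorParseException e) {
        e.printStackTrace();
    }
    return outs;
}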

With this code, I collect the links:

String expCam = "tr[class='gallery cameras'] > td[class='title'] > a[href]";
String url = "https://www.dpreview.com/sample-galleries?category=cameras";
String atr = "href";
List<String> cams = html_compiler(url, expCam, atr); // This gives me the links of the individual cameras

String exp2 = "some expression";
html_compiler(cams.get(0), exp2, "src"); // --> this should give me the image links of the first
                                         // camera, but the webpage returns the same html as above

How can I solve this problem? I'd also love to hear about other sites that classify images by camera model (other than Flickr).

EDIT: For example, in Java the following two links give the same HTML:

https://www.dpreview.com/sample-galleries?category=cameras

https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one


Solution

  • To understand how to get the image links, it helps to know how the page loads in a browser. Clicking a gallery link triggers a JavaScript event handler, which creates an image viewer that then loads the images from the data server. Because the image links are requested via JavaScript, they are not visible when you only parse the HTML. The request URL for the image links is https://www.dpreview.com/sample-galleries/data/get-gallery; to get the images of a particular gallery, you have to add the gallery ID. The gallery ID is contained in the href attribute of the gallery links, which look like https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one. In this case, 2653563139 is the gallery ID. Append ?galleryId=2653563139 to the request URL to get a JSON object containing all the data needed to build the gallery. Look for the url fields in the images array to get your images.

    To summarize:

    The link you get from the href attribute: https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one

    The gallery ID: 2653563139

    The request URL: https://www.dpreview.com/sample-galleries/data/get-gallery

    The JSON object you need: https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=2653563139

    The URLs you are looking for inside the JSON object: "url":"https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg"

    And finally your picture link: https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg

    Comment if you want further explanation.
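
The steps summarized above could be sketched in Java roughly as follows. This is only a minimal sketch, not a tested crawler: the class name GalleryLinks and its method names are hypothetical, the JSON layout (url fields holding the image links) is assumed from the description above, and the regex-based extraction is a deliberately naive stand-in for a real JSON parser. It uses only the standard library (Java 11+) so that it stays self-contained.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class GalleryLinks {
    // Request URL taken from the answer above.
    static final String DATA_URL = "https://www.dpreview.com/sample-galleries/data/get-gallery";

    // Matches the numeric path segment directly after "/sample-galleries/".
    private static final Pattern ID = Pattern.compile("/sample-galleries/(\\d+)(?:/|$)");

    // Naive match for any "url":"..." string field (assumed JSON shape).
    // It does not check that the field sits inside the images array.
    private static final Pattern URL_FIELD = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"");

    // Pulls the gallery ID out of a gallery link's href attribute.
    static String extractGalleryId(String href) {
        Matcher m = ID.matcher(href);
        if (!m.find()) {
            throw new IllegalArgumentException("No gallery ID in: " + href);
        }
        return m.group(1);
    }

    // Builds the URL that returns the gallery's JSON data.
    static String buildDataUrl(String galleryId) {
        return DATA_URL + "?galleryId=" + galleryId;
    }

    // Collects the values of all "url" fields from the JSON text.
    static List<String> extractImageUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_FIELD.matcher(json);
        while (m.find()) {
            // JSON may escape forward slashes as "\/"; undo that.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    // Downloads the response body of the given URL as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

For each href collected by html_compiler, one would call extractGalleryId, pass the result to buildDataUrl, fetch that URL, and hand the body to extractImageUrls. If you prefer to stay with jsoup for the request, Jsoup.connect(dataUrl).ignoreContentType(true).execute().body() should return the raw JSON text as well; whether the server expects additional headers is not something this sketch verifies.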