I want to download some images from a webpage, so I am writing a crawler. I tested a couple of existing crawlers on this page, but none worked as I wanted.
In the first step, I collected the links of 770+ camera models (parent_url); next I planned to collect the images behind each link (child_urls). However, the page is organized in such a way that the child_urls return the same HTML as parent_url.
Here is my code to collect camera links:
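import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.select.Selector.SelectorParseException;

// Fetches url, selects the elements matching the CSS query exp,
// and returns the value of attribute atr for each match.
public List<String> html_compiler(String url, String exp, String atr) {
    List<String> outs = new ArrayList<String>();
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select(exp);
        for (Element link : links) {
            outs.add(link.attr(atr));
            System.out.println("\nlink : " + link.attr(atr));
        }
    } catch (IOException | SelectorParseException e) {
        e.printStackTrace();
    }
    return outs;
}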
I call it like this to collect the links:
String expCam = "tr[class='gallery cameras'] > td[class='title'] > a[href]";
String url = "https://www.dpreview.com/sample-galleries?category=cameras";
String atr = "href";
List<String> cams = html_compiler(url, expCam, atr); // This gives me the links of the individual cameras
String exp2 = "some expression";
// This should give me the image links of the first camera,
// but the webpage returns the same HTML as above
html_compiler(cams.get(0), exp2, "src");
How can I solve this problem? I would also like to hear about other sites (besides Flickr) that classify images by camera model.
EDIT: For example, in Java the following two links return the same HTML:
https://www.dpreview.com/sample-galleries?category=cameras
https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one
To understand how to get the image links, it's important to know how the page loads in a browser. When you click a gallery link, a JavaScript event handler is triggered. The image viewer it creates then loads the images from the data server. The image links are requested via JavaScript and are therefore not visible when you only parse the HTML. The request URL for the image links is https://www.dpreview.com/sample-galleries/data/get-gallery
To get the images in a gallery, you have to add the gallery id. The gallery id is provided by the href attribute of the gallery links. The links look like https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one; in this case 2653563139 is the gallery id. Append the gallery id as ?galleryId=2653563139 to the end of the request URL above to get a JSON object containing all the data needed to build the gallery. Look for the url fields in the images array to get your images.
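As a rough sketch of the step described above, the gallery id can be pulled out of each href with a regular expression and appended to the request URL. The class and method names here (GalleryUrls, galleryId, dataUrl) are made up for illustration, and the pattern assumes the id is the numeric path segment right after /sample-galleries/, as in the example link.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or the site's API.
final class GalleryUrls {
    // Request URL named in the answer.
    static final String DATA_ENDPOINT =
            "https://www.dpreview.com/sample-galleries/data/get-gallery";

    // Assumption: the gallery id is the run of digits directly after "/sample-galleries/".
    private static final Pattern GALLERY_ID =
            Pattern.compile("/sample-galleries/(\\d+)(?:/|$)");

    // Returns the gallery id contained in href, or empty if there is none.
    static Optional<String> galleryId(String href) {
        Matcher m = GALLERY_ID.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Builds the JSON request URL for the gallery that href points to.
    static Optional<String> dataUrl(String href) {
        return galleryId(href).map(id -> DATA_ENDPOINT + "?galleryId=" + id);
    }

    public static void main(String[] args) {
        String href = "https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one";
        // Prints https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=2653563139
        System.out.println(dataUrl(href).orElse("no gallery id found"));
    }
}
```

Each entry of the cams list from the question could be passed through dataUrl. Since the response is JSON rather than HTML, fetching it with Jsoup would probably need Jsoup.connect(dataUrl).ignoreContentType(true).execute().body() instead of get().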
To summarize:
The link you get from the href attribute: https://www.dpreview.com/sample-galleries/2653563139/nikon-d1-review-samples-one
The gallery id: 2653563139
The request URL: https://www.dpreview.com/sample-galleries/data/get-gallery
The JSON object you need: https://www.dpreview.com/sample-galleries/data/get-gallery?galleryId=2653563139
The URLs you are looking for inside the JSON object: "url":"https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg"
And finally your picture link: https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg
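The last two steps of the summary could look roughly like the sketch below, which pulls the url values out of the JSON text. It uses a plain regular expression only to stay free of dependencies; a real JSON library would be the more robust choice. The class name ImageUrlExtractor is made up, the sample JSON is a trimmed guess at the response shape based on the field shown above, and the pattern collects every "url" field in the text, not just those inside the images array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class ImageUrlExtractor {
    // Matches "url":"<value>" and captures the value (assumes no escaped quotes inside it).
    private static final Pattern URL_FIELD =
            Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"");

    // Returns every "url" value found in the JSON text, in order of appearance.
    static List<String> imageUrls(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_FIELD.matcher(json);
        while (m.find()) {
            // JSON may escape forward slashes as \/ ; undo that.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Trimmed sample; the real response is assumed to contain many more fields.
        String json = "{\"images\":[{\"url\":"
                + "\"https://3.img-dpreview.com/files/p/TS1800x1200~sample_galleries/2653563139/7864344228.jpg\"}]}";
        for (String imageUrl : imageUrls(json)) {
            System.out.println(imageUrl);
        }
    }
}
```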
Comment if you want further explanation.