I am trying to make a basic Instagram web scraper, both to collect art inspiration pictures and to boost my general programming knowledge and experience.
The issue I am currently having is that CasperJS/PhantomJS can't detect the higher-resolution images in the srcset, and I can't figure out a way around this. Instagram's srcsets provide 640x640, 750x750, and 1080x1080 images. I would like to retrieve the 1080x1080 one, but no method I've tried so far can detect it. Setting a larger viewport does nothing, and I can't retrieve the whole srcset by getting the HTML and splitting it where I need to. As far as I can tell, the only way to retrieve that image is from this srcset.
As I was asked for more details, here they are. This is the code I used to get the attributes from the page:
    function getImages() {
        // Runs in the page context: collect the src of every matching image
        var scripts = document.querySelectorAll('._2di5p');
        return Array.prototype.map.call(scripts, function (e) {
            return e.getAttribute('src');
        });
    }
Then I do the standard:
    casper.waitForSelector('div._4rbun', function () {
        this.echo('...found selector ...try getting image srcs now...');
        imagesArray = this.evaluate(getImages);
        imagesArray.forEach(function (item) {
            console.log(item);
        });
    });
However, all that is returned is the lowest resolution in the srcset. For example, with this URL (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial), all that is returned is
https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg
which is the lowest-resolution (640x640) image in the srcset. Ideally, I'd like to retrieve
https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg
which is the 1080x1080 image in the srcset, but as far as I can tell there is no way to get it this way. It never shows up in what PhantomJS sees.
Solution: I switched to SlimerJS. If I run the script with "casperjs --engine=slimerjs fileName.js", I can retrieve srcsets in full. For example, if I use this code:
    function getImgSrc() {
        // Runs in the page context: collect the full srcset of every matching image
        var scripts = document.querySelectorAll("._2di5p");
        return Array.prototype.map.call(scripts, function (e) {
            return e.getAttribute("srcset");
        });
    }
on this URL (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial), I get the following as the result:
https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 640w,https://instagram.flcy1-1.fna.fbcdn.net/vp/b4eebf94247af02c63d20320f6535ab4/5B6258DF/t51.2885-15/s750x750/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 750w,https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg 1080w
This is what I wanted, as it means I can scrape those 1080x1080 images. Sorry for the messy page, but I wanted to leave my trail of steps for anyone else attempting the same thing.
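Once the full srcset string is available, the largest image still has to be picked out of it. A minimal sketch of that step might look like the following. Note that `pickLargestFromSrcset` is a hypothetical helper name, not part of CasperJS or SlimerJS, and it assumes every candidate uses a width descriptor such as "1080w" and that the URLs themselves contain no commas (true for the srcset shown above, but not guaranteed in general).

```javascript
// Hypothetical helper (not a CasperJS/SlimerJS API): return the URL with the
// largest "w" descriptor from a srcset string, or null if none is found.
// Assumes candidates are comma-separated and URLs contain no commas.
function pickLargestFromSrcset(srcset) {
    var best = null;
    var bestWidth = -1;
    (srcset || '').split(',').forEach(function (candidate) {
        // Each candidate is expected to look like "<url> <width>w"
        var parts = candidate.trim().split(/\s+/);
        if (parts.length < 2 || !parts[0]) {
            return; // skip empty or descriptor-less entries
        }
        var width = parseInt(parts[1], 10); // "1080w" -> 1080
        if (!isNaN(width) && width > bestWidth) {
            bestWidth = width;
            best = parts[0];
        }
    });
    return best;
}
```

It could then be applied to each string returned by `this.evaluate(getImgSrc)`, for example `imagesArray.map(pickLargestFromSrcset)`, to end up with just the 1080w URLs.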