javascript, phantomjs, casperjs

Casper/Phantomjs unable to retrieve highest resolution src image


I am trying to make a basic Instagram web scraper, both to collect art inspiration pictures and to build up my programming knowledge and experience.

The issue I am currently having is that CasperJS/PhantomJS can't detect the higher-resolution images in the srcset, and I can't figure out a way around this. Instagram's srcsets provide 640x640, 750x750, and 1080x1080 images. I would like to retrieve the 1080x1080 one, but it seems to be undetectable by any method I've tried so far. Setting a larger viewport does nothing, and I can't retrieve the entire srcset by getting the HTML and splitting it where I need to. As far as I can tell, there is no other way to retrieve that image than to get it from the srcset.

Edit

Since I was asked for more details, this is the code I used to get the attributes from the page:

function getImages() {
    // Collect the src attribute of every image element matching the selector.
    var images = document.querySelectorAll('._2di5p');
    return Array.prototype.map.call(images, function (e) {
        return e.getAttribute('src');
    });
}

Then I do the standard:

casper.waitForSelector('div._4rbun', function () {
    this.echo('...found selector ...try getting image srcs now...');
    // Run getImages inside the page and print each returned src.
    var imagesArray = this.evaluate(getImages);
    imagesArray.forEach(function (item) {
        console.log(item);
    });
});

However, all that is returned is the lowest resolution in the srcset. For example, with this URL (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial), all that is returned is https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg, which is the lowest-resolution (640x640) image in the srcset. Ideally, I'd like to retrieve https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg, which is the 1080x1080 image in the srcset, but I can't. As far as I can tell, there is no way to get that item; it appears to be completely hidden.


Solution

  • My solution was to use SlimerJS. If I run the script with "casperjs --engine=slimerjs fileName.js", I can retrieve the srcsets in full. For example, if I use this code:

    function getImgSrc() {
      var scripts = document.querySelectorAll("._2di5p");
      return Array.prototype.map.call(scripts, function (e) {
          return e.getAttribute("srcset");
      });
    }
    

    on this url (https://www.instagram.com/p/BhWS4csAIPS/?taken-by=kasabianofficial) I will get (https://instagram.flcy1-1.fna.fbcdn.net/vp/b282bb23f82318697f0b9b85279ab32e/5B5CE6F2/t51.2885-15/s640x640/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 640w,https://instagram.flcy1-1.fna.fbcdn.net/vp/b4eebf94247af02c63d20320f6535ab4/5B6258DF/t51.2885-15/s750x750/sh0.08/e35/29740443_908390472665088_4690461645690896384_n.jpg 750w,https://instagram.flcy1-1.fna.fbcdn.net/vp/8d20f803e1cb06e394ac91383fd9a462/5B5C9093/t51.2885-15/e35/29740443_908390472665088_4690461645690896384_n.jpg 1080w) as the result.

    This is what I wanted, as it means I can scrape the 1080x1080 images. Sorry for the messy page, but I wanted to leave my trail of steps for anyone else attempting the same thing.
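
Once the full srcset string is available, the remaining step is picking the widest candidate out of it. The following is only a minimal sketch of how that could be done, not part of the original solution: the function name pickLargestFromSrcset is hypothetical, the example URLs are placeholders, and it assumes the srcset uses width descriptors (such as "1080w") and that the URLs themselves contain no commas, as in the output shown above.

```javascript
// Hypothetical helper: parse a srcset value of the form
// "<url> 640w,<url> 750w,<url> 1080w" and return the URL with the
// largest width descriptor, or null if no such candidate is found.
// Assumption: URLs contain no commas, so splitting on "," is safe.
function pickLargestFromSrcset(srcset) {
    if (!srcset) {
        return null;
    }
    var best = null;
    var bestWidth = -1;
    srcset.split(',').forEach(function (candidate) {
        // Each candidate looks like "<url> <width>w".
        var parts = candidate.trim().split(/\s+/);
        if (parts.length < 2) {
            return;
        }
        var match = /^(\d+)w$/.exec(parts[1]);
        if (!match) {
            return; // Skip candidates without a width descriptor.
        }
        var width = parseInt(match[1], 10);
        if (width > bestWidth) {
            bestWidth = width;
            best = parts[0];
        }
    });
    return best;
}

// Placeholder URLs standing in for the real srcset entries.
var example = 'https://example.com/s640x640/a.jpg 640w,' +
              'https://example.com/s750x750/a.jpg 750w,' +
              'https://example.com/a.jpg 1080w';
console.log(pickLargestFromSrcset(example)); // https://example.com/a.jpg
```

In a script like the one above, this could be applied to each string returned by getImgSrc, for instance by mapping the array returned from this.evaluate(getImgSrc) through the helper. It is written in ES5 style so that it should not depend on newer syntax being available in the scripting engine.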