Tags: javascript, web-scraping, phantomjs, screen-scraping, webgrabber

PhantomJS querySelectorAll().textContent returns nothing


I created a simple web scraper with PhantomJS to grab data from a website. It doesn't work when I use querySelectorAll to get the content I want. Here is my whole code.

var page = require('webpage').create();

var url = 'https://www.google.com.kh/?gws_rd=cr,ssl&ei=iE7jV87UKsrF0gSDw4zAAg';

page.open(url, function(status){

  if(status === 'success'){

    var title = page.evaluate(function(){
      return document.querySelectorAll('.logo-subtext')[0].textContent;
    });

    console.log(title);
  }
  phantom.exit();
});

Please help me sort this out.

Thanks a lot.


Solution

  • By default the virtual screen size of PhantomJS is 400x300.

    var page = require('webpage').create();
    console.log(page.viewportSize.width);
    console.log(page.viewportSize.height);
    

    400
    300

    Some sites take note of that and, instead of the normal version you see in a desktop browser, serve a stripped-down mobile version of the HTML and CSS. We can fix that by setting the desired viewport size:

    page.viewportSize = { width: 1280, height: 800 };
    

    There are also sites that do user-agent sniffing and make decisions based on that. If they don't recognize your browser, they may serve a mobile version to be on the safe side, or, if they don't want to be scraped, they may refuse connections from PhantomJS, because it honestly identifies itself:

    console.log(page.settings.userAgent);
    

    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1

    But we can set the desired user agent:

    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0';
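
The viewport and user agent influence what the site sends back, so it is simplest to set both before calling page.open. A minimal sketch of that order follows. The helper name applyDesktopProfile is hypothetical, the values are the ones used above, and the URL is a shortened form of the one in the question. The typeof phantom guard is only there so the file can also be loaded outside PhantomJS.

```javascript
// Hypothetical helper: make a PhantomJS page look like a desktop Firefox.
function applyDesktopProfile(page) {
  // Desktop-sized viewport instead of the 400x300 default.
  page.viewportSize = { width: 1280, height: 800 };
  // User agent string taken from the answer above.
  page.settings.userAgent =
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0';
  return page;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = applyDesktopProfile(require('webpage').create());
  page.open('https://www.google.com.kh/', function (status) {
    console.log('status: ' + status);
    console.log('viewport width: ' + page.viewportSize.width);
    phantom.exit();
  });
}
```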
    

    When working with something as fragile as web scraping, you really should take notice of any errors and system messages you can get.

    So no PhantomJS script should be without onError and onConsoleMessage callbacks:

    page.onError = function (msg, trace) {
        var msgStack = ['ERROR: ' + msg];
        if (trace && trace.length) {
          msgStack.push('TRACE:');
          trace.forEach(function(t) {
            msgStack.push(' -> ' + t.file + ': ' + t.line + (t.function ? ' (in function "' + t.function +'")' : ''));
          });
        }
        console.log(msgStack.join('\n'));
    };   
    
    page.onConsoleMessage = function (msg) {
        console.log(msg);
    };   
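
With these callbacks in place, a lookup that finds nothing can report it instead of failing quietly. In the question's script, querySelectorAll('.logo-subtext')[0] is undefined when the selector matches nothing, so reading .textContent throws inside page.evaluate. Below is a minimal sketch of a more defensive lookup. The helper name textOfFirstMatch is hypothetical; because it runs in the page context, it uses only the global document and the selector passed in as an argument.

```javascript
// Hypothetical helper, meant to be passed to page.evaluate.
// It runs inside the page, so it must not rely on outer variables.
function textOfFirstMatch(selector) {
  var nodes = document.querySelectorAll(selector);
  if (nodes.length === 0) {
    // Forwarded to the PhantomJS console by page.onConsoleMessage.
    console.log('no element matches ' + selector);
    return null;
  }
  return nodes[0].textContent;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.onConsoleMessage = function (msg) {
    console.log('PAGE: ' + msg);
  };
  page.open('https://www.google.com.kh/', function (status) {
    if (status === 'success') {
      // Extra arguments to page.evaluate are passed on to the function.
      var title = page.evaluate(textOfFirstMatch, '.logo-subtext');
      console.log(title === null ? 'nothing found' : title);
    }
    phantom.exit();
  });
}
```

A null result then points at the selector or at the version of the page that was served, rather than at the scraping code itself.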
    

    Another vital technique for debugging PhantomJS scripts is taking screenshots. Are you sure that PhantomJS sees what you see in your Chrome?

    page.render("google.com.png");
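
A screenshot is only useful once the page has loaded, so the render call belongs inside the page.open callback. A minimal sketch, assuming a hypothetical helper named renderWhenLoaded and the output file name used above:

```javascript
// Hypothetical helper: open a URL and save a screenshot after loading.
function renderWhenLoaded(page, url, file, done) {
  page.open(url, function (status) {
    var saved = false;
    if (status === 'success') {
      // Render inside the load callback, so the picture shows the loaded page.
      page.render(file);
      saved = true;
    }
    done(status, saved);
  });
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  renderWhenLoaded(page, 'https://www.google.com.kh/', 'google.com.png',
    function (status, saved) {
      console.log(saved ? 'saved google.com.png' : 'load failed: ' + status);
      phantom.exit();
    });
}
```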
    

    Before setting the user agent:

    [Screenshot: page rendered with the native PhantomJS user agent]

    After setting the Firefox user agent:

    [Screenshot: page rendered after setting the Firefox user agent]