java selenium jsoup htmlunit

Best way to parse Google Custom Search Engine results


I need to parse the results of a Google Custom Search Engine. My first issue is that everything is rendered by JavaScript. The page below loads the results to be parsed, which open in a JS popup.

<script>
function gcseCallback() {
  if (document.readyState != 'complete')
    return google.setOnLoadCallback(gcseCallback, true);
  google.search.cse.element.render({gname:'gsearch', div:'results', tag:'searchresults-only', attributes:{linkTarget:''}});
  var element = google.search.cse.element.getElement('gsearch');
  element.execute('lectures');
};
window.__gcse = {
  parsetags: 'explicit',
  callback: gcseCallback
};
(function() {
  var cx = 'xxxxxx:xxxxxxx';
  var gcse = document.createElement('script');
  gcse.type = 'text/javascript';
  gcse.async = true;
  gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
    '//www.google.com/cse/cse.js?cx=' + cx;
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(gcse, s);

})();
</script>
<div id="results"></div>

What I have already tried, with no success: Selenium, Jsoup, and HtmlUnit.

None of them ever loads the results. I know that adding waits normally gives the JavaScript time to load, but that does not help with Google Custom Search Engine: the data in div id=results never loads with any of the tools above. Resources such as CSS and JS page calls load, but not the actual results. I need to do this in Java. Is there a better way to do this?

Is it possible to force the page to load directly as HTML, without any JavaScript loading? If the results were plain HTML, parsing would of course be much easier. Maybe there is a way to convert the page to HTML after the JavaScript has loaded?

Selenium Example

package raTesting;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class Testing {

    public static void main(String[] args)
    {
        // Headless HtmlUnit-backed driver emulating Chrome
        WebDriver driver = new HtmlUnitDriver(BrowserVersion.CHROME);

        // Public URL of the custom search engine, with the query "breaking"
        driver.get("https://www.google.com/cse/publicurl?q=breaking&cx=005766509181136893168:j_finnh-2pi");

        // Dump whatever markup the driver ended up with
        System.out.println(driver.getPageSource());

        driver.quit();
    }
}

When the URL loads in a browser, it displays all the results that need to be scanned, but the page source never comes back with any results.
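One way to tell a slow render apart from a render that never happens is to poll the page source until a marker for the results appears or a timeout expires, instead of sleeping for a fixed time. The sketch below is a minimal, library-free illustration. `RenderedSourcePoller`, `waitForSource`, and the marker string `gs-title` are hypothetical names chosen for this example, and the `Supplier` stands in for something like `driver::getPageSource`.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper: polls a page-source supplier until the markup passes a check.
final class RenderedSourcePoller {

    // Returns the first source that satisfies hasResults, or empty on timeout/interruption.
    static Optional<String> waitForSource(Supplier<String> pageSource,
                                          Predicate<String> hasResults,
                                          Duration timeout,
                                          Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String html = pageSource.get();
            if (html != null && hasResults.test(html)) {
                return Optional.of(html);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for driver::getPageSource: the "results" appear on the third read.
        int[] reads = {0};
        Supplier<String> fakeDriver = () -> ++reads[0] < 3
                ? "<div id=\"results\"></div>"
                : "<div id=\"results\"><a class=\"gs-title\" href=\"http://example.com\">hit</a></div>";

        // Assumption: "gs-title" marks a rendered result; adjust to the real markup.
        Optional<String> html = waitForSource(fakeDriver, s -> s.contains("gs-title"),
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println(html.isPresent() ? "results rendered" : "timed out");
    }
}
```

With a real driver, the supplier would be `driver::getPageSource`. If the call still times out, the results are not being rendered in that driver at all, and a longer wait will not help.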


Solution

  • For anyone still looking: alter the code below to fit your needs. Put your procedure into one or more methods and call them from the function check(). Anything inside that function is repeated until the links array is exhausted or the limit is reached.

    Known issue: CasperJS runs faster than Google's JavaScript, which results in duplicate links. I haven't been able to tell CasperJS to wait for the Google JS popup to close first.

    var casper = require("casper").create({
        verbose: true
    });
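    // First CLI argument: the URL to start from
    var url = casper.cli.get(0);
    // The base links array
    var links = [
        url
    ];
    
    // Second CLI argument: maximum number of links to visit.
    // If we don't set a limit, it could go on forever
    var upTo = ~~casper.cli.get(1) || 10;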
    
    var currentLink = 0;
    
    // Get the links, and add them to the links array
    // (It could be done all in one step, but it is intentionally split)
    function addLinks(link) {
        this.then(function() {
            var found = this.evaluate(searchLinks);
            this.echo(found.length + " links found on " + link);
            links = links.concat(found);
        });
    }
    
    // Fetch all <a> elements from the page and return
    // the ones that contain an href starting with 'http://'
    function searchLinks() {
        var filter, map;
        filter = Array.prototype.filter;
        map = Array.prototype.map;
        return map.call(filter.call(document.querySelectorAll("a"), function(a) {
            return (/^http:\/\/.*/i).test(a.getAttribute("href"));
        }), function(a) {
            return a.getAttribute("href");
        });
    }
    
    // Just opens the page and prints the title
    function start(link) {
        this.start(link, function() {
            this.echo('Page title: ' + this.getTitle());
        });
    }
    
    // As long as it has a next link, and is under the maximum limit, will keep running
    function check() {
        if (links[currentLink] && currentLink < upTo) {
            this.echo('--- Link ' + currentLink + ' ---');
            start.call(this, links[currentLink]);
            addLinks.call(this, links[currentLink]);
            currentLink++;
            this.run(check);
        } else {
            this.echo("All done.");
            this.exit();
        }
    }
    
    casper.start().then(function() {
        this.echo("Starting");
    });
    
    casper.run(check);
    

    src: http://code.ohloh.net/file?fid=VzTcq4GkQhozuKWkprFfBghgXy4&cid=ZDmcCGgIq6k&s=&fp=513476&mp&projSelected=true#L0
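Once the rendered HTML is available as a string, the link filtering done by `searchLinks` above can be reproduced in Java, with duplicates dropped along the way. This is only a sketch under assumptions: `LinkCollector` and `uniqueHttpLinks` are made-up names, and the regular expression is a simplification that a real HTML parser such as Jsoup would handle more robustly. Like the original filter, it keeps only `href` values that start with `http://`.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects distinct http:// links from an HTML string.
final class LinkCollector {

    // Matches the quoted href of an <a> tag when it starts with http:// (case-insensitive).
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(http://[^\"']*)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the links in first-seen order, without duplicates.
    static List<String> uniqueHttpLinks(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            seen.add(m.group(2));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up markup: one link appears twice, one is relative.
        String html = "<a href=\"http://example.com/a\">first</a>"
                + "<a href=\"http://example.com/a\">again</a>"
                + "<a href=\"/relative\">skipped</a>";
        for (String link : uniqueHttpLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A `LinkedHashSet` is used so that the first occurrence of each link keeps its position, which makes the output comparable with the order in which the links appear on the page.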