How to use XMLHttpRequest to download an HTML page in the background and extract a text element from it?


I want to write a Greasemonkey script that, while you are on URL_1, fetches and parses the whole HTML page of URL_2 in the background in order to extract a text element from it.

To be specific, I want to download the whole page's HTML (a Rotten Tomatoes page) in the background, store it in a variable, and then use getElementsByClassName(...)[0] to extract the text I want from the element with class name "critic_consensus".


I found the MDN article "HTML in XMLHttpRequest", which led me to this code that unfortunately does not work:

var xhr = new XMLHttpRequest();
xhr.onload = function() {
  alert(this.responseXML.getElementsByClassName("critic_consensus")[0].innerHTML);
}
xhr.open("GET", "http://www.rottentomatoes.com/m/godfather/",true);
xhr.responseType = "document";
xhr.send();

It shows this error message when I run it in Firefox Scratchpad:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://www.rottentomatoes.com/m/godfather/. This can be fixed by moving the resource to the same domain or enabling CORS.


PS. The reason why I don't use the Rotten Tomatoes API is that they've removed the critics consensus from it.


Solution

  • For cross-origin requests, where the fetched site has not helpfully set a permissive CORS policy, Greasemonkey provides the GM_xmlhttpRequest() function. (Most other userscript engines also provide this function.)

    GM_xmlhttpRequest is expressly designed to allow cross-origin requests.

    To extract the target information, parse the response text with a DOMParser. Do not parse it with jQuery methods, because doing so causes extraneous images, scripts, and objects to load, which slows things down or can crash the page.

    Here's a complete script that illustrates the process:

    // ==UserScript==
    // @name        _Parse Ajax Response for specific nodes
    // @include     http://stackoverflow.com/questions/*
    // @require     http://ajax.googleapis.com/ajax/libs/jquery/2.1.0/jquery.min.js
    // @grant       GM_xmlhttpRequest
    // ==/UserScript==
    
    GM_xmlhttpRequest ( {
        method: "GET",
        url:    "http://www.rottentomatoes.com/m/godfather/",
        onload: function (response) {
            var parser  = new DOMParser ();
            /* IMPORTANT!
                1) For Chrome, see
                https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#DOMParser_HTML_extension_for_other_browsers
                for a work-around.
    
                2) jQuery.parseHTML() and similar are bad because they cause images, etc., to be loaded.
            */
            var doc         = parser.parseFromString (response.responseText, "text/html");
            var criticTxt   = doc.getElementsByClassName ("critic_consensus")[0].textContent;
    
            // Insert as text, not markup, so the fetched content cannot inject HTML.
            $("body").prepend ( $("<h1>").text (criticTxt) );
        },
        onerror: function (e) {
            console.error ('**** error ', e);
        },
        onabort: function (e) {
            console.error ('**** abort ', e);
        },
        ontimeout: function (e) {
            console.error ('**** timeout ', e);
        }
    } );
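
The onload handler in the script assumes the fetched page always contains a "critic_consensus" element. If the site's markup changes, `[0]` is undefined and reading `.textContent` throws. A minimal sketch of a guard follows. The helper name `firstTextByClass` is hypothetical and not part of the original script. It only assumes that `doc` exposes `getElementsByClassName`, as the document returned by `DOMParser` does.

```javascript
// Hypothetical helper (not in the original answer): returns the trimmed text
// of the first element with the given class name, or null if there is none.
function firstTextByClass(doc, className) {
    var nodes = doc.getElementsByClassName(className);
    if (!nodes || nodes.length === 0) {
        return null;    // Element missing, e.g. the site changed its markup.
    }
    return nodes[0].textContent.trim();
}
```

Inside onload, this could replace the direct `[0].textContent` access: call `firstTextByClass (doc, "critic_consensus")` and log an error instead of updating the page when the result is null.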