Tags: javascript, python, google-chrome, web-scraping, screen-scraping

Using the Inspect Element feature in Google Chrome to scrape web sites


I am trying to scrape a web site. Traditional HTML parsing, through "urllib2.urlopen" in Python or "htmlTreeParse" in R, fails to get the data from the page. The server does this intentionally, so that View Source won't show the displayed data; but when I use the Inspect Element feature in Google Chrome (by right-clicking the page), I can see the data (a list of items and their info). My question is: how can I programmatically launch the desired pages and save the inspected DOM for each one? Alternatively, a program that launches these links and somehow uses Ctrl-S to save an HTML copy of each link to the local disk would also work.


Solution

  • You can use Greasemonkey or Tampermonkey to do this quite easily. You simply define the URL(s) in your userscript, then navigate to the page to invoke it. You can use a top page containing an iframe that navigates to each page on a schedule; when the page shows in the iframe, the userscript runs and your data is saved.

    The scripting is basic JavaScript, nothing fancy; let me know if you need a starter. The biggest catch would be downloading the file, a fairly new capability for JS, but simple to do using a download library, like mine (shameless plug).

    So, basically, you can have a textarea with a list of URLs, one per line; grab a line and set the iframe's .src to the URL, invoking the userscript (a rough sketch of this follows). You can drill down into the page with CSS query selectors, or save the whole page by grabbing the .outerHTML of the tag whose code you need. I'll be happy to illustrate if need be, but once you get it working, you'll never go back to server-to-server scraping again.
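    For example, a minimal sketch of that textarea-driven dispatcher (the element IDs, the Start button, and the 30-second delay are my own placeholders, not part of the original):

    <html>
    <textarea id=urls rows=10 cols=80></textarea>
    <button onclick=start()>Start</button>
    <iframe id=frame1></iframe>
    <script>
    var queue=[], timer=null;
    
    function start(){
      queue=urls.value.split(/\r?\n/).filter(Boolean); //one url per line, skipping blanks
      if(timer){ clearInterval(timer); }
      timer=setInterval(doNext, 1000*30); //30 sec between pages, adjust if needed
      doNext(); //load the first page right away
    }
    
    function doNext(){
      var url=queue.shift();
      if(!url){ clearInterval(timer); return; } //stop once the list is exhausted
      frame1.src=url; //navigating the iframe triggers the userscript
    }
    </script>
    </html>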

    EDIT:

    A framing dispatcher page that simply loads each needed page into an iframe, thus triggering the userscript:

    <html>
    <iframe id=frame1></iframe>
    <script>
    var base="http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start="; //the part of the url that stays the same
    var pages=[20, 40, 60, 80];  //all the differing url parts to be concat'd at the end
    var delay= 1000 * 30; //30 sec delay, adjust if needed
    var slot=0; //running counter; slot % pages.length picks the page to show
    
    function doNext(){
      var page=pages[slot++ % pages.length]; //wrap back to the first page once the list ends
      frame1.src=base+page; //elements with an id are reachable as globals in browsers
    }
    
    doNext(); //load the first page immediately instead of waiting out the first delay
    setInterval(doNext, delay);
    </script>
    </html>
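
    Note that this dispatcher cycles through pages indefinitely, which is handy for re-polling but means it never stops on its own; also make sure delay is comfortably longer than one page load plus the download, or raise it.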
    

    EDIT2: the userscript code:

    // ==UserScript==
    // @name       yelp scraper
    // @namespace  http://anon.org
    // @version    0.1
    // @description  grab listing from yelp
    // @match     http://www.yelp.ca/search?cflt=coffee&find_loc=Toronto,%20ON&start=*
    // @copyright  2013, dandavis
    // ==/UserScript==
    
    
    //Q(css [, root]): tiny querySelectorAll wrapper that returns a plain Array of matching elements
    function Q(a,b){var t="querySelectorAll";b=b||document.documentElement;if(!b[t]){return}if(b.split){b=Q(b)[0]}return [].slice.call(b[t](a))||[]}
    
    //download(data, fileName, mimeType): saves a string to disk via a data: URI;
    //uses an <a download> click where supported, falling back to a hidden iframe
    function download(strData,strFileName,strMimeType){var D=document,A=arguments,a=D.createElement("a"),d=A[0],n=A[1],t=A[2]||"text/plain";a.href="data:"+strMimeType+","+escape(strData);if('download'in a){a.setAttribute("download",n);a.innerHTML="downloading...";D.body.appendChild(a);setTimeout(function(){var e=D.createEvent("MouseEvents");e.initMouseEvent("click",true,false,window,0,0,0,0,0,false,false,false,false,0,null);a.dispatchEvent(e);D.body.removeChild(a);},66);return true;};var f=D.createElement("iframe");D.body.appendChild(f);f.src="data:"+(A[2]?A[2]:"application/octet-stream")+(window.btoa?";base64":"")+","+(window.btoa?window.btoa:escape)(strData);setTimeout(function(){D.body.removeChild(f);},333);return true;}
    
    //once a matching listing page finishes loading, grab the results container and save it:
    window.addEventListener("load", function(){
      var code=Q("#businessresults")[0].outerHTML;
      download(code, "yelp_page_"+location.href.split("start=")[1].split("&")[0]+".txt", "x-application/nothing");
    });
    

    Note that it saves the HTML as .txt to avoid a Chrome warning about potentially harmful files. You can rename the files in bulk, or try making up a new extension and associating it with a browser.
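    If you want just the data instead of a blob of HTML, you can drill down with the same Q() helper and save a CSV instead. A minimal sketch, reusing Q() and download() from the userscript above (the ".biz-name" selector is my placeholder; inspect the live page for the real class names):

    window.addEventListener("load", function(){
      //pull just the business names out of the results container
      //(".biz-name" is an assumed selector; check the page's real markup):
      var rows=Q("#businessresults .biz-name").map(function(el){
        return JSON.stringify(el.textContent.trim()); //quote each value for CSV safety
      });
      download(rows.join("\n"), "yelp_names_"+location.href.split("start=")[1].split("&")[0]+".csv", "text/csv");
    });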

    EDIT: I forgot to mention: for unattended use, turn off the file-saving confirmation in Chrome: Settings → Show advanced settings... → uncheck "Ask where to save each file before downloading".