
Scraping JavaScript-generated data


I'm working on a project with the World Bank analyzing their procurement processes.

The WB maintains websites for each of their projects, containing links and data for the associated contracts issued (example). Contract-related data is available under the procurement tab.

I'd like to be able to pull a project's contract information from this site, but the links and associated data are generated using embedded JavaScript, and the URLs of the pages displaying contract awards and other data don't seem to follow a discernible schema (example).

Is there any way I can scrape the browser-rendered data in the first example through R?


Solution

  • The main page calls a JavaScript function:

    javascript:callTabContent('p','P090644','','en','procurement','procurementId');
    

    The key argument here is the project ID, P090644. Together with the required language code, en, it is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.

    This form call can be replicated with the URL http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.

    The following code extracts the contract description URLs from that page:

    library(XML)
    
    projID <- "P090644"
    projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)
    
    # Parse the procurement page and collect the links inside the contract awards table
    pdData <- htmlParse(projDetails)
    pdDescriptions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')
    
    #> pdDescriptions
    #                                                                 href 
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005718" 
    #                                                                 href 
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005702" 
    #                                                                 href 
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005709" 
    #                                                                 href 
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005715" 
    

    Note that Excel exports are also provided, which may be useful to you. They may contain the data you intend to scrape from the description links:

    library(gdata)  # read.xls() needs a working Perl installation
    
    # Excel exports for procurement notices, contract awards and contract data
    procNotice <- paste0("http://search.worldbank.org/wprocnotices/projectdetails/", projID, ".xls")
    conAward <- paste0("http://search.worldbank.org/wcontractawards/projectdetails/", projID, ".xls")
    conData <- paste0("http://search.worldbank.org/wcontractdata/projectdetails/", projID, ".xls")
    
    # Read each spreadsheet into a data frame
    pnData <- read.xls(procNotice)
    caData <- read.xls(conAward)
    cdData <- read.xls(conData)
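
    If you need these files for several projects, the three URL patterns above could be wrapped in a small helper. This is only a sketch: the function name wbExcelUrls is made up for illustration, and it assumes the export URLs follow the same pattern for every project ID.

    ```r
    # Hypothetical helper (not part of any package): build the three Excel
    # export URLs for one project ID from the patterns shown above.
    wbExcelUrls <- function(projID) {
      base <- "http://search.worldbank.org"
      c(procNotice = paste0(base, "/wprocnotices/projectdetails/", projID, ".xls"),
        conAward   = paste0(base, "/wcontractawards/projectdetails/", projID, ".xls"),
        conData    = paste0(base, "/wcontractdata/projectdetails/", projID, ".xls"))
    }

    urls <- wbExcelUrls("P090644")
    ```

    Each element of urls can then be passed to read.xls, for example read.xls(urls[["conAward"]]).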
    

    UPDATE:

    To find out what is being posted, we can examine what happens when the JavaScript function is called. Using Firebug or a similar tool, we can intercept the request, whose header starts:

    POST /p2e/procurement.html HTTP/1.1
    Host: www.worldbank.org
    

    and has parameters:

    lang=en
    projId=P090644
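
    The same POST could be sent from R instead of using the query-string URL. The sketch below assumes the httr package is installed and that the endpoint still accepts these two form fields; the helper name fetchProcurement is hypothetical.

    ```r
    # Form fields seen in the intercepted request
    formFields <- list(lang = "en", projId = "P090644")

    # Hypothetical helper: POST the fields as a URL-encoded form and
    # return the parsed HTML that the server sends back.
    fetchProcurement <- function(fields,
                                 url = "http://www.worldbank.org/p2e/procurement.html") {
      resp <- httr::POST(url, body = fields, encode = "form")
      httr::stop_for_status(resp)  # stop with an error on an HTTP failure status
      txt <- httr::content(resp, as = "text", encoding = "UTF-8")
      XML::htmlParse(txt, asText = TRUE)
    }

    # pdData <- fetchProcurement(formFields)  # requires network access
    ```

    The returned document can be queried with xpathSApply in the same way as the result of htmlParse on the URL.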
    

    Alternatively, we can examine the JavaScript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:

    function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
        if (tabparam == 'n' || tabparam == 'h') {
            $.ajax( {
                type : "POST",
                url : contextPath + "/p2e/"+htmlId+".html",
                data : "projId=" + projIdParam + "&lang=" + langCd,
                success : function(msg) {
                    if(tabparam=="n"){
                        $("#newsfeed").replaceWith(msg);
                    } else{
                        $("#cycle").replaceWith(msg);
                    }
                    stickNotes();
                }
            });
        } else {
            $.ajax( {
                type : "POST",
                url : contextPath + "/p2e/"+htmlId+".html",
                data : "projId=" + projIdParam + "&lang=" + langCd,
                success : function(msg) {
                    $("#tabContent").replaceWith(msg);
                    $('#map_container').hide();
                    changeAlternateColors();
                    $("#tab_menu a").removeClass("selected");
                    $('#'+anchorTagId).addClass("selected");                
                    stickNotes();
                }
            });
        }
    }
    

    Examining the body of the function, we can see that it simply posts the relevant parameters to a form and then updates the web page with the response.
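
    The request that callTabContent builds can be mirrored in R for any tab. This is a rough sketch: tabRequest is a made-up name, and the default contextPath of http://www.worldbank.org is an assumption based on the Host header above (the page itself passes an empty string). Only the procurement tab was checked here, so other htmlId values may behave differently.

    ```r
    # Hypothetical R counterpart of the request built in callTabContent:
    #   url  = contextPath + "/p2e/" + htmlId + ".html"
    #   data = "projId=" + projIdParam + "&lang=" + langCd
    tabRequest <- function(projId, htmlId, langCd = "en",
                           contextPath = "http://www.worldbank.org") {  # assumed host
      list(url  = paste0(contextPath, "/p2e/", htmlId, ".html"),
           data = paste0("projId=", projId, "&lang=", langCd))
    }

    req <- tabRequest("P090644", "procurement")

    # Query-string form of the same request, as used for the procurement tab
    getUrl <- paste0(req$url, "?", req$data)
    ```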