Tags: r, scrape, bubble-chart

Search website for phrase in R


I'd like to understand what applications of machine learning are being developed by the US federal government. The federal government maintains the website FedBizOpps, which lists contract opportunities. The site can be searched for a phrase, e.g. "machine learning", and a date range, e.g. "last 365 days", to find relevant contracts. The resulting search produces links to contract summaries.

I'd like to be able to pull the contract summaries, given a search term and a date range, from this site.

Is there any way I can scrape the browser-rendered data into R? A similar question exists on web scraping, but I don't know how to change the date range.

Once the information is pulled into R, I'd like to organize the summaries with a bubble chart of key phrases.


Solution

  • This may look like a site that uses XHR via JavaScript to retrieve the URL contents, but it's not. It's just a plain web site that can easily be scraped via standard rvest & xml2 calls like html_session and read_html. It does keep the Location: URL the same, so it kinda looks like XHR even though it's not.

    However, this is a <form>-based site, which means you could be generous to the community and write an R wrapper for the "hidden" API and possibly donate it to rOpenSci.

    To that end, I used the curlconverter package on the "Copy as cURL" content from the POST request and it provided all the form fields (which seem to map to most — if not all — of the fields on the advanced search page):

    library(curlconverter)
    
    # straighten() parses the "Copy as cURL" text (read from the clipboard by
    # default) and make_req() turns each captured request into a callable httr function
    make_req(straighten())[[1]] -> req
    
    httr::VERB(verb = "POST", url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
        httr::add_headers(Pragma = "no-cache", 
            Origin = "https://www.fbo.gov", 
            `Accept-Encoding` = "gzip, deflate, br", 
            `Accept-Language` = "en-US,en;q=0.8", 
            `Upgrade-Insecure-Requests` = "1", 
            `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.41 Safari/537.36", 
            Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
            `Cache-Control` = "no-cache", 
            Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
            Connection = "keep-alive", 
            DNT = "1"), httr::set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
            sympcsm_cookies_enabled = "1", 
            BALANCEID = "balancer.172.16.121.7"), 
        body = list(`dnf_class_values[procurement_notice][keywords]` = "machine+learning", 
            `dnf_class_values[procurement_notice][_posted_date]` = "365", 
            search_filters = "search", 
            `_____dummy` = "dnf_", 
            so_form_prefix = "dnf_", 
            dnf_opt_action = "search", 
            dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
            dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
            dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
            dnf_opt_finalize = "0", 
            dnf_opt_mode = "update", 
            dnf_opt_target = "", dnf_opt_validate = "1", 
            `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
            `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
            `dnf_class_values[procurement_notice][posted]` = "", 
            `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
            `dnf_class_values[procurement_notice][agency]` = "", 
            `dnf_class_values[procurement_notice][zipstate]` = "", 
            `dnf_class_values[procurement_notice][procurement_type][]` = "", 
            `dnf_class_values[procurement_notice][set_aside][]` = "", 
            mode = "list"), encode = "form")
    

    curlconverter adds the httr:: prefixes to the various functions because what it returns is a bona fide R function — you can call req() directly to replay the request.
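    A quick sketch of what that means in practice (this hits the live site, and the hard-coded session cookies in the captured request may have expired by the time you run it):

```r
library(httr)

# `req` is the function built by make_req(straighten())[[1]] above
res <- req()       # replays the captured POST verbatim
status_code(res)   # 200 while the copied PHPSESSID is still valid
```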

    However, most of the data being passed in is browser "cruft", and the call can be trimmed down a bit into a leaner POST() request:

    library(httr)
    library(rvest)
    
    POST(url = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list", 
         add_headers(Origin = "https://www.fbo.gov", 
                     Referer = "https://www.fbo.gov/index?s=opportunity&mode=list&tab=list"), 
         set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                     sympcsm_cookies_enabled = "1", 
                     BALANCEID = "balancer.172.16.121.7"), 
         body = list(`dnf_class_values[procurement_notice][keywords]` = "machine learning", # encode = "form" URL-encodes the space for us
                     `dnf_class_values[procurement_notice][_posted_date]` = "365", 
                     search_filters = "search", 
                     `_____dummy` = "dnf_", 
                     so_form_prefix = "dnf_", 
                     dnf_opt_action = "search", 
                     dnf_opt_template = "VVY2VDwtojnPpnGoobtUdzXxVYcDLoQW1MDkvvEnorFrm5k54q2OU09aaqzsSe6m", 
                     dnf_opt_template_dir = "Pje8OihulaLVPaQ+C+xSxrG6WrxuiBuGRpBBjyvqt1KAkN/anUTlMWIUZ8ga9kY+", 
                     dnf_opt_subform_template = "qNIkz4cr9hY8zJ01/MDSEGF719zd85B9", 
                     dnf_opt_finalize = "0", 
                     dnf_opt_mode = "update", 
                     dnf_opt_target = "", dnf_opt_validate = "1", 
                     `dnf_class_values[procurement_notice][dnf_class_name]` = "procurement_notice", 
                     `dnf_class_values[procurement_notice][notice_id]` = "63ae1a97e9a5a9618fd541d900762e32", 
                     `dnf_class_values[procurement_notice][posted]` = "", 
                     `autocomplete_input_dnf_class_values[procurement_notice][agency]` = "", 
                     `dnf_class_values[procurement_notice][agency]` = "", 
                     `dnf_class_values[procurement_notice][zipstate]` = "", 
                     `dnf_class_values[procurement_notice][procurement_type][]` = "", 
                     `dnf_class_values[procurement_notice][set_aside][]` = "",
                     mode="list"), 
         encode = "form") -> res
    

    This portion:

         set_cookies(PHPSESSID = "32efd3be67d43758adcc891c6f6814c4", 
                     sympcsm_cookies_enabled = "1", 
                     BALANCEID = "balancer.172.16.121.7")
    

    makes me think you should use html_session() or GET() at least once on the main URL to establish those cookies in the cached curl handle (which is created & maintained automagically for you).
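    A minimal sketch of that idea — fetch the search page once so the server assigns the cookies, rather than pasting values copied from the browser:

```r
library(httr)

# One GET against the main URL; httr caches the curl handle per-domain, so
# whatever cookies the server sets here (PHPSESSID, BALANCEID, ...) ride
# along automatically on the later POST() to the same host
init <- GET("https://www.fbo.gov/index?s=opportunity&mode=list&tab=list")
cookies(init)  # inspect what the server handed back
```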

    The add_headers() bit may also not be necessary but that's an exercise left for the reader.

    You can find the table you're looking for via:

    content(res, as="text", encoding="UTF-8") %>% 
      read_html() %>% 
      html_nodes("table.list") %>% 
      html_table() %>% 
      dplyr::glimpse()
    ## Observations: 20
    ## Variables: 4
    ## $ Opportunity            <chr> "NSN: 1650-01-074-1054; FILTER ELEMENT, FLUID; WSIC: L SP...
    ## $ Agency/Office/Location <chr> "Defense Logistics Agency DLA Acquisition LocationsDLA Av...
    ## $ Type /  Set-aside      <chr> "Presolicitation", "Presolicitation", "Award", "Award", "...
    ## $ Posted On              <chr> "Sep 28, 2016", "Sep 28, 2016", "Sep 28, 2016", "Sep 28, ...
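
    From there, the key-phrase bubble chart the question asks for could be sketched roughly like this — the toy df below is a stand-in for the first data frame returned by html_table() above, and the base-R tokenization is deliberately crude (a real pass would want a proper stop-word list):

```r
library(ggplot2)

# Stand-in for the scraped table; in practice df <- html_table(...)[[1]]
df <- data.frame(
  Opportunity = c("Machine Learning for Predictive Maintenance",
                  "Machine Learning Data Analytics Support",
                  "Filter Element, Fluid"),
  stringsAsFactors = FALSE
)

# Crude tokenization: split on non-letters, lowercase, drop short words
words <- tolower(unlist(strsplit(df$Opportunity, "[^[:alpha:]]+")))
words <- words[nchar(words) > 3]
freq  <- as.data.frame(table(word = words), stringsAsFactors = FALSE)
top   <- head(freq[order(-freq$Freq), ], 25)

# "Bubble chart": one bubble per term, sized by how often it appears
ggplot(top, aes(x = reorder(word, Freq), y = Freq, size = Freq)) +
  geom_point(alpha = 0.6) +
  coord_flip() +
  labs(x = NULL, y = "frequency", size = "count")
```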
    

    There's an indicator on the page saying these are results "1 - 20 of 2008". You need to scrape that as well and deal with the paginated results. This is also left as an exercise to the reader.
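
    The page arithmetic for that loop is at least straightforward; note the form field that selects a page is not shown in the capture above and would need to be confirmed with another "Copy as cURL" from page 2 of the results:

```r
# Pagination arithmetic for a "1 - 20 of 2008" results indicator
total_results <- 2008
per_page      <- 20
n_pages       <- ceiling(total_results / per_page)  # 101 pages

# for (pg in seq_len(n_pages)) { ... re-POST with the (to-be-captured) page
#   field set to pg, Sys.sleep() between requests to be polite, and rbind
#   the html_table() results together ... }
n_pages
```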