Search code examples
rweb-scrapingscreen-scrapingdashboardrvest

Scraping PHP dashboard with R (rvest)


I am trying to scrape Bangladesh COVID-19 data (number of tests, number of positive tests, positive rate) from the official website: http://103.247.238.92/webportal/pages/covid19.php

The website contains 3 drop-down menus to arrive at the data: Select Division; Select District; Select time frame for the data.

I have tried the following so far:

url <- "http://103.247.238.92/webportal/pages/covid19.php"
webpage <- read_html(url)

webpage has the following:

List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Since this did not help, I also tried the following based on this question:

a <- GET(url)
a <- content(a, as="text")
a <- gsub("^angular.callbacks._2\\(", "", a)
a <- gsub("\\);$", "", a)
df <- fromJSON(a, simplifyDataFrame = TRUE)

The above returns the following error:

Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <!-- This is a 
                     (right here) ------^

So I am really lost in terms of how I can even read the data - but upon looking at the source of the webpage, I know that the data is right there: Safari Website inspector

Any suggestions on how I can read this data?

Additionally, if someone could help with how I can go about selecting the different drop-down menu items, that would be really appreciated. The final goal is to collect data for each district in each division for the last 12 months.


Solution

  • tl;dr

    The page makes additional requests to pick up that info. Those additional requests rely on combinations of ids; an id pulled from the option element value attribute, of each option within Division dropdown, in tandem with an id pulled from the option element value attribute of each option within the District dropdown.

    You can make an initial request to get all the Division dropdown ids:

    divisions <- options_df("#division option:nth-child(n+2)", "division") 
    

    nth-child(n+2) is used to exclude the initial 'select' option.

    This returns a dataframe with the initial divisionIDs and friendly division names.

    enter image description here

    Those ids can then be used to retrieve the associated districtIDs (the options which become available in the second dropdown after making your selection in the first):

    districts <- pmap_dfr(
      list(divisions$divisionID),
      ~ {
        df_districts <- districts_from_updated_session(.x, "district") %>%
          mutate(
            divisionID = .x
          )
        return(df_districts)
      }
    )
    

    This returns a dataframe mapping the divisionID to all the associated districtIDs, as well as the friendly district names:

    enter image description here

    By including the divisionID in both dataframes I can inner-join them:

    div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)
    

    enter image description here

    Up until now, I have been using a session object for the efficiency of tcp re-use. Unfortunately, I couldn't find anything in the documentation covering how to update an already open session allowing for sending a new POST request with dynamic body argument. Instead, I leveraged furrr::future_map to try and gain some efficiencies through parallel processing:

    df <- div_district %>%
      mutate(json = furrr::future_map(divisionID, .f = get_covid_data, districtID))
    

    To get the final covid numbers, via get_covid_data(), I leverage some perhaps odd behaviour of the server, in that I could make a GET, passing divisionID and districtID within the body, then regex out part of the jquery datatables scripting, string clean that into a json valid string, then read that into a json object stored in the json column of the final dataframe.

    enter image description here

    Inside of the json column


    R:

    library(httr)
    #> Warning: package 'httr' was built under R version 4.0.3
    library(rvest)
    #> Loading required package: xml2
    #> Warning: package 'xml2' was built under R version 4.0.3
    library(tidyverse)
    #> Warning: package 'tibble' was built under R version 4.0.3
    #> Warning: package 'forcats' was built under R version 4.0.3
    library(jsonlite)
    #> Warning: package 'jsonlite' was built under R version 4.0.3
    #> 
    #> Attaching package: 'jsonlite'
    #> The following object is masked from 'package:purrr':
    #> 
    #>     flatten
    library(furrr)
    #> Warning: package 'furrr' was built under R version 4.0.3
    #> Loading required package: future
    #> Warning: package 'future' was built under R version 4.0.3
    
    ## to clean out everything before a run
    # rm(list = ls(all = TRUE))
    # invisible(lapply(paste0('package:', names(sessionInfo()$otherPkgs)), detach, character.only=TRUE, unload=TRUE)) # https://stackoverflow.com/a/39235076 @mmfrgmpds
    
    #returns value:text for options e.g. divisions/districts (dropdown)
    options_df <- function(css_selector, level) {
      nodes <- session %>% html_nodes(css_selector)
      options <- nodes %>% map_df(~ c(html_attr(., "value"), html_text(.)) %>%
        set_names(paste0(level, "ID"), level))
      return(options)
    }
    
    #returns districts associated with division
    districts_from_updated_session <- function(division_id, level) {
      session <- jump_to(session, paste0("http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id=", division_id))
      return(options_df("#district option:nth-child(n+2)", level))
    }
    
    # returns json object housing latest 12 month covid numbers by divisionID + districtID pairing
    get_covid_data <- function(divisionID, districtID) {
      headers <- c(
        "user-agent" = "Mozilla/5.0",
        "if-modified-since" = "Wed, 08 Jul 2020 00:00:00 GMT" # to mitigate for caching
      )
    
      data <- list("division" = divisionID, "district" = districtID, "period" = "LAST_12_MONTH", "Submit" = "Search")
    
      r <- httr::GET(url = "http://103.247.238.92/webportal/pages/covid19.php", httr::add_headers(.headers = headers), body = data)
    
      data <- stringr::str_match(content(r, "text"), "DataTable\\((\\[[\\s\\S]+\\])\\)")[1, 2] %>% #clean up extracted string so can be parsed as valid json
        gsub("role", '"role"', .) %>%
        gsub("'", '"', .) %>%
        gsub(",\\s+\\]", "]", .) %>%
        str_squish() %>%
        jsonlite::parse_json()
      return(data)
    }
    
    
    url <- "http://103.247.238.92/webportal/pages/covid19.php"
    headers <- c("User-Agent" = "Mozilla/4.0", "Referer" = "http://103.247.238.92/webportal/pages/covid19.php")
    
    session <- html_session(url, httr::add_headers(.headers = headers)) #for tcp re-use
    
    divisions <- options_df("#division option:nth-child(n+2)", "division") #nth-child(n+2) to exclude initial 'select' option
    
    districts <- pmap_dfr(
      list(divisions$divisionID),
      ~ {
        df <- districts_from_updated_session(.x, "district") %>%
          mutate(
            divisionID = .x
          )
        return(df)
      }
    )
    
    div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)
    
    no_cores <- future::availableCores() - 1
    
    future::plan(future::multisession, workers = no_cores)
    
    df <- div_district %>%
      mutate(json = future_map(divisionID, .f = get_covid_data, districtID))
    

    Created on 2021-03-04 by the reprex package (v0.3.0)


    Py

    import requests, re, ast
    from bs4 import BeautifulSoup as bs
    
    def options_dict(soup, css_selector):
        options = {i.text:i['value'] for i in soup.select(css_selector) if i['value']}
        return options
    
    def covid_numbers(text):
        covid_data = p.findall(text)[0]
        covid_data = re.sub(r'\n\s+', '', covid_data.replace("role","'role'"))
        covid_data = ast.literal_eval(covid_data) 
        return covid_data
    
    url = 'http://103.247.238.92/webportal/pages/covid19.php'
    regions = {}
    result = {}
    p = re.compile(r'DataTable\((\[[\s\S]+\])\)')
    
    with requests.Session() as s: 
        s.headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://103.247.238.92/webportal/pages/covid19.php'}
        soup = bs(s.get(url).content, 'lxml')
        divisions = options_dict(soup, '#division option')
    
        for k,v in divisions.items():
            r = s.get(f'http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id={v}')
            soup = bs(r.content, 'lxml')
            districts = options_dict(soup, '#district option')
            regions[k] = districts
       
        s.headers = {'User-Agent': 'Mozilla/5.0','if-modified-since': 'Wed, 08 Jul 2020 22:27:07 GMT'}
        
        for k,v in divisions.items():
            result[k] = {}
            for k2,v2 in regions.items():    
                data = {'division': k2, 'district': v2, 'period': 'LAST_12_MONTH', 'Submit': 'Search'}
                r = s.get('http://103.247.238.92/webportal/pages/covid19.php',  data=data)
                result[k][k2] = covid_numbers(r.text)