r web-scraping screen-scraping dashboard rvest

Scraping PHP dashboard with R (rvest)

I am trying to scrape Bangladesh COVID-19 data (number of tests, number of positive tests, positive rate) from the official website: http://103.247.238.92/webportal/pages/covid19.php

The website contains 3 drop-down menus to arrive at the data: Select Division; Select District; Select time frame for the data.

I have tried the following so far:

url <- "http://103.247.238.92/webportal/pages/covid19.php"
webpage <- read_html(url)

webpage has the following:

List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

Since this did not help, I also tried the following based on this question:

a <- GET(url)
a <- content(a, as="text")
a <- gsub("^angular.callbacks._2\\(", "", a)
a <- gsub("\\);$", "", a)
df <- fromJSON(a, simplifyDataFrame = TRUE)

The above returns the following error:

Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <!-- This is a 
                     (right here) ------^

So I am really lost in terms of how I can even read the data - but upon looking at the source of the webpage, I know that the data is right there: Safari Website inspector

Any suggestions on how I can read this data?

Additionally, if someone could help with how I can go about selecting the different drop-down menu items, that would be really appreciated. The final goal is to collect data for each district in each division for the last 12 months.

Solution

tl;dr

The page makes additional requests to pick up that info. Those additional requests rely on combinations of ids; an id pulled from the option element value attribute, of each option within Division dropdown, in tandem with an id pulled from the option element value attribute of each option within the District dropdown.

You can make an initial request to get all the Division dropdown ids:

divisions <- options_df("#division option:nth-child(n+2)", "division")

nth-child(n+2) is used to exclude the initial 'select' option.

This returns a dataframe with the initial divisionIDs and friendly division names.

Those ids can then be used to retrieve the associated districtIDs (the options which become available in the second dropdown after making your selection in the first):

districts <- pmap_dfr(
  list(divisions$divisionID),
  ~ {
    df_districts <- districts_from_updated_session(.x, "district") %>%
      mutate(
        divisionID = .x
      )
    return(df_districts)
  }
)

This returns a dataframe mapping the divisionID to all the associated districtIDs, as well as the friendly district names:

By including the divisionID in both dataframes I can inner-join them:

div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)

Up until now, I have been using a session object for the efficiency of tcp re-use. Unfortunately, I couldn't find anything in the documentation covering how to update an already open session allowing for sending a new POST request with dynamic body argument. Instead, I leveraged furrr::future_map to try and gain some efficiencies through parallel processing:

df <- div_district %>%
  mutate(json = furrr::future_map(divisionID, .f = get_covid_data, districtID))

To get the final covid numbers, via get_covid_data(), I leverage some perhaps odd behaviour of the server, in that I could make a GET, passing divisionID and districtID within the body, then regex out part of the jquery datatables scripting, string clean that into a json valid string, then read that into a json object stored in the json column of the final dataframe.

Inside of the json column

library(httr)
#> Warning: package 'httr' was built under R version 4.0.3
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.0.3
#> Warning: package 'forcats' was built under R version 4.0.3
library(jsonlite)
#> Warning: package 'jsonlite' was built under R version 4.0.3
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten
library(furrr)
#> Warning: package 'furrr' was built under R version 4.0.3
#> Loading required package: future
#> Warning: package 'future' was built under R version 4.0.3

## to clean out everything before a run
# rm(list = ls(all = TRUE))
# invisible(lapply(paste0('package:', names(sessionInfo()$otherPkgs)), detach, character.only=TRUE, unload=TRUE)) # https://stackoverflow.com/a/39235076 @mmfrgmpds

#returns value:text for options e.g. divisions/districts (dropdown)
options_df <- function(css_selector, level) {
  nodes <- session %>% html_nodes(css_selector)
  options <- nodes %>% map_df(~ c(html_attr(., "value"), html_text(.)) %>%
    set_names(paste0(level, "ID"), level))
  return(options)
}

#returns districts associated with division
districts_from_updated_session <- function(division_id, level) {
  session <- jump_to(session, paste0("http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id=", division_id))
  return(options_df("#district option:nth-child(n+2)", level))
}

# returns json object housing latest 12 month covid numbers by divisionID + districtID pairing
get_covid_data <- function(divisionID, districtID) {
  headers <- c(
    "user-agent" = "Mozilla/5.0",
    "if-modified-since" = "Wed, 08 Jul 2020 00:00:00 GMT" # to mitigate for caching
  )

  data <- list("division" = divisionID, "district" = districtID, "period" = "LAST_12_MONTH", "Submit" = "Search")

  r <- httr::GET(url = "http://103.247.238.92/webportal/pages/covid19.php", httr::add_headers(.headers = headers), body = data)

  data <- stringr::str_match(content(r, "text"), "DataTable\\((\\[[\\s\\S]+\\])\\)")[1, 2] %>% #clean up extracted string so can be parsed as valid json
    gsub("role", '"role"', .) %>%
    gsub("'", '"', .) %>%
    gsub(",\\s+\\]", "]", .) %>%
    str_squish() %>%
    jsonlite::parse_json()
  return(data)
}


url <- "http://103.247.238.92/webportal/pages/covid19.php"
headers <- c("User-Agent" = "Mozilla/4.0", "Referer" = "http://103.247.238.92/webportal/pages/covid19.php")

session <- html_session(url, httr::add_headers(.headers = headers)) #for tcp re-use

divisions <- options_df("#division option:nth-child(n+2)", "division") #nth-child(n+2) to exclude initial 'select' option

districts <- pmap_dfr(
  list(divisions$divisionID),
  ~ {
    df <- districts_from_updated_session(.x, "district") %>%
      mutate(
        divisionID = .x
      )
    return(df)
  }
)

div_district <- dplyr::inner_join(divisions, districts, by = "divisionID", copy = FALSE)

no_cores <- future::availableCores() - 1

future::plan(future::multisession, workers = no_cores)

df <- div_district %>%
  mutate(json = future_map(divisionID, .f = get_covid_data, districtID))

^{Created on 2021-03-04 by the reprex package (v0.3.0)}

import requests, re, ast
from bs4 import BeautifulSoup as bs

def options_dict(soup, css_selector):
    options = {i.text:i['value'] for i in soup.select(css_selector) if i['value']}
    return options

def covid_numbers(text):
    covid_data = p.findall(text)[0]
    covid_data = re.sub(r'\n\s+', '', covid_data.replace("role","'role'"))
    covid_data = ast.literal_eval(covid_data) 
    return covid_data

url = 'http://103.247.238.92/webportal/pages/covid19.php'
regions = {}
result = {}
p = re.compile(r'DataTable\((\[[\s\S]+\])\)')

with requests.Session() as s: 
    s.headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://103.247.238.92/webportal/pages/covid19.php'}
    soup = bs(s.get(url).content, 'lxml')
    divisions = options_dict(soup, '#division option')

    for k,v in divisions.items():
        r = s.get(f'http://103.247.238.92/webportal/pages/ajaxDataDistrictDHIS2Dashboard.php?division_id={v}')
        soup = bs(r.content, 'lxml')
        districts = options_dict(soup, '#district option')
        regions[k] = districts
   
    s.headers = {'User-Agent': 'Mozilla/5.0','if-modified-since': 'Wed, 08 Jul 2020 22:27:07 GMT'}
    
    for k,v in divisions.items():
        result[k] = {}
        for k2,v2 in regions.items():    
            data = {'division': k2, 'district': v2, 'period': 'LAST_12_MONTH', 'Submit': 'Search'}
            r = s.get('http://103.247.238.92/webportal/pages/covid19.php',  data=data)
            result[k][k2] = covid_numbers(r.text)