
R: Web scraping .aspx form - no results


Novice to web scraping here. Similar questions have been posted (and answered), but I can't seem to apply them successfully. I'm trying to loop over a data set and get some scores (the percentile of atherosclerosis) using the following online calculator: https://www.mesa-nhlbi.org/Calcium/input.aspx (for example, plugging in Score = 30, gender = 0 (female), Race = 3 (white), Age = 50 will get you 94%).

However, I can't seem to get any results matching the manual execution of the calculator. This is my code (thanks in advance!):

#if(!require("devtools"))
#install.packages("devtools")
#devtools::install_github("omegahat/RHTMLForms")
#install.packages("XML")

library(XML)
library(RCurl)
library(httr)
library(tidyverse)
library(RHTMLForms)
library(rvest)

cur_url <- "https://www.mesa-nhlbi.org/Calcium/input.aspx/"
cur_session <- html_session(cur_url)

cur_Form <- html_form(cur_session)

cur_fill <- set_values(cur_Form[[1]],
                   Score = '30',
                   gender='0',
                   Race ='3',
                   Age='50')

cur_set <- submit_form(cur_session, cur_fill,submit = "Calculate")
content(cur_set$response)

Using the rvest library, I've read the URL into an "html_session" variable and extracted the form via "html_form":

cur_url <- "https://www.mesa-nhlbi.org/Calcium/input.aspx/"
cur_session <- html_session(cur_url)
cur_Form <- html_form(cur_session)

I then updated the relevant fields using the set_values function and used submit_form to execute:

cur_fill <- set_values(cur_Form[[1]], Score = '30',gender='0',Race ='3',Age='50')
cur_set <- submit_form(cur_session, cur_fill,submit = "Calculate")

However, I don't seem to get any relevant results in the cur_set variable. Any help on the matter will be greatly appreciated.


Solution

  • If you look at the web page in a browser (e.g., firefox, chrome) and enable the dev-console, you can see certain id fields and such that will help identify what you need.

    Up front, rvest (1.0.3 in my usage) has deprecated several functions you are using. I believe it'll work for now as-is, but I'm using the recommended functions:

    • session() in lieu of html_session()
    • html_form_set() in lieu of set_values()
    • session_submit() in lieu of submit_form()
    library(rvest)
    cur_url <- "https://www.mesa-nhlbi.org/Calcium/input.aspx/"
    cur_session <- session(cur_url)
    cur_Form <- html_form(cur_session)
    cur_fill <- html_form_set(cur_Form[[1]], Score = '30',gender='0',Race ='3',Age='50')
    cur_set <- session_submit(cur_session, cur_fill,submit = "Calculate")
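
As a complement to the dev console, the parsed form itself lists the fields that html_form_set() can fill. The following is a minimal sketch, assuming the form object returned by html_form() carries a `fields` list as it does in rvest 1.0.x; `list_form_fields` is a hypothetical helper name, not part of rvest.

```r
library(rvest)

# Hypothetical helper: tabulate the name and type of every field in a form
# returned by html_form(). It relies on the `fields` list of the form object
# as structured in rvest 1.0.x (an assumption about package internals).
list_form_fields <- function(form) {
  # Return one attribute of a field as a string, or NA if it is absent.
  field_attr <- function(f, what) {
    val <- f[[what]]
    if (is.null(val)) NA_character_ else as.character(val)[1]
  }
  data.frame(
    name = vapply(form$fields, field_attr, character(1), what = "name"),
    type = vapply(form$fields, field_attr, character(1), what = "type"),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage against the calculator page (needs network access):
# list_form_fields(html_form(session(cur_url))[[1]])
```

The names in the resulting table are the ones to pass as arguments to html_form_set(), and the submit button's name is the one to pass as `submit`.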
    

    Various things you can get from this:

    html_table(cur_set)
    # [[1]]
    # # A tibble: 2 × 4
    #   X1    X2    X3    X4   
    #   <chr> <chr> <chr> <chr>
    # 1 25th  50th  75th  90th 
    # 2 0     0     0     8    
    

    From the dev-browser, we find specific areas, notably scoreLabel (30) and others:

    [Screenshot: dev console showing the location of scoreLabel]

    Similarly for percLabel (94) and Label10 ("16 %.").

    From this,

    html_nodes(cur_set, "#Label10") %>%
      html_text()
    # [1] "16 %."
    html_nodes(cur_set, "#scoreLabel") %>%
      html_text()
    # [1] "30"
    html_nodes(cur_set, "#percLabel") %>%
      html_text()
    # [1] "94"
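
Since the question is about looping over a data set, the steps above can be wrapped in a function and applied row by row. This is only a sketch under stated assumptions: the helper names (`extract_labels`, `get_mesa_result`, `score_all`), the data frame column names (`score`, `gender`, `race`, `age`) and the one-second pause between requests are all hypothetical choices; the URL, form field names and element ids are the ones used earlier in this thread and may change if the site changes.

```r
library(rvest)

# Pull the three result labels out of a results page (a session or a parsed
# document). The ids are the ones found via the dev console above.
extract_labels <- function(page) {
  ids <- c(score = "#scoreLabel", percentile = "#percLabel", label10 = "#Label10")
  vapply(ids, function(css) html_text2(html_element(page, css)), character(1))
}

# Hypothetical helper: submit one set of inputs and return the labels.
get_mesa_result <- function(score, gender, race, age,
                            url = "https://www.mesa-nhlbi.org/Calcium/input.aspx/") {
  sess <- session(url)
  form <- html_form(sess)[[1]]
  filled <- html_form_set(form,
                          Score = as.character(score),
                          gender = as.character(gender),
                          Race = as.character(race),
                          Age = as.character(age))
  result <- session_submit(sess, filled, submit = "Calculate")
  extract_labels(result)
}

# Hypothetical helper: run every row of a data frame through the calculator.
# Assumes columns named score, gender, race and age.
score_all <- function(df, pause = 1) {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    Sys.sleep(pause)  # be polite to the server between requests
    get_mesa_result(df$score[i], df$gender[i], df$race[i], df$age[i])
  })
  cbind(df, do.call(rbind, rows))
}

# Usage (needs network access):
# patients <- data.frame(score = c(30, 120), gender = c(0, 1),
#                        race = c(3, 3), age = c(50, 62))
# score_all(patients)
```

Keeping the label extraction separate from the request makes it possible to check the parsing on a saved or hand-written page without contacting the site.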