R: Web scraping .aspx form with hidden fields, "Unknown field names" error


For two days, I have been trying to work out how to fill out a form and submit it to download .csv files from https://www.igb.illinois.gov/VideoReports.aspx. Unfortunately, I cannot seem to crack it. Full disclosure: I am a novice web scraper. I can do basic scraping, but this is new territory for me. In the end, I hope to write a program that pulls down all establishments' monthly revenue reports back to September 2009.

The main issue seems to be how the form is laid out. I can't figure out how to designate the fields I'd like to fill in to request the .csv file. I have been using rvest and RHTMLForms. I've located the form in the Chrome dev tools and can see everything I need. I just can't drill down to where I need to go to submit the query.

Here's where I've gotten so far:

library('rvest')
library('RHTMLForms')

igb <- "https://www.igb.illinois.gov/VideoReports.aspx"
igb_html <- read_html(igb)

igbForm <- html_form(igb_html)
igbForm

The issue seems to start here. The "form" has only one element, and it includes hidden inputs. The fields I want to query are toward the end. It looks like this ...

[[1]]
<form> 'aspnetForm' (POST VideoReports.aspx)
  <input hidden> '__VIEWSTATE': /wEPDwUKMTU1MTExNzA3NQ9kFgJmD2QWAgIDD2QWAgIBD2QWBAIBD2QWEgIDDw8WAh4EVGV4dAUOU2VwdGVtYmVyIDIwMTJkZAIFDw8WAh8ABQ1GZWJydWFyeSAyMDIwZGQCFQ9kFgICAw8QZBAVAg5TdW1tYXJ5IHJlcG9ydA1EZXRhaWwgcmVwb3J0FQIOU3VtbWFyeSByZXBvcnQNRGV0YWlsIHJlcG9ydBQrAwJnZ2RkAhcPZBYCAgMPEA8WBh4ORGF0YVZhbHVlRmllbGQFA0tleR4NRGF0YVRleHRGaWVsZAUFVmFsdWUeC18 ....[TRUNCATE]

and at the very end, I get to what I'd like to query ...

  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeStatewide
  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeMuni
  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeEst
  <select> 'ctl00$MainPlaceHolder$SearchStateType' [1/2]
  <select> 'ctl00$MainPlaceHolder$SearchMunicipality' [0/1069]
  <select> 'ctl00$MainPlaceHolder$SearchEstablishment' [0/10182]
  <input text> 'ctl00$MainPlaceHolder$SearchLicenseNumber': 
  <select> 'ctl00$MainPlaceHolder$SearchStartMonth' [1/12]
  <select> 'ctl00$MainPlaceHolder$SearchStartYear' [1/9]
  <select> 'ctl00$MainPlaceHolder$SearchEndMonth' [1/12]
  <select> 'ctl00$MainPlaceHolder$SearchEndYear' [1/9]
  <input radio> 'ctl00$MainPlaceHolder$ViewType': ViewPDF
  <input radio> 'ctl00$MainPlaceHolder$ViewType': ViewCSV

I used the following to peek into what I need ...

igb_form <- getHTMLFormDescription(igb_html)
igb_form[[1]]

... and this code to locate the fields and values for each of them. For example ...

igb_form_att <- igb_form[[1]]
igb_form_att$elements[[9]]

... shows me the start month field and values from the dropdown menu ...

ctl00$MainPlaceHolder$SearchStartMonth: [ February ]  January, February, March, April, May, June, July, August, September, October, November, December

I thought this would do it. So I ran the following ...

igb_fill <- set_values(igb_html,
                      'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
                      'ctl00$MainPlaceHolder$SearchEstablishment'='All Establishments',
                      'ctl00$MainPlaceHolder$SearchEstablishment' ='',
                      'ctl00$MainPlaceHolder$SearchStartMonth'='September',
                      'ctl00$MainPlaceHolder$SearchStartYear'='2009',
                      'ctl00$MainPlaceHolder$SearchEndMonth' ='February',
                      'ctl00$MainPlaceHolder$SearchEndYear'='2020',
                      'ctl00$MainPlaceHolder$ViewType'='ViewCSV')

submit_form(session=igb_html, form=igb_fill, POST(igb))

But received this error ...

Error: Unknown field names: ctl00$MainPlaceHolder$SearchType, ctl00$MainPlaceHolder$SearchEstablishment, ctl00$MainPlaceHolder$SearchStartMonth, ctl00$MainPlaceHolder$SearchStartYear, ctl00$MainPlaceHolder$SearchEndMonth, ctl00$MainPlaceHolder$SearchEndYear, ctl00$MainPlaceHolder$ViewType
Traceback:

1. set_values(igb_form, `ctl00$MainPlaceHolder$SearchType` = "TypeEst", 
 .     `ctl00$MainPlaceHolder$SearchEstablishment` = "All Establishments", 
 .     `ctl00$MainPlaceHolder$SearchEstablishment` = "", `ctl00$MainPlaceHolder$SearchStartMonth` = "September", 
 .     `ctl00$MainPlaceHolder$SearchStartYear` = "2009", `ctl00$MainPlaceHolder$SearchEndMonth` = "February", 
 .     `ctl00$MainPlaceHolder$SearchEndYear` = "2020", `ctl00$MainPlaceHolder$ViewType` = "ViewCSV")
2. stop("Unknown field names: ", paste(no_match, collapse = ", "), 
 .     call. = FALSE)

Apologies for the long-winded question, but I've poked around a lot on this and can't find an answer that gets me where I need to go. Maybe I'm in over my head, but I'd appreciate any help! (I'm also pretty sure the submit code is wrong, but I can tackle that after this.)


Solution

  • There were some issues with your code:

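    • The set_values(...) function takes a single form (one element of the list returned by html_form(...)), not the entire parsed html, so I replaced igb_html with igb_form there.
    • The submit_form(...) function takes a session created with html_session(...), not a document from read_html(...), so I replaced read_html(igb) with html_session(igb).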
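
    As a minimal offline sketch of the first point (the markup below is a made-up stand-in, not the real IGB page): html_form() returns a list of forms, and the field names live on a single form inside that list. A list of forms, or a whole document, has no fields to match against, which is consistent with the "Unknown field names" error.

```r
library(rvest)

# Hypothetical stand-in for the page; the real form has many more fields
page <- read_html('
<form id="aspnetForm" method="post" action="VideoReports.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc">
  <select name="ctl00$MainPlaceHolder$SearchStartMonth">
    <option value="January" selected>January</option>
    <option value="September">September</option>
  </select>
</form>')

forms <- html_form(page)  # a list of forms, even when the page has only one
form  <- forms[[1]]       # the single form object

# The list itself has no `fields` element, so there are no names to match
names(forms$fields)

# The single form carries the field names that can be set
names(form$fields)
```

    The first names() call returns NULL, while the second lists the two field names from the stand-in form.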

    The following code should work:

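    library(rvest)
    
    igb <- "https://www.igb.illinois.gov/VideoReports.aspx"
    
    # Open a session instead of only parsing the page; submit_form() needs one
    igb_html <- html_session(igb)
    
    # html_form() returns a list of forms; this page has one, so take the first
    igb_form <- html_form(igb_html)[[1]]
    
    # Field names contain "$", so they have to be quoted.
    # (The repeated SearchEstablishment argument from the question is dropped;
    # only the first value for a repeated name was being applied.)
    igb_fill <- set_values(igb_form,
                           'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
                           'ctl00$MainPlaceHolder$SearchEstablishment' = 'All Establishments',
                           'ctl00$MainPlaceHolder$SearchStartMonth' = 'September',
                           'ctl00$MainPlaceHolder$SearchStartYear' = '2009',
                           'ctl00$MainPlaceHolder$SearchEndMonth' = 'February',
                           'ctl00$MainPlaceHolder$SearchEndYear' = '2020',
                           'ctl00$MainPlaceHolder$ViewType' = 'ViewCSV')
    
    # Submit via the search button; the result is the updated session
    igb_html <- submit_form(igb_html, igb_fill, submit = "ctl00$MainPlaceHolder$ButtonSearch")
    
    igb_html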
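
    One note for newer installs: rvest 1.0.0 renamed these functions, and the old names are deprecated. Below is a small offline sketch of the renamed form-filling step, assuming rvest >= 1.0.0 and using made-up stand-in markup rather than the real page. It has not been run against the live site.

```r
library(rvest)

# Hypothetical stand-in markup; the real form has many more fields
page <- read_html('
<form id="aspnetForm" method="post" action="VideoReports.aspx">
  <select name="ctl00$MainPlaceHolder$SearchStartYear">
    <option value="2009">2009</option>
    <option value="2020" selected>2020</option>
  </select>
</form>')

form <- html_form(page)[[1]]

# html_form_set() is the rvest >= 1.0.0 name for set_values()
filled <- html_form_set(form, `ctl00$MainPlaceHolder$SearchStartYear` = "2009")

# Inspect the value that would be submitted for this field
filled$fields[["ctl00$MainPlaceHolder$SearchStartYear"]]$value

# Against the live site, session() replaces html_session() and
# session_submit() replaces submit_form(), with the same argument order.
```

    The argument names and values stay the same as in the code above; only the function names change.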