
Getting data in R as dataframe from web source


I am trying to load some air pollution background data directly into R as a data.frame using the RCurl package.

The website has three dropdown boxes for choosing options before downloading the .csv file, as shown in the figure below:

[Screenshot: the three dropdown boxes and the "Download CSV" button]

I am trying to select one value from each of the three dropdown boxes and download the data behind the "Download CSV" button directly into R as a data.frame.

I want to download different combinations of multiple years and multiple pollutants for a specific site.

In other posts on StackOverflow I have come across the getForm function from the RCurl package, but I don't understand how to control the three dropdown boxes with this function.

The URL for the data source is: http://uk-air.defra.gov.uk/data/laqm-background-maps?year=2011


Solution

  • For this website you can construct a URL and submit a GET request to fetch the CSV directly:

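    library(httr)

    # Endpoint that serves the CSV; the query parameters mirror the form fields.
    baseURL <- "http://uk-air.defra.gov.uk/data/laqm-background-maps.php"
    queryList <- parse_url(baseURL)
    # build_url() percent-encodes the values, so pass the button label with a
    # plain space rather than a pre-encoded "+".
    queryList$query <- list("bkgrd-la" = 359, "bkgrd-pollutant" = "no2", "bkgrd-year" = 2011,
                            action = "data", year = 2011, submit = "Download CSV")
    # Save the response body to disk.
    res <- GET(build_url(queryList), write_disk("temp.csv", overwrite = TRUE))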
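
    Since the question asks for several pollutants and years for one site, the same request can be repeated over a grid of values. The following is only a sketch: `build_background_url` and `fetch_background` are hypothetical helper names, the pollutant and year values are placeholders that should be checked against the form, and the assumption that the response is plain CSV with a header row has not been verified against the site.

    ```r
    library(httr)

    # Endpoint taken from the answer above.
    base_url <- "http://uk-air.defra.gov.uk/data/laqm-background-maps.php"

    # Hypothetical helper: build the download URL for one site/pollutant/year.
    # build_url() percent-encodes values, so the button label uses a plain space.
    build_background_url <- function(site, pollutant, year) {
      url <- parse_url(base_url)
      url$query <- list("bkgrd-la" = site, "bkgrd-pollutant" = pollutant,
                        "bkgrd-year" = year, action = "data", year = year,
                        submit = "Download CSV")
      build_url(url)
    }

    # Every pollutant/year combination for one site (359 = Aberdeen City Council).
    # The pollutant and year values here are placeholders.
    combos <- expand.grid(pollutant = c("no2", "pm10"), year = c(2011, 2012),
                          stringsAsFactors = FALSE)
    combos$url <- mapply(build_background_url, site = 359,
                         pollutant = combos$pollutant, year = combos$year)

    # Hypothetical helper: fetch one URL and parse the body as CSV.
    # Assumption: the body is plain CSV with a header row; raise `skip`
    # if the file starts with descriptive lines.
    fetch_background <- function(url, skip = 0) {
      res <- GET(url)
      stop_for_status(res)
      read.csv(text = content(res, as = "text", encoding = "UTF-8"), skip = skip)
    }
    ```

    Calling `lapply(combos$url, fetch_background)` would then return one data.frame per combination, which can be combined with `do.call(rbind, ...)` if the columns match.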
    

    You can get the codes for each form field by parsing the option elements of the original page:

    library(XML)
    doc <- htmlParse("http://uk-air.defra.gov.uk/data/laqm-background-maps?year=2011")
    councils <- doc["//*[@id='bkgrd-la']/option", fun = function(x){
      data.frame(value = xmlGetAttr(x, "value"), council = xmlValue(x))
      }]
    councils <- do.call(rbind.data.frame, councils)
    > head(councils)
    value                      council
    1   359        Aberdeen City Council
    2   360        Aberdeenshire Council
    3     1        Adur District Council
    4     2    Allerdale Borough Council
    5     4 Amber Valley Borough Council
    6   401      Anglesey County Council
    
    pollutants <- doc["//*[@id='bkgrd-pollutant']/option", fun = function(x){
      data.frame(value = xmlGetAttr(x, "value"), pollutant = xmlValue(x))
    }]
    pollutants <- do.call(rbind.data.frame, pollutants)
    > head(pollutants)
    value pollutant
    1   no2       NO2
    2   nox       NOx
    3  pm10      PM10
    4  pm25     PM2.5
    5   no2       NO2
    6   nox       NOx
    

    The codes for the remaining dropdown can be extracted in the same way.
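
    The extraction pattern used for the dropdowns above can be wrapped in one small function. This is a sketch only: `get_options` is a hypothetical helper name, and the inline HTML is a made-up stand-in for the real page so that the example runs without network access.

    ```r
    library(XML)

    # Hypothetical helper: return the <option> values and labels of the
    # element with the given id. Assumes every <option> has a value attribute.
    get_options <- function(doc, id) {
      path <- sprintf("//*[@id='%s']/option", id)
      data.frame(value = xpathSApply(doc, path, xmlGetAttr, "value"),
                 label = xpathSApply(doc, path, xmlValue),
                 stringsAsFactors = FALSE)
    }

    # Small inline stand-in for the real page (not the site's actual markup).
    html <- "<html><body><select id='bkgrd-pollutant'>
      <option value='no2'>NO2</option><option value='pm10'>PM10</option>
    </select></body></html>"
    doc <- htmlParse(html, asText = TRUE)
    opts <- get_options(doc, "bkgrd-pollutant")
    ```

    With the real page parsed as in the answer, the same call with `"bkgrd-la"` or `"bkgrd-pollutant"` should give the council and pollutant tables shown above.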