httr GET function read table

I want to scrape this website, and get the data from the table.

I use GET from the package httr, code is like below:

url <- 'http://datacenter.mep.gov.cn/report/water/water.jsp?'
year <- 2016
wissue <- 2

res <- GET(url,
           query = list(year = year,
                        wissue = wissue))


resC <- content(res, as = 'text', encoding = 'utf-8')

But what I got is not a json string but something very strange like below:

"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<html>\r\n\t<head>\r\n\t\t<title>中华人民共和国环境保护部--政府网站数据中心</title>\r\n\t\t<meta http-equiv=\"content-type\" content=\"text/html;

I wonder is there anyway to parse this format?

Solution

The rowspan attribute is going to make dealing with this table pretty interesting. You have a few choices, two of which are:

use html_table() on the target <table> using fill=TRUE and perform surgery on the resultant data frame
attack it at the <tr>-level and build the data frame from the ground up

This answer does the latter.

library(rvest)
library(purrr)

First, we get the content in a form we can perform XML/HTML surgery on:

content(res, as = 'text', encoding = 'utf-8') %>% 
  read_html() -> pg

Next, we target and extract the table node with the report:

tab <- html_nodes(pg, "table#report1")

Here's te tricky bit. We first target all the <tr> elements that have @rowspan attributes but no <td> elements with a @colspan attribute:

html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>%

Next, we process those invidivually:

  map_df(function(x) {

We get the # of rows the <tr> spans:

    html_nodes(x, xpath=".//td[@rowspan]") %>% 
      html_attr("rowspan") %>% 
      as.numeric() -> row_ct

Find all the sibling <tr> elements and reduce the set to the remaining ones in this <tr> "block":

    rows <- html_nodes(x, xpath=".//following-sibling::tr")
    rows <- rows[1:(row_ct-1)]

Make a data frame from that first block row

    html_nodes(x, xpath=".//td") %>% 
      html_text() %>% 
      setNames(sprintf("X%d", 1:13)) %>% 
      as.list() %>% 
      flatten_df() -> first

Go through all filtered sibling rows and do the same, leaving room to fill in the spanned column:

    map_df(rows, ~html_nodes(., xpath=".//td") %>% 
             html_text() %>% 
             setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>% 
             as.list()) %>% 
      mutate(X3=first$X3) %>% 
      select(X1, X2, X3, everything()) -> rest

    bind_rows(first, rest)

  }) -> h2o_df

dplyr::glimpse(h2o_df)

I can't paste the output of that since SO's javascript text filter is so brain dead it thinks that the post is spam just b/c it has kanji characters.

Here's all the code in a contiguous chunk:

tab <- html_nodes(pg, "table#report1") 

html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>% 
  map_df(function(x) {

    html_nodes(x, xpath=".//td[@rowspan]") %>% 
      html_attr("rowspan") %>% 
      as.numeric() -> row_ct

    rows <- html_nodes(x, xpath=".//following-sibling::tr")
    rows <- rows[1:(row_ct-1)] 

    html_nodes(x, xpath=".//td") %>% 
      html_text() %>% 
      setNames(sprintf("X%d", 1:13)) %>% 
      as.list() %>% 
      flatten_df() -> first

    map_df(rows, ~html_nodes(., xpath=".//td") %>% 
             html_text() %>% 
             setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>% 
             as.list()) %>% 
      mutate(X3=first$X3) %>% 
      select(X1, X2, X3, everything()) -> rest

    bind_rows(first, rest)

  }) -> h2o_df