Search code examples
jsonrweb-scrapingwebhttr

httr GET function read table


I want to scrape this website, and get the data from the table.

I use GET from the package httr, code is like below:

url <- 'http://datacenter.mep.gov.cn/report/water/water.jsp?'
year <- 2016
wissue <- 2

res <- GET(url,
           query = list(year = year,
                        wissue = wissue))


resC <- content(res, as = 'text', encoding = 'utf-8')

But what I got is not a json string but something very strange like below:

"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<html>\r\n\t<head>\r\n\t\t<title>中华人民共和国环境保护部--政府网站数据中心</title>\r\n\t\t<meta http-equiv=\"content-type\" content=\"text/html;

I wonder is there anyway to parse this format?


Solution

  • The rowspan attribute is going to make dealing with this table pretty interesting. You have a few choices, two of which are:

    1. use html_table() on the target <table> using fill=TRUE and perform surgery on the resultant data frame
    2. attack it at the <tr>-level and build the data frame from the ground up

    This answer does the latter.

    library(rvest)
    library(purrr)
    

    First, we get the content in a form we can perform XML/HTML surgery on:

    content(res, as = 'text', encoding = 'utf-8') %>% 
      read_html() -> pg
    

    Next, we target and extract the table node with the report:

    tab <- html_nodes(pg, "table#report1") 
    

    Here's te tricky bit. We first target all the <tr> elements that have @rowspan attributes but no <td> elements with a @colspan attribute:

    html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>% 
    

    Next, we process those invidivually:

      map_df(function(x) {
    

    We get the # of rows the <tr> spans:

        html_nodes(x, xpath=".//td[@rowspan]") %>% 
          html_attr("rowspan") %>% 
          as.numeric() -> row_ct
    

    Find all the sibling <tr> elements and reduce the set to the remaining ones in this <tr> "block":

        rows <- html_nodes(x, xpath=".//following-sibling::tr")
        rows <- rows[1:(row_ct-1)] 
    

    Make a data frame from that first block row

        html_nodes(x, xpath=".//td") %>% 
          html_text() %>% 
          setNames(sprintf("X%d", 1:13)) %>% 
          as.list() %>% 
          flatten_df() -> first
    

    Go through all filtered sibling rows and do the same, leaving room to fill in the spanned column:

        map_df(rows, ~html_nodes(., xpath=".//td") %>% 
                 html_text() %>% 
                 setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>% 
                 as.list()) %>% 
          mutate(X3=first$X3) %>% 
          select(X1, X2, X3, everything()) -> rest
    
        bind_rows(first, rest)
    
      }) -> h2o_df
    
    dplyr::glimpse(h2o_df)
    

    I can't paste the output of that since SO's javascript text filter is so brain dead it thinks that the post is spam just b/c it has kanji characters.

    Here's all the code in a contiguous chunk:

    tab <- html_nodes(pg, "table#report1") 
    
    html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>% 
      map_df(function(x) {
    
        html_nodes(x, xpath=".//td[@rowspan]") %>% 
          html_attr("rowspan") %>% 
          as.numeric() -> row_ct
    
        rows <- html_nodes(x, xpath=".//following-sibling::tr")
        rows <- rows[1:(row_ct-1)] 
    
        html_nodes(x, xpath=".//td") %>% 
          html_text() %>% 
          setNames(sprintf("X%d", 1:13)) %>% 
          as.list() %>% 
          flatten_df() -> first
    
        map_df(rows, ~html_nodes(., xpath=".//td") %>% 
                 html_text() %>% 
                 setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>% 
                 as.list()) %>% 
          mutate(X3=first$X3) %>% 
          select(X1, X2, X3, everything()) -> rest
    
        bind_rows(first, rest)
    
      }) -> h2o_df