Search code examples
htmlrweb-scrapingxpath

Unable to select a specific html table using xpathSapply in R


I am trying to scrape the second table from the following link http://cepea.esalq.usp.br/frango/?page=379&Dias=15

I have tried the following R code using the XML package:

    p_frango_resfriado <- htmlTreeParse("http://cepea.esalq.usp.br/frango/?page=379&Dias=15", 
    useInternalNodes = TRUE, 
    encoding = "UTF-8")

    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[2]", xmlValue)
    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[3]", xmlValue)
    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[4]", xmlValue)

The problem is that this code scrapes both html tables in the webpage, and I only want to scrape the second one. I have tried, the code below, which does not return anything interesting:

xpathSApply(p_frango_resfriado, 
"//a[text() = 'Preços do frango resfriado CEPEA/ESALQ - Estado SP']/table[@width = '95%']", 
xmlValue)

Could anyone please help me with this problem? I am not very good with the XPath language and html.


Solution

  • Using XML::xmlToDataFrame with XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
    

    The only difference between the two tables is the table names used in the xpath query

    Table 1: Preços do frango congelado CEPEA/ESALQ - Estado SP

    xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font/tr"
    

    Table 2: Preços do frango resfriado CEPEA/ESALQ - Estado SP

    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font/tr"
    
    doc <- htmlParse(temp)
    listofTableNodes <- getNodeSet(doc, xpexpr)
    length_nodes <- length(listofTableNodes)
    include_indices1 <- 1:(length_nodes - 2)
    
    # create dataframe using xmlvalues of the nodelist. Both `getNodeSet()` 
    # and `xpathSApply` will provide identical results.
    # using `getNodeSet()`
    df <- xmlToDataFrame(listofTableNodes[include_indices1], stringsAsFactors=FALSE)
    # using `xpathSApply`
    df <- xmlToDataFrame(xpathSApply(doc, xpexpr)[include_indices1], stringsAsFactors=FALSE)
    
    # clean data
    df$td <- as.Date(gsub("[Â ]\\s*", "", df$td), format = "%d/%m/%Y")
    df[, 4] <- gsub("\t$", '', df[, 4])
    
    # add column names
    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font/text()"
    # for Table-1
    # xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font/text()"
    listofTableNodes <- getNodeSet(doc, xpexpr)
    colnames(df) <- c('Date', sapply(listofTableNodes, xmlValue))
    df
    #            Date Valor R$ Var./dia Var./mês
    #   1  2016-08-17     4,37    0,46%     8,17%
    #   2  2016-08-16     4,35    0,46%     7,67%
    #   3  2016-08-15     4,33    0,46%     7,18%
    #   4  2016-08-12     4,31    0,00%     6,68%
    #   5  2016-08-11     4,31    0,70%     6,68%
    #   6  2016-08-10     4,28    0,47%     5,94%
    #   7  2016-08-09     4,26   -0,70%     5,45%
    #   8  2016-08-08     4,29    3,87%     6,19%
    #   9  2016-08-05     4,13    0,49%     2,23%
    #   10 2016-08-04     4,11    0,00%     1,73%
    #   11 2016-08-03     4,11    1,73%     1,73%
    #   12 2016-08-02     4,04    0,00%     0,00%
    #   13 2016-08-01     4,04    0,00%     0,00%
    #   14 2016-07-29     4,04    0,00%    -0,49%
    #   15 2016-07-28     4,04   -0,25%    -0,49%
    

    Note: Everyday, the values are updated on this webpage, which will be taken into account using length_nodes.

    Using XML::readHTMLTable without XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
    df <- readHTMLTable(temp, stringAsFactors = FALSE, which = 8)
    # Table 1
    df[4:18,]
    # Table 2
    df[28:42,]
    

    Using XML::readHTMLTable with XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
    doc <- htmlParse(temp)
    
    # XPath Query
    # Table -1
    xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font"
    df <- xpathSApply(doc, xpexpr, readHTMLTable)
    include_indices <- 1:(nrow(df[[4]]) -4)
    df <- df[[4]][include_indices,]
    
    # Table-2
    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font"
    df <- xpathSApply(doc, xpexpr, readHTMLTable)
    include_indices <- 1:(nrow(df[[4]]) -4)
    df <- df[[4]][include_indices,]