Tags: xml, r, html-table, httr, rvest

Read an HTML table stored in Dropbox with the XML package


I am trying to read an HTML table stored in Dropbox with the XML package, but XML::readHTMLTable does not work on the Dropbox-hosted HTML file and I don't know why. Could someone help me?

My code:

Packages

require(httr)
require(XML) 

Open the HTML table file in Dropbox

FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0") 

Read the table

tables <- getNodeSet(htmlParse(FILE), "//table") 
FE_tab <- readHTMLTable(tables[2],
                        header = c("empresa","desc_projeto","desc_regiao",
                                   "cadastrador_por","cod_talhao","descricao",
                                   "formiga_area","qtd_destruido","latitude",
                                   "longitude","data_cadastro"),
                        colClasses = c("character","character","character",
                                       "character","character","character",
                                       "character","character","character",
                                       "character","character"),
                        trim = TRUE, stringsAsFactors = FALSE
)
head(FE_tab) ### Doesn't work


Solution

  • You can do it as follows:

    require(rvest)
    doc <- read_html("https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
    FE_tab <- doc %>% html_table() %>% `[[`(1)
    

    In your code, the URL must end in ?dl=1. With ?dl=0 you get the source code of the Dropbox preview page, i.e. the page displayed when you open https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0 in a browser, rather than the HTML file itself.
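
    If you have several share links, the dl=0 to dl=1 switch can be done in code. The following is only a sketch in base R: force_direct_download is a hypothetical helper name, not a function from httr, rvest, or XML, and it assumes the link follows the ?dl=0 / ?dl=1 pattern shown above.

    ```r
    # Hypothetical helper (not part of httr, rvest or XML): rewrite a Dropbox
    # share link so that it asks for the file itself (dl=1) rather than the
    # preview page (dl=0).
    force_direct_download <- function(url) {
      if (grepl("[?&]dl=[01]", url)) {
        # A dl parameter is already present: set it to 1.
        sub("([?&])dl=[01]", "\\1dl=1", url)
      } else if (grepl("?", url, fixed = TRUE)) {
        # Other query parameters exist: append dl=1 to them.
        paste0(url, "&dl=1")
      } else {
        # No query string yet: start one.
        paste0(url, "?dl=1")
      }
    }

    share_url  <- "https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=0"
    direct_url <- force_direct_download(share_url)
    ```

    The resulting direct_url can then be passed to read_html() or GET() in place of the hard-coded address.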

    If you still want to use the XML package, do:

    FILE <- GET(url="https://www.dropbox.com/s/mb316ghr4irxipr/TALHOES_AGENTES.htm?dl=1")
    tables <- getNodeSet(htmlParse(FILE), "//table") 
    FE_tab <- readHTMLTable(tables[[1]], 
                            header = c("empresa","desc_projeto","desc_regiao", 
                                       "cadastrador_por","cod_talhao","descricao", 
                                       "formiga_area","qtd_destruido","latitude", 
                                       "longitude","data_cadastro"), 
                            colClasses = c("character","character","character", 
                                           "character","character","character", 
                                           "character","character","character", 
                                           "character","character"), 
                            trim = TRUE, stringsAsFactors = FALSE 
    ) 
    head(FE_tab)
    

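    Because tables is a list, use tables[[1]] (double brackets) to extract the table node itself; tables[1] would return a one-element list instead. Use index 1 rather than 2 because tables contains only one element.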
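
    The difference between single and double brackets can be seen on any R list. In this sketch a plain list holding one data frame stands in for the result of getNodeSet() (an assumption made for illustration; the real object holds XML nodes):

    ```r
    # Stand-in for the result of getNodeSet(): a plain list with one element.
    # The data frame is only a placeholder for the <table> node.
    tables <- list(data.frame(x = 1:2))

    length(tables)           # 1: there is only one element
    class(tables[1])         # "list": single brackets return a sub-list
    class(tables[[1]])       # "data.frame": double brackets return the element
    is.null(tables[2][[1]])  # TRUE: index 2 is out of range, so the sub-list holds NULL
    ```

    This is why tables[2] in the question hands readHTMLTable() a list containing NULL rather than a table.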