Search code examples
rreadrreadxl

load corrupted xls file into r without manually changing file type


I am struggling to download an excel file and then loading it to R:

utils::download.file(
  url = 'https://servicos.ibama.gov.br/ctf/publico/areasembargadas/downloadListaAreasEmbargadas.php',
  destfile = 'C:/users/arthu/Desktop/fines.rar',
  mode = "wb"
)

After unzipping and trying to load it into R:

utils::unzip(
  zipfile = './fines.rar',
  exdir = './ibama_data'
)


dados <- readxl::read_xls(
  "./ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls"),
  skip = 6,
  col_type = c(rep("guess", 13), "date", "guess", "date")
)

I get libxls error: Unable to open file.

If I try to rename the file as .xlsx as follows, I get an evaluation error when reading it with readxl::read_excel, saying unable to open file

 file <-   file.rename(
      from = "./Desktop/ibama_data/rel_areas_embargadas_0-65000_2020-12-10_080019.xls",
      to = "./Desktop/ibama_data/test.xlsx"
    )

However, if I manually open such a file, excel throws me a warning saying that the file's extension does not match its type. After saving it as .xlsx, I can finally load it using read_excel

How can I solve this, given that I want to write a package with a function that downloads such data from the web and then loads it into R?

Edit


Solution

  • The .xls file you are trying to read isn't an Excel document, it's an HTML table.
    You could read it using XML package :

    library(XML)
    doc <- htmlParse('rel_areas_embargadas_0-65000_2021-01-13_080018.xls')
    tableNode <- getNodeSet(doc, '//table')
    data <- XML::readHTMLTable(tableNode[[1]])
    
    #Store header
    header <- data[1:5,]
    
    #Store colnames
    colnames <- data[6,]
    
    #Remove header
    data <- data[-1:-6,]
    
    #Set colnames
    colnames(data)<-colnames
    
    head(data)