Search code examples
rweb-scrapinghttrrselenium

Use R to mimic "clicking" on a file to download it


I would like R to automatically download an excel file about Oil and Gas rigs from this page. The file is downloaded by clicking on "North America Rotary Rig Count Pivot Table (Feb 2011 - Current)" (second option) but I cannot seem to find a way to do this in R.

Any clues? Thanks!

Note: Unfortunately, using download.file() does not seem to work. I get a message when trying to open the file in MS Excel that the extension is incorrect or the file is corrupt. I also get this error in R when using readxl::read_excel(): Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo


Solution

  • Some libs to help
    You actually will need only dplyr, purrr, stringr, and xml2.

    library(tidyverse)
    library(rvest)
    #> Loading required package: xml2
    #> 
    #> Attaching package: 'rvest'
    #> The following object is masked from 'package:purrr':
    #> 
    #>     pluck
    #> The following object is masked from 'package:readr':
    #> 
    #>     guess_encoding
    library(htmltab)
    library(xml2)
    library(readxl)
    

    I like to do it this way because some sites use partial links.

    base <- "https://rigcount.bakerhughes.com"
    url <- paste0(base, "/na-rig-count")
    
    # find links
    url_html <- xml2::read_html(url)
    url_html %>% 
      html_nodes("a") %>% 
      html_attrs() %>% 
      bind_rows() -> url_tbl
    

    Check href content, find some pattern you are interested in. You may use inspect on your browser too, it is truly helpful.

    url_tbl %>% 
      count(href)
    #> # A tibble: 22 x 2
    #>    href                                                                        n
    #>    <chr>                                                                   <int>
    #>  1 /                                                                           1
    #>  2 /email-alerts                                                               1
    #>  3 /intl-rig-count                                                             1
    #>  4 /na-rig-count                                                               1
    #>  5 /rig-count-faqs                                                             1
    #>  6 /rig-count-overview                                                         2
    #>  7 #main-menu                                                                  1
    #>  8 https://itunes.apple.com/app/baker-hughes-rig-counts/id393570114?mt=8       1
    #>  9 https://rigcount.bakerhughes.com/static-files/4ab04723-b638-4310-afd9-…     1
    #> 10 https://rigcount.bakerhughes.com/static-files/4b92b553-a48d-43a3-b4d9-…     1
    #> # … with 12 more rows
    

    Perhaps, I notice that static-files may be a good pattern to href but then I found a better one, in type.

    url_tbl %>% 
      filter(str_detect(type, "ms-excel")) -> url_xlsx
    

    build our list (remember to avoid some noise as an extra dot, spaces, and special characters) I hope someone proposes a better way to avoid those things

    myFiles <- pull(url_xlsx, "href")
    names <- pull(url_xlsx, "title")
    names(myFiles) <- paste0(
        str_replace_all(names, "[\\.\\-\\ ]", "_"), 
        str_extract(names, ".\\w+$")
    )
    
    # download data
    myFiles %>% 
      imap(
        ~ download.file(
          url = .x, 
          destfile = .y,
          method="curl", # might be not necessary 
          extra="-k"
        )
      )
    #> $`north_america_rotary_rig_count_jan_2000_-_current.xlsb`
    #> [1] 0
    #> 
    #> $`north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb`
    #> [1] 0
    #> 
    #> $`U.S.  Monthly Averages by State 1992-2016.xls`
    #> [1] 0
    #> 
    #> $`North America Rotary Rig Counts through 2016.xls`
    #> [1] 0
    #> 
    #> $`U.S. Annual Averages by State 1987-2016.xls`
    #> [1] 0
    #> 
    #> $Workover_9.xls
    #> [1] 0
    

    Created on 2020-12-16 by the reprex package (v0.3.0)

    Now you may see your files.

    names(myFiles) %>%
        map(
            readxlsb:read_xlsb
        ) -> myData
    

    I hope it helps.