Scraping data from ITU download links with rvest


I want to get the download links for each of the files on the website https://datahub.itu.int/indicators/ but am struggling to get what I need.

Each indicator seems to contain a direct link to download the data in the following format: https://api.datahub.itu.int/v2/data/download/byid/XXX/iscollection/YYY, where XXX is a numeric ID (apparently anywhere from 1 to 100,000 or more) and YYY is either true or false.

Ideally, I would like to get the link to each indicator and its corresponding name (the HTML text of the link) in one big data frame.

I have tried to get the links for the files using rvest with various combinations of html_nodes, html_attrs and XPaths, but have not had any luck. I really want to avoid looping over and brute-forcing 100,000+ download links, because that is horribly inefficient and will almost certainly cause issues for their servers.

I am not sure if there is a better way than using rvest, but any help would be most appreciated.

library(rvest)
library(httr)
library(tidyverse)
library(dplyr)

page = "https://datahub.itu.int/indicators/"
read_html(page) %>%
  html_attr("href")

Solution

  • If you look at the requests the page makes (e.g. in the browser devtools), you will find a request to an API that retrieves all the links. (The alternative would be RSelenium, but that would be much more complicated.) From the API response you can build the URLs yourself:

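    library(httr)
    library(tidyverse)

    # Query the API endpoint the page itself calls to list the indicators
    GET("https://api.datahub.itu.int/v2/dictionaries/getcategories") %>%
      content() %>%                  # parse the response into a nested list
      map(as_tibble) %>%             # one tibble per top-level element
      bind_rows() %>%
      # Flatten the nested subCategory/items structure to one row per indicator
      unnest_wider(subCategory) %>%
      unnest(items) %>%
      unnest_wider(items) %>%
      # Build each download URL from the indicator ID and the collection flag
      mutate(url = paste0("https://api.datahub.itu.int/v2/data/download/byid/",
                          codeID,
                          "/iscollection/",
                          tolower(as.character(isCollection)))) %>%
      select(category, codeID, label, subCategory, isCollection, url)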
    #> # A tibble: 181 × 6
    #>    category     codeID label                      subCategory isCollection url  
    #>    <chr>         <int> <chr>                      <chr>       <lgl>        <chr>
    #>  1 Connectivity   8941 Households with a radio    Access      FALSE        http…
    #>  2 Connectivity   8965 Households with a TV       Access      FALSE        http…
    #>  3 Connectivity 100002 Households with multichan… Access      TRUE         http…
    #>  4 Connectivity   8749 Households with telephone… Access      FALSE        http…
    #>  5 Connectivity  20719 Individuals who own a mob… Access      FALSE        http…
    #>  6 Connectivity  12046 Households with a computer Access      FALSE        http…
    #>  7 Connectivity  12047 Households with Internet … Access      FALSE        http…
    #>  8 Connectivity 100001 Households with access to… Access      TRUE         http…
    #>  9 Connectivity 100000 Reasons for not having In… Access      TRUE         http…
    #> 10 Connectivity     15 Fixed-telephone subscript… Access      FALSE        http…
    #> # … with 171 more rows
    

    Created on 2022-07-05 by the reprex package (v2.0.1)
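
The URL-building step can also be pulled out into a small helper, which makes it easy to check without calling the API. This is only a sketch in base R: `build_download_url` is a hypothetical name, not part of any package, and the URL pattern is the one described in the question. The two sample rows reuse `codeID` and `isCollection` values from the output above.

```r
# Hypothetical helper (not from any package): builds ITU DataHub download URLs
# following the pattern described in the question.
build_download_url <- function(code_id, is_collection) {
  stopifnot(length(code_id) == length(is_collection),
            is.logical(is_collection))
  paste0(
    "https://api.datahub.itu.int/v2/data/download/byid/",
    # Coerce to integer so a double such as 100000 is not printed as "1e+05"
    as.integer(code_id),
    "/iscollection/",
    # The URL uses lowercase "true"/"false" rather than R's TRUE/FALSE
    tolower(as.character(is_collection))
  )
}

# Two rows taken from the output above
indicators <- data.frame(
  codeID = c(8941L, 100002L),
  isCollection = c(FALSE, TRUE)
)
indicators$url <- build_download_url(indicators$codeID, indicators$isCollection)
print(indicators$url)
```

Because the API returns only the indicators that actually exist (181 rows in the output above), any later download loop runs over that short list instead of guessing 100,000+ IDs.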