I'm trying to download a lot of files from the WorldPop UK site for a lot of countries I have in a dataset (not just the small example). Downloading each file would be very time consuming and tedious.
I'm fairly familiar with download methods in R, but I can't get these downloads to work. I know it's because the download links run through HTML somehow, but I'm no good with HTML or JavaScript.
I have done a lot of reading on httr, RCurl, and RSelenium. I'd prefer a solution that avoids RSelenium: I'm far more familiar with the other packages, and I may share the code, so I don't want to have to set up a browser every time (at least, that's my understanding of how RSelenium works).
Can somebody help me out with this?
Direct download link to a small text (.txt) file that works fine in a browser, but not in R using download.file or curl_download: http://www.worldpop.org.uk/data/files/index.php?dataset=140&action=download&file=60
Site with index of files for Nigeria for example (you can see the href= links in the html code): http://www.worldpop.org.uk/data/files/index.php?dataset=140&action=dir
In Chrome: view-source:http://www.worldpop.org.uk/data/files/index.php?dataset=140&action=dir
The download links appear around lines 558-559 in my developer console.
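For reference, this is roughly the kind of call I've been attempting (simplified), with no luck:

# what I've tried (roughly); the same URL downloads fine in a browser
download.file(
  "http://www.worldpop.org.uk/data/files/index.php?dataset=140&action=download&file=60",
  destfile = "README.txt",
  mode = "wb"
)

# and the curl equivalent
curl::curl_download(
  "http://www.worldpop.org.uk/data/files/index.php?dataset=140&action=download&file=60",
  destfile = "README.txt"
)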
Thanks in advance!
Well, they certainly do not make this easy. On top of a convoluted "web app", they also tried to do the right thing and use SHA-1 hash validation on sourced JavaScript resources, but failed to keep those maintained (i.e. browsers that enforce subresource integrity won't be able to work with that site).
Anyway, here's what you have to do to avoid splashr or RSelenium/seleniumPipes. I used your "README" example and there are plenty of comments.
My advice is to wrap one or more of these bits into functions for easier use, and also to consider wrapping the various calls in purrr helpers like safely() (there are also "retry" helpers out there).
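Here is a rough sketch of the kind of wrapper I mean (the function names, retry count, and pause are just illustrative):

safe_GET <- purrr::safely(httr::GET)

retry_GET <- function(url, ..., tries = 3, pause = 5) {
  for (i in seq_len(tries)) {
    res <- safe_GET(url, ...)
    # stop retrying as soon as we get a non-error HTTP response
    if (is.null(res$error) && !httr::http_error(res$result)) return(res$result)
    Sys.sleep(pause) # back off before the next attempt
  }
  stop("Failed to fetch ", url, " after ", tries, " tries")
}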
library(httr)
library(rvest)
library(tidyverse)
# Need to "prime" the session with a cookie
res <- GET(url="http://www.worldpop.org.uk/data/data_sources/")
# Get the page contents
pg <- content(res)
# Find the summary links
summary_link_nodes <- html_nodes(pg, xpath=".//a[contains(@href,'summary')]")
# extract the table cells & href so we can make a data frame
map(summary_link_nodes, html_nodes, xpath=".//../..") %>%
  map(html_children) %>%
  map(html_text) %>%
  map(~.[1:4]) %>%
  map(as.list) %>%
  map_df(set_names, c("continent", "country", "resolution", "data_type")) %>%
  bind_cols(
    data_frame(
      summary_link = sprintf("http://www.worldpop.org.uk%s", html_attr(summary_link_nodes, "href"))
    )
  ) -> world_pop_data
glimpse(world_pop_data)
## Observations: 462
## Variables: 5
## $ continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa", "Afri...
## $ country <chr> "Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burund...
## $ resolution <chr> "100m", "100m", "100m", "100m", "100m", "100m", "100m", "100m", "...
## $ data_type <chr> "Population", "Population", "Population", "Population", "Populati...
## $ summary_link <chr> "http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP0000...
# just see "Nigeria" data
filter(world_pop_data, country=="Nigeria")
## # A tibble: 8 x 5
## continent country resolution data_type summary_link
## <chr> <chr> <chr> <chr> <chr>
## 1 Africa Nigeria 100m Population http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00196
## 2 Africa Nigeria 1km Births http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00357
## 3 Africa Nigeria 1km Pregnancies http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00465
## 4 Africa Nigeria 1km Contraceptive Use http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00198
## 5 Africa Nigeria 1km Literacy http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00199
## 6 Africa Nigeria 1km Poverty http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00200
## 7 Africa Nigeria 1km Stunting http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00201
## 8 Africa Nigeria 100m Age structures http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00194
I'm fairly certain you can start any file download session from one of the URLs above, but you'll need to test that, as you may need to always start from the "main" page (as noted, the site maintains session position based on cookies).
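As an aside: httr re-uses one handle (and therefore one cookie jar) per host within an R session, which is why the "priming" GET above works. If you want to make that explicit, you can create a handle yourself and pass it to every request. A minimal sketch (the object name is mine):

wp_handle <- handle("http://www.worldpop.org.uk")

# prime the cookie on this handle, then re-use the same handle for every call
GET(url = "http://www.worldpop.org.uk/data/data_sources/", handle = wp_handle)
GET(url = world_pop_data$summary_link[1], handle = wp_handle)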
# get nigeria population URL
filter(world_pop_data, country=="Nigeria") %>%
  filter(data_type=="Population") %>%
  pull(summary_link) -> nigeria_pop
nigeria_pop
# [1] "http://www.worldpop.org.uk/data/summary?doi=10.5258/SOTON/WP00196"
# follow it
GET(url=nigeria_pop) -> res2
pg2 <- content(res2)
That page always has a <form> on it, so we need to "submit" that form with a POST:
# extract "form" fields (that page does a POST request)
fields <- html_nodes(pg2, "form#conform > input")
fields <- set_names(xml_attr(fields, "value"), html_attr(fields, "name"))
str(as.list(fields)) # just to show what it looks like
## List of 4
## $ zip_id : chr "140"
## $ zip_title: chr "Nigeria 100m Population"
## $ decoy : chr "website"
## $ website : chr NA
# submit the form with the field data.
# NOTE we need to add the `Referer` (the faux page we're on)
POST(
  url = "http://www.worldpop.org.uk/data/download/",
  add_headers(`Referer` = nigeria_pop),
  body = list(
    client_first_name = "",
    client_last_name = "",
    client_organization = "",
    client_country = "",
    client_email = "",
    client_message = "",
    zip_id = fields["zip_id"],
    zip_title = fields["zip_title"],
    decoy = fields["decoy"],
    website = "",
    download = "Browse Individual Files"
  ),
  encode = "form"
) -> res3
Somewhere on the resultant page is the "Switch to file list" link, so we need to find it and follow it:
# find the link that has the file list
pg3 <- content(res3)
html_nodes(pg3, xpath=".//a[contains(., 'switch to')]") %>%
  html_attr("href") -> file_list_query_string

file_list_query_string # just to see the format
## [1] "?dataset=140&action=dir"

# follow that link (we need to use some of the previously captured fields)
GET(
  url = "http://www.worldpop.org.uk/data/files/index.php",
  query = list(
    dataset = fields["zip_id"],
    action = "dir"
  )
) -> res4
Now, we build a data frame of all the links on that page:
pg4 <- content(res4)
data_frame(
  group_name = html_nodes(pg4, "a.dl") %>% html_text(),
  href = html_nodes(pg4, "a.dl") %>% html_attr("href")
) -> downloads
downloads
## # A tibble: 60 x 2
## group_name href
## <chr> <chr>
## 1 Licence.txt ?dataset=140&action=download&file=1
## 2 NGA_metadata.html ?dataset=140&action=download&file=2
## 3 NGA_pph_v2c_2006.tfw ?dataset=140&action=download&file=3
## 4 NGA_pph_v2c_2006.tif ?dataset=140&action=download&file=4
## 5 NGA_pph_v2c_2006.tif.aux.xml ?dataset=140&action=download&file=5
## 6 NGA_pph_v2c_2006.tif.xml ?dataset=140&action=download&file=6
## 7 NGA_pph_v2c_2010.tfw ?dataset=140&action=download&file=7
## 8 NGA_pph_v2c_2010.tif ?dataset=140&action=download&file=8
## 9 NGA_pph_v2c_2010.tif.aux.xml ?dataset=140&action=download&file=9
## 10 NGA_pph_v2c_2010.tif.xml ?dataset=140&action=download&file=10
## # ... with 50 more rows
While I noted earlier that you may need to always begin from the start or from that previous link page, you may also be able to download all of these sequentially. You'll need to test that, though. This is a painful site to deal with.
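If sequential downloads do turn out to work, a loop along these lines (untested against the live site; the output directory, timeout and sleep values are my own choices) would fetch everything in downloads, using the same query-string trick demonstrated on the README below:

dir.create("worldpop_nigeria", showWarnings = FALSE)

walk2(downloads$group_name, downloads$href, ~{
  res <- GET(
    url = "http://www.worldpop.org.uk/data/files/index.php",
    query = parse_url(.y)$query,
    timeout(300) # the server can take a while to prepare files
  )
  writeBin(content(res, as = "raw"), file.path("worldpop_nigeria", .x))
  Sys.sleep(5) # be kind to the server between requests
})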
filter(downloads, str_detect(group_name, "README")) %>%
  pull(href) -> readme_query_string # we need this below
readme_query_string
## [1] "?dataset=140&action=download&file=60"
# NOTE: there's a really good chance you'll need to use timeout() for
# some of these calls; the server takes a while to respond.
# Right here is where that "preparing the data" modal is shown.
# I'm 99% certain it's there to slow down crawlers/scrapers.
GET(
  url = "http://www.worldpop.org.uk/data/files/index.php",
  query = parse_url(readme_query_string)$query,
  verbose()
) -> res5
This is not what you'll really do in practice. You'll likely want to grab content(res5, as="raw") and writeBin() it, since some (most) of the files aren't plain text. But this is to show that the above all works:
content(res5, as="text") %>%
cat()
## WorldPop Africa dataset details
## _______________________
##
## DATASET: Alpha version 2010, 2015 and 2020 estimates of numbers of people per pixel ('ppp') and people per hectare ('pph'), with national totals adjusted to match UN population division estimates (http://esa.un.org/wpp/) and remaining unadjusted.
## REGION: Africa
## SPATIAL RESOLUTION: 0.000833333 decimal degrees (approx 100m at the equator)
## PROJECTION: Geographic, WGS84
## UNITS: Estimated persons per grid square
## MAPPING APPROACH: Random Forest
## FORMAT: Geotiff (zipped using 7-zip (open access tool): www.7-zip.org)
## FILENAMES: Example - NGA_ppp_v2b_2010_UNadj.tif = Nigeria (NGA) population per pixel (ppp), mapped using WorldPOP modelling version 2b (v2b) for 2010 (2010) adjusted to match UN national estimates (UNadj).
## DATE OF PRODUCTION: February 2017
##
## Also included: (i) Metadata html file, (ii) Google Earth file, (iii) Population datasets produced using original census year data (2006).
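For an actual (binary) file you would save the raw body instead of cat()ing it, e.g. (the filename here is just an example):

# write the response bytes to disk untouched
writeBin(content(res5, as = "raw"), "Africa_dataset_README.txt")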
If you do persist, consider adding an answer with what you ended up doing or turning it into a package so others can use it as well.