Search code examples
htmlrexport-to-csv

R script to download csv file from a web page where the data link is of data:text/csv


I am trying to write a R script that can download the csv file from the following website,"https://www.covidanalytics.io/projections There is a link for download data at the bottom of the page, which takes the form of "data:text/csv...". I was wondering if I can have a R script to download the file in csv format. Any help will be highly appreciated.

Thanks IC


Solution

  • There are easier ways to get this data, but it is possible with a bit of low-level work using the httr package.

    As @r2evans pointed out, this is a url-encoded csv built by Dash. To get the url, you need to request a json file containing the html page information using an xhr request. This needs all the correct headers as well as a json request in the body of the POST request:

    library(httr)
    
    page1 <- GET("https://www.covidanalytics.io/projections")
    
    H <- add_headers( `Host` = "www.covidanalytics.io",
                      `User-Agent` = paste("Mozilla/5.0 (Windows NT 6.1; rv:77.0)", 
                                           "Gecko/20100101 Firefox/77.0"),
                      `Accept` = "application/json",
                      `Accept-Language` = "en-GB,en;q=0.5",
                      `Accept-Encoding` = "gzip, deflate",
                      `Referer` = "https://www.covidanalytics.io/projections",
                      `Content-Type` = "application/json",
                      `X-CSRFToken` = "undefined",
                      `Origin` = "https://www.covidanalytics.io",
                      `Connection` = "keep-alive")
    
    post_data <- paste0('{"output":"page-content.children","outputs":{"id":',
                        '"page-content","property":"children"},"inputs":',
                        '[{"id":"url","property":"pathname","value":',
                        '"/projections"}],"changedPropIds":["url.pathname"]}')
    
    res <- httr::POST("https://www.covidanalytics.io/_dash-update-component", H,
                      body = post_data, encode = "raw")
    

    'res` now contains the json response, and our url-encoded csv is deep inside it. We get this parsed content and extract the string containing the url:

    body <- parsed_content(res)$response$`page-content`$children$props$children[[2]]
    div <- body$props$children[[10]]$props$children[[1]]
    url <- div$props$children$props$children$props$href
    

    Now we need to cut off the data:text/csv;charset=utf-8, part and unescape the url encoding. I actually found this was far quicker using nested gsubs, since my machine choked on URLdecode:

    csv <- strsplit(url, ",")[[1]][2]
    df <- read.csv(text = gsub("%0A", "\n", gsub("%20", " ", gsub("%2C", ",", csv))))
    

    Your data is now in df. It's big, so I'll show it as a tibble here:

    tidyr::as_tibble(df)
    #> # A tibble: 7,106 x 10
    #>    Continent Country Province Day   Total.Detected Active Active.Hospital~ Cumulative.Hosp~
    #>    <fct>     <fct>   <fct>    <fct> <fct>           <int>            <int>            <int>
    #>  1 Africa    Algeria None     2020~ 5651             1531              302              834
    #>  2 Africa    Algeria None     2020~ 5742             1514              300              848
    #>  3 Africa    Algeria None     2020~ 5831             1497              298              861
    #>  4 Africa    Algeria None     2020~ 5917             1477              296              874
    #>  5 Africa    Algeria None     2020~ 6000             1457              293              886
    #>  6 Africa    Algeria None     2020~ 6079             1435              291              898
    #>  7 Africa    Algeria None     2020~ 6156             1411              287              910
    #>  8 Africa    Algeria None     2020~ 6230             1387              284              921
    #>  9 Africa    Algeria None     2020~ 6300             1361              280              932
    #> 10 Africa    Algeria None     2020~ 6368             1335              277              942
    #> # ... with 7,096 more rows, and 2 more variables: Total.Detected.Deaths <int>,
    #> #   Active.Ventilated <int>