
How to scrape data from GDELT


I am struggling to scrape data from GDELT.

http://data.gdeltproject.org/events/index.html

I aim to write code that automatically downloads, unzips, and merges the files for specific periods, but despite numerous attempts I have not managed to do so.

Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.

I need your help.


Solution

  • The rvest package has the right tools for this. We extract the href attribute from every link (`<a>`) node, filter down to the paths that end in ".CSV.zip", and build the full URLs. We can then download each file, and readr::read_tsv() (readr >= 2.0) accepts a vector of paths, so it will unzip, read, and row-bind the files for us.

    library(rvest)
    library(tidyverse)
    
    gdelt_index_url <- 
      "http://data.gdeltproject.org/events"
    
    gdelt_dom <- read_html(gdelt_index_url)
    
    url_df <- 
      gdelt_dom |> 
      html_elements("a") |>  # html_elements() supersedes html_nodes()
      html_attr("href") |> 
      tibble() |> 
      set_names("path") |> 
      # Escape the dots: ".CSV.zip$" would match any character before "CSV"
      filter(str_detect(path, "\\.CSV\\.zip$")) |> 
      mutate(url = file.path(gdelt_index_url, path)) |> 
      slice(1:3) # For the purpose of demonstration we use only the first three files
      
    # Download each archive; walk2() discards the return values,
    # and mode = "wb" ensures binary downloads on Windows
    walk2(url_df$url,
          url_df$path,
          download.file,
          mode = "wb")
    
    # read_tsv() unzips each archive, reads it, and row-binds the results
    gdelt_event_data <- 
      read_tsv(url_df$path, col_names = FALSE)
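
  Since the question asks for specific periods, here is a minimal sketch of how the paths could be restricted by date before downloading. It assumes the daily GDELT event files are named with a leading YYYYMMDD date (e.g. "20240101.export.CSV.zip", as on the index page); `filter_period` is a hypothetical helper, not part of any package:

    library(tidyverse)
    
    # Hypothetical helper: keep only paths whose leading YYYYMMDD date
    # falls inside [from, to]. Paths without a leading date are dropped.
    filter_period <- function(paths, from, to) {
      dates <- as.Date(str_extract(paths, "^\\d{8}"), format = "%Y%m%d")
      paths[!is.na(dates) & dates >= as.Date(from) & dates <= as.Date(to)]
    }
    
    filter_period(
      c("20240101.export.CSV.zip", "20240215.export.CSV.zip"),
      from = "2024-01-01", to = "2024-01-31"
    )
    # keeps only "20240101.export.CSV.zip"

  The same call can replace `slice(1:3)` above, e.g. `filter(path %in% filter_period(path, "2024-01-01", "2024-01-31"))`.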