
How to scrape data from GDELT


I am struggling to scrape data from GDELT.

http://data.gdeltproject.org/events/index.html

I aim to write code that automatically downloads, unzips, and merges the files for specific periods, but despite numerous attempts I have not managed to do so.

Although the "gdeltr2" package exists, it does not retrieve some variables correctly from the original data.

I need your help.


Solution

  • The rvest package has the right tools for this. We extract the href attribute from every link (`<a>`) node, filter down to the paths that end in ".CSV.zip", and build the full URLs. We can then download each file, and readr::read_tsv() (readr >= 2.0) accepts a vector of paths, so it will unzip, read, and row-bind the files for us.

    library(rvest)
    library(tidyverse)
    
    gdelt_index_url <- 
      "http://data.gdeltproject.org/events"
    
    gdelt_dom <- read_html(gdelt_index_url)
    
    url_df <- 
      gdelt_dom |> 
      html_elements("a") |>  # html_elements() supersedes html_nodes()
      html_attr("href") |> 
      tibble() |> 
      set_names("path") |> 
      # Escape the dots: ".CSV.zip$" would match any character before "CSV"
      filter(str_detect(path, "\\.CSV\\.zip$")) |> 
      mutate(url = file.path(gdelt_index_url, path)) |> 
      slice(1:3) # For the purpose of demonstration we use only the first three files
      
    # Download each archive; walk2() discards the return values,
    # and mode = "wb" ensures binary downloads on Windows
    walk2(url_df$url,
          url_df$path,
          download.file,
          mode = "wb")
    
    # read_tsv() unzips each archive, reads it, and row-binds the results
    gdelt_event_data <- 
      read_tsv(url_df$path, col_names = FALSE)
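
  Since the question asks for specific periods, here is a minimal sketch of how the paths could be restricted by date before downloading. It assumes the daily GDELT event files are named with a leading YYYYMMDD date (e.g. "20240101.export.CSV.zip", as on the index page); `filter_period` is a hypothetical helper, not part of any package:

    library(tidyverse)
    
    # Hypothetical helper: keep only paths whose leading YYYYMMDD date
    # falls inside [from, to]. Paths without a leading date are dropped.
    filter_period <- function(paths, from, to) {
      dates <- as.Date(str_extract(paths, "^\\d{8}"), format = "%Y%m%d")
      paths[!is.na(dates) & dates >= as.Date(from) & dates <= as.Date(to)]
    }
    
    filter_period(
      c("20240101.export.CSV.zip", "20240215.export.CSV.zip"),
      from = "2024-01-01", to = "2024-01-31"
    )
    # keeps only "20240101.export.CSV.zip"

  The same call can replace `slice(1:3)` above, e.g. `filter(path %in% filter_period(path, "2024-01-01", "2024-01-31"))`.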