Search code examples
google-bigquerygdelt

Data collection from GDELT using bigquery


I am trying to construct an economic indicator based on all events with specific cameo codes from gdelt database.

So the idea is to collect data from 1990 to till date and see how economic cooperation varied based on news appearances of certain words. CAMEO codes 0211, 0311, 061, 1011 and 1211 in specific.

My query is how to extract this data for these specific cameo codes. If you can direct me to any source, it would be of great help.

One person suggested me that try using bigquery. I honestly don't know how to navigate the google bigquery page till now (I tried my best probably being from a non-tech background, it was a bit overwhelming for me). If any of you can help with one Cameo code data extraction example then I can play around with other events.

Edit: I am editing to show the progress I have made and the issues I am facing while running the query.

SELECT
   *
FROM
  [gdelt-bq:full.events]
WHERE
  Year >= 1979
  AND EventCode IN ('0211', '0311','061', '1011', '1211')
  AND Actor1CountryCode != Actor2CountryCode

This query will process 228 GB when run and also excludes the cases where both the country codes are null. It has over 2 million rows and I cant download this as a csv file from bigquery platform.

The part where I need help is the following, is there any way that I can get the total number of events for each event code which satisfies the following conditions

  1. Actor1Countrycode and Actor2CountryCode should be different except when they are null
  2. Count for each event code every month which satisfies the above condition.

PS: You can run the code given by Ben P in the answer below to see the number and type of columns in the database.

Edit2: There is another query that I am trying to write where in the AvgTone of an event with a specified code is greater than the average of AvgTone of all events in that particular month. Any leads on how to write this would be really helpful. Suppose, I add a WHERE clause wherein the AvgTone is greater than the average of AvgTone of all events for that particular period (MonthYear in this case). My doubt is how to write this in a query format.

SELECT
  MonthYear,
  COUNT(*)
FROM
  [gdelt-bq:full.events]
WHERE
  EventCode IN ('0211',
    '0311',
    '061')
  AND Actor1CountryCode != Actor2CountryCode
  AND AvgTone > (SELECT AVG(AvgTone) FROM [gdelt-bq:full.events] GROUP BY MonthYear ORDER BY MonthYear)
  GROUP BY
    MonthYear
  ORDER BY
    MonthYear

Error: ELEMENT can only be applied to result with 0 or 1 row.

Can someone help me with the above query? Thanks.


Solution

  • The GDELT database is available in BigQuery.

    Here is a link to their available datasets, your first step would to identify which contains the information you are interested in:

    https://blog.gdeltproject.org/the-datasets-of-gdelt-as-of-february-2016/

    Then this section of the site contains sample queries, which you can use as a starting point and try to tweak to your needs (note that these examples appear to me mostly in Legacy SQL, I would suggest you use them as a guide and rewrite then in Standard SQL):

    https://blog.gdeltproject.org/a-compilation-of-gdelt-bigquery-demos/

    If you have any specific SQL/BigQuery questions after you have done this I would recommend you come back with fresh questions and share examples of your working code, details what you have already tried and the results you expect to see.

    Having had a quick look, and I must say i am not familiar with the dataset, but this may be a simple query that can start you on your way:

     -- first we select all columns from the event dataset, which seems 
     -- to be the one you want, containing cameo codes
      SELECT * FROM `gdelt-bq.full.events`
      -- then we add a filter to only look at events in or after 1990
      WHERE Year >= 1990
      -- and another filter to look at only the specific camera 
      --codes you provided (I think EventCode is the correct column here, 
      AND EventCode IN ('0211','0311','061','1011','1211')
      -- finally, we add a limit to our query, so we don't bring back ALL 
      -- the results while testing, once we are happy with our query, we'd remove this!
      LIMIT 100
    

    Finally, the GDELT tag right here on StackOverflow contains some really great content.

    Hope that helps, GDELT looks like a fascinating project!