Unfortunately I have no experience with BigQuery or programming in general, but I need data from GDELT for my thesis and can't access it through the analysis tool. That's why I created this query to determine the number of protests in Iran in September 2022. However, the number I get as a result is much too high (compared e.g., ACLED). Do I need to refine my query so that I only get the count of individual events?
SELECT COUNT(\*) FROM gdelt-bq.gdeltv2.events_partitioned where \_PARTITIONTIME \>= TIMESTAMP('2022-09-01') and \_PARTITIONTIME \<= TIMESTAMP('2022-09-30') and EventRootCode = '14' and ActionGeo_CountryCode = 'IR' ;
(As another test: ACLED lists >10.600 demonstrations in the USA from 24.05-22.08.2020, with my query I receive 123.500 as result)
Your query is counting all mentions of protests, but GDELT records the same event multiple times from different sources, inflating the count.
To get the number of unique protests in Iran for September 2022, modify your query to count distinct events instead of all mentions.
SELECT COUNT(DISTINCT GLOBALEVENTID) AS unique_protests
FROM `gdelt-bq.gdeltv2.events_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2022-09-01') AND TIMESTAMP('2022-09-30')
AND EventRootCode = '14'
AND ActionGeo_CountryCode = 'IR';
You can filter based on confidence scores:
NumMentions > X
→ Only count events reported multiple times.AvgTone < 0
→ Focus on negative-toned events (protests usually have a negative tone).