My goal is to get the monthly number of protests in Mexico reported between the years 2004 and 2020. I am using Google BigQuery to get this data from the GDELT database.
My problem is that I am getting different results when running the same query on different tables.
select
GlobalEventID
,MonthYear
,ActionGeo_Long
,ActionGeo_Lat
from
gdelt-bq.full.events_partitioned -- Returns 34650 records
--gdelt-bq.gdeltv2.events_partitioned -- Returns 93551 records
where
_PARTITIONTIME >= TIMESTAMP('2004-01-01')
and _PARTITIONTIME <= TIMESTAMP('2020-12-31')
and EventRootCode = '14'
and ActionGeo_CountryCode = 'MX'
;
Can you tell me which table I should use and why the query results differ from each other?
According to the GDELT documentation, gdeltv2
contains more events, and is more up to date for recent years. However they may not have finished backpopulating it to 1979.
This query shows only 20340 of the 93563 event IDs existing in both tables, so for such a large time range you may get best results by using the v1 table before 2015, and the v2 table from 2015 onwards.
SELECT COUNT(*)
FROM gdelt-bq.gdeltv2.events_partitioned g2
JOIN gdelt-bq.full.events_partitioned g1 ON g1.GlobalEventID = g2.GlobalEventID
WHERE g2._PARTITIONTIME >= TIMESTAMP('2004-01-01')
AND g2._PARTITIONTIME <= TIMESTAMP('2020-12-31')
AND g2.EventRootCode = '14'
AND g2.ActionGeo_CountryCode = 'MX'
AND g1._PARTITIONTIME >= TIMESTAMP('2004-01-01')
AND g1._PARTITIONTIME <= TIMESTAMP('2020-12-31')
AND g1.EventRootCode = '14'
AND g1.ActionGeo_CountryCode = 'MX'