Search code examples
google-bigquerygdelt

How to get TONE from GDELT GKG table from Google BigQuery?


SELECT
DATE,
EXTRACT(YEAR FROM DATE) AS year,
FIPS as Country,
LOCATIONS,
AVG(TONE) as Avg_Tone,
AVG(Positive Score) as PositiveS,
AVG(Negative Score) as NegativeS,
COUNT(*),
From `gdelt-bq.gdeltv2.gkg_partitioned`,
`gdelt-bq.extra.sourcesbycountry` country,

Where
DATE(_PARTITIONTIME) BETWEEN TIMESTAMP('2002-01-01') AND TIMESTAMP('2020-12-31')
AND SourceCommonName=country.Domain
AND Location like '%CH%'
GROUP BY Year,Country
ORDER BY  Year,Country

Codebook link is http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf The V1.5TONE has TONE, Positive Score and Negative Score and so on. I want to count the average Tone by year. How to get it from Big Query.


Solution

  • There is the need to cast and split the fields first.

    • The date is a value formated as "yyyymmdd....". Therefore, I suggest casting the value as a string and consider the first four characters as the year.

    • There is no V1.5TONE column, but V2Tone. It consists of a string with several digital numbers seperated by a comma. The string has to be split first. Then each component needs to be casted a decimal number.

    • The table gdelt-bq.extra.sourcesbycountry should map an url to a single country. It has duplicated countries to one url! To eliminate at least some duplicated value the inner select with the group by is used.

    • To obtain the values grouped by Year,Country all other dimensional columns need to be commented out.

    SELECT
    #DATE,
    substr(cast(date as string),0,4) AS year,
    FIPS as Country,
    #LOCATIONS,
    AVG(cast(split(V2Tone,",")[safe_offset(0)] as decimal )) as Avg_Tone,
    AVG(cast(split(V2Tone,",")[safe_offset(1)] as decimal )) as PositiveS,
    AVG(cast(split(V2Tone,",")[safe_offset(2)] as decimal )) as NegativeS,
    COUNT(*) as counts,
    From `gdelt-bq.gdeltv2.gkg_partitioned`
    left join
    (SELECT Domain, FIPS 
    from `gdelt-bq.extra.sourcesbycountry`
    group by 1,2) country
    on  SourceCommonName=country.Domain
    
    Where
    DATE(_PARTITIONTIME) BETWEEN DATE('2020-01-01') AND DATE('2020-01-31')
    AND Locations like '%CH%'
    GROUP BY Year,Country
    ORDER BY  Year,Country
    

    Also in where segment: Do not mix DATE and TIMESTAMP.