Search code examples
google-bigquerygdelt

BigQuery: filter according to counts in nested field


I am trying to look up records that have 5 or more mentions of "BE" or "Belgium" in a nested field. The below query does not yield any results:

#standardSQL
SELECT

GKGRECORDID

FROM `gdelt-bq.gdeltv2.gkg_partitioned`
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')

and (V2Themes LIKE "%WB_%GROWTH%")

group by GKGRECORDID

having count(V2Locations LIKE "%BE%" OR V2Locations LIKE "%Belgium%")>5

I'd be grateful for any ideas.


Solution

  • If I am understanding your data correctly, you are trying to count the number of occurrences of “BE” or “Belgium” in each one of your V2Themes records. Therefore, in the following example the count should be 4?

    1#Russia#RS#RS##60#100#RS#2475;1#Venezuela#VE#VE##8#-66#VE#471;1#Venezuela#VE#VE##8#-66#VE#1435;1#Venezuela#VE#VE##8#-66#VE#1521;1#Venezuela#VE#VE##8#-66#VE#2409;1#Russian#RS#RS##60#100#RS#2440;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#673;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#2342;4#Quito, Pichincha,

    If that is correct, one possible workaround would be the one explained here. Translating this solution into your needs (counting words instead of characters), I suggest to use the SPLIT method to divide the string with a given separator and count its elements with and without the strings you are searching for.This would be one solution for your problem:

    #standardSQL
    SELECT
    
    GKGRECORDID,
    
    (ARRAY_LENGTH(SPLIT(V2Locations, '#')) - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#BE', ''), "#"))) + (ARRAY_LENGTH(SPLIT(V2Locations, '#')) - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#Belgium', ''), "#"))) as bel_num,
    
    V2Locations
    
    FROM `gdelt-bq.gdeltv2.gkg_partitioned`
    where _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
    
    and (V2Themes LIKE "%WB_%GROWTH%")
    
    group by GKGRECORDID, V2Locations
    
    having bel_num<5