I am trying to look up records that have 5 or more mentions of "BE" or "Belgium" in a nested field. The below query does not yield any results:
#standardSQL
SELECT
GKGRECORDID
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
and (V2Themes LIKE "%WB_%GROWTH%")
group by GKGRECORDID
having count(V2Locations LIKE "%BE%" OR V2Locations LIKE "%Belgium%")>5
I'd be grateful for any ideas.
If I am understanding your data correctly, you are trying to count the number of occurrences of “BE” or “Belgium” in each one of your V2Themes records. Therefore, in the following example the count should be 4?
1#Russia#RS#RS##60#100#RS#2475;1#Venezuela#VE#VE##8#-66#VE#471;1#Venezuela#VE#VE##8#-66#VE#1435;1#Venezuela#VE#VE##8#-66#VE#1521;1#Venezuela#VE#VE##8#-66#VE#2409;1#Russian#RS#RS##60#100#RS#2440;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#673;4#Brussels, Bruxelles-Capitale, Belgium#BE#BE11#5850#50.8333#4.33333#-1955538#2342;4#Quito, Pichincha,
If that is correct, one possible workaround would be the one explained here. Translating this solution into your needs (counting words instead of characters), I suggest to use the SPLIT method to divide the string with a given separator and count its elements with and without the strings you are searching for.This would be one solution for your problem:
#standardSQL
SELECT
GKGRECORDID,
(ARRAY_LENGTH(SPLIT(V2Locations, '#')) - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#BE', ''), "#"))) + (ARRAY_LENGTH(SPLIT(V2Locations, '#')) - ARRAY_LENGTH(SPLIT(REPLACE(V2Locations, '#Belgium', ''), "#"))) as bel_num,
V2Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-10-09') AND TIMESTAMP('2019-10-09')
and (V2Themes LIKE "%WB_%GROWTH%")
group by GKGRECORDID, V2Locations
having bel_num<5