I am trying to filter a dataset by comparing a list of strings to column values.
This works well using "LIKE" and one string, using 3GB.
#standardSQL
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64))
ridanxietycnt,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-
02-02')
AND V2Themes LIKE 'ECON_INFLATION'
group by daydate
However, when using more than one string with "LIKE", the query gets suddenly very large (8 TB).
#standardSQL
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64))
ridanxietycnt,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-
02-02')
AND V2Themes LIKE 'ECON_INFLATION' OR V2Themes LIKE 'ECON_STOCKMARKET'
group by daydate
Is there a more efficient (and cheaper) way to compare column values to a list of strings? Any ideas would be much appreciated.
Careful with logic and OR
precedence.
This:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND V2Themes LIKE 'ECON_INFLATION'
OR V2Themes LIKE 'ECON_STOCKMARKET'
is not the same than:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
AND (
V2Themes LIKE 'ECON_INFLATION'
OR V2Themes LIKE 'ECON_STOCKMARKET'
)
The first one doesn't one the partitioning filters, but the second one does.