Search code examples
sql-likegoogle-bigquery

How to filter column values according to a list of strings


I am trying to filter a dataset by comparing a list of strings to column values.

This works well using "LIKE" and one string, using 3GB.

#standardSQL

SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64)) 
ridanxietycnt, 
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount   
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019- 
02-02') 
AND V2Themes LIKE 'ECON_INFLATION'

group by daydate

However, when using more than one string with "LIKE", the query gets suddenly very large (8 TB).

#standardSQL

SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
count(1) as count,
avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "")AS FLOAT64)) tone,
avg(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c1.3:(\d+)') as FLOAT64)) anew,
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'c12.1:(\d+)') as FLOAT64)) 
ridanxietycnt, 
sum(SAFE_CAST(REGEXP_EXTRACT(GCAM, r'wc:(\d+)') as FLOAT64)) wordcount   
FROM `gdelt-bq.gdeltv2.gkg_partitioned` t
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019- 
02-02') 
AND V2Themes LIKE 'ECON_INFLATION' OR V2Themes LIKE 'ECON_STOCKMARKET'
group by daydate

Is there a more efficient (and cheaper) way to compare column values to a list of strings? Any ideas would be much appreciated.


Solution

  • Careful with logic and OR precedence.

    This:

    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02') 
    AND V2Themes LIKE 'ECON_INFLATION' 
    OR V2Themes LIKE 'ECON_STOCKMARKET'
    

    is not the same than:

    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02') 
    AND (
      V2Themes LIKE 'ECON_INFLATION' 
      OR V2Themes LIKE 'ECON_STOCKMARKET'
    )
    

    The first one doesn't one the partitioning filters, but the second one does.