Search code examples
regexgoogle-bigquerygdelt

Bigquery Standard Dialect REGEXP_REPLACE input type


I am exploring the power of Google Biguery with the GDELT database using this tutorial however the sql dialect is in 'legacy' and I would like to use the standard dialect.

In legacy dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300

and when I try to translate into standard dialect:

SELECT
  theme,
  COUNT(*) AS count
FROM (
  SELECT
    REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
    FROM
      `gdelt-bq.gdeltv2.gkg`
    WHERE
      DATE>20150302000000
      AND DATE < 20150304000000
      AND V2Persons LIKE '%Netanyahu%' )
  GROUP BY
    theme
  ORDER BY
    2 DESC
  LIMIT
    300

it throws the following error:

No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]

it seems like I have to cast the result of the SPLIT() operation as a string. How do I do this?

UPDATE: I found a talk explaining the unnest operation:

SELECT
  COUNT(*),
  REGEXP_REPLACE(themes,",.*","") AS theme
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`,
  UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
  _PARTITIONTIME >= "2018-08-09 00:00:00"
  AND _PARTITIONTIME < "2018-08-10 00:00:00"
  AND V2Persons LIKE '%Netanyahu%'
GROUP BY
  theme
ORDER BY
  2 DESC
LIMIT
  100

Solution

  • Flatten the array first:

    SELECT
      REGEXP_REPLACE(theme , r',.*', " ") AS theme,
      COUNT(*) AS count
    FROM
      `gdelt-bq.gdeltv2.gkg`,
      UNNEST(SPLIT(V2Themes,';')) AS theme
    WHERE
      DATE>20150302000000
      AND DATE < 20150304000000
      AND V2Persons LIKE '%Netanyahu%' 
    GROUP BY
      theme
    ORDER BY
      2 DESC
    LIMIT
      300
    

    The legacy SQL equivalent in your question actually has the effect of flattening the array as well, although it's implicit in the GROUP BY on the theme.