Search code examples
sqlgoogle-bigquerygdelt

Is there a way to remove characters in an array of string in BigQuery?


Using the GDELT public database in Google query, I am trying to find the top themes associated with Israeli Prime Minister Benjamin Netanyahu around March 3, 2015.

I used the following SQL query

SELECT theme, COUNT(*) as count
FROM (
select REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*', '') theme
from `gdelt-bq.gdeltv2.gkg`
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300

The split is necessary because the V2Themes column uses nested listings. I then want to remove the character offset. This should give me the following:

Row theme               count    
1   GENERAL_GOVERNMENT  33677    
2   LEADER              33405    
3   TAX_FNCACT_MINISTER 31174
4   ....                 ...

But I get an error instead:

No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING.    

I understand that SPLIT() creates an array of strings, but I do not know how to resolve this issue.

Is there another function I should use, or can this problem be resolved in another way?

*EDIT

The query works when it is run with Legacy SQL (also need to change the quotation marks to squared brackets). How can I achieve the same using standard SQL?


Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    SELECT theme, COUNT(*) AS COUNT
    FROM (
      SELECT REGEXP_REPLACE(value, r',.*', '') theme
      FROM `gdelt-bq.gdeltv2.gkg`, UNNEST(SPLIT(V2Themes,';')) value
      WHERE DATE>20150302000000 AND DATE < 20150304000000 
      AND V2Persons LIKE '%Netanyahu%'
    )
    GROUP BY theme
    ORDER BY 2 DESC
    LIMIT 300  
    

    Be aware of following: when you run query and it fails - the cost for you is zero (in most common cases).

    This is not a case when your query is correct (as above one in this answer) - in such cases query will successfully complete and you will be billed as per

    in Classic UI:

    enter image description here

    or in BigQuery Console:

    enter image description here