Search code examples
sqlgoogle-bigquerysubstring

Exclude a string starting with certain letter by a SQL query in Google BigQuery


I'm trying to exclude values from a larger public table in Googles BigQuery using the following SQL lines. The last line has the purpose to exclude entries, that start with a certain letter, e.g. 'C'. For some reason, when I add the line, the count increases. Logically the selected rows should decrease and I can’t figure out why.

How can I make the exclusion work?

SELECT Count(*)
FROM `patents-public-data.patents.publications`,
unnest(description_localized) as description_unnest,
unnest(claims_localized) as claims_unnest,
unnest(cpc) as cpc_unnest
where description_unnest.language in ('en','EN')
and claims_unnest.language in ('en','EN')
and publication_date >19900101
and (SUBSTRING(cpc_unnest.code,1,1) <> 'C');

Solution

  • OK. I think I found the mistakes I made

    1. I compared the number of cases with and without this line #7. That was leading to the increased rows.

    #7 unnest(cpc) as cpc_unnest

    1. THIS IS MOST IMPORTANT: I did not want to know the number of rows, but the number of unique entries. As the table is build up according to the publication numbers, I can use this number to search for unique entries. The SQL command Count(DISTINCT attribut) can be used:

    SELECT Count(DISTINCT publication_number)

    The whole solution is this

    SELECT Count(DISTINCT publication_number)
    FROM `patents-public-data.patents.publications`,
    unnest(description_localized) as description_unnest,
    unnest(claims_localized) as claims_unnest,
    unnest(cpc) as cpc_unnest
    where description_unnest.language in ('en','EN')
    and claims_unnest.language in ('en','EN')
    and publication_date >19900101
    and (SUBSTRING(cpc_unnest.code,1,1) <> 'C')