Search code examples
sqlgoogle-cloud-platformgoogle-bigquerygithub-api

SQL Query validation failure on GCP BigQuery with github_repos dataset


I would like to get a list all unique repositories on GutHub by using the following command:

SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`

However I get the following error:

Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]

In the schema it says repo_name is of type STRING, what am I doing wrong?


Solution

  • As another user posted, in the schema of the bigquery-public-data.github_repos.commits table you can see that the repo_name field is defined as a STRING REPEATED which means that each entry of repo_name is an array constituted by string-type elements. You can see this with the following query:

    #standardSQL
    SELECT repo_name 
    FROM `bigquery-public-data.github_repos.commits` 
    LIMIT 100;
    

    In order to find the distinct repo names you can employ the UNNEST operator to expand each one of the repo_name elements. The following query performs a CROSS JOIN that adds a new field repo_name_single to the table constituted by the individual repository names. This way, the DISTINCT function can be employed.

    #standardSQL
    SELECT DISTINCT(repo_name_unnest) 
    FROM `bigquery-public-data.github_repos.commits` 
    CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;