I would like to get a list all unique repositories on GutHub by using the following command:
SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`
However I get the following error:
Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]
In the schema it says repo_name is of type STRING, what am I doing wrong?
As another user posted, in the schema of the bigquery-public-data.github_repos.commits
table you can see that the repo_name
field is defined as a STRING REPEATED which means that each entry of repo_name
is an array constituted by string-type elements. You can see this with the following query:
#standardSQL
SELECT repo_name
FROM `bigquery-public-data.github_repos.commits`
LIMIT 100;
In order to find the distinct repo names you can employ the UNNEST
operator to expand each one of the repo_name
elements. The following query performs a CROSS JOIN that adds a new field repo_name_single
to the table constituted by the individual repository names. This way, the DISTINCT
function can be employed.
#standardSQL
SELECT DISTINCT(repo_name_unnest)
FROM `bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;