google-bigquery bigquery-public-datasets

Separating tags with BigQuery public dataset for stackoverflow


Google makes a public dataset of Stack Overflow content available. We can read about this here. When I log in to the GCP Cloud Console, visit the BigQuery page, and submit the following query:

select id, tags from `bigquery-public-data.stackoverflow.posts_questions` limit 10

The resulting table that is shown to me shows the tags field as concatenated items.

If I look at the JSON, I seem to see the same.

My supposition was that the tags would be delimited by the '|' character but the data seems to show otherwise. I'm hoping to understand this better. My end goal is to perform queries to find all questions that contain a given tag.


Solution

  • The cause turned out to be an error in how the Stack Overflow source data was transformed into BigQuery tables. Google opened an issue for it and later reported that it had been fixed. As a result, the behavior described in this post was transient and is unlikely to be reproducible or relevant in the future.
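
For the original goal of finding all questions that carry a given tag, a query along the following lines may work once the tags are delimited. This is only a sketch: it assumes the tags field is a single string with tags separated by '|' (as the question expected), and 'google-bigquery' is just an illustrative tag value, not something taken from the post.

```sql
-- Assumption: tags is one string delimited by '|', e.g. 'python|pandas'.
-- 'google-bigquery' is an illustrative tag; substitute the one you need.
SELECT id, tags
FROM `bigquery-public-data.stackoverflow.posts_questions`
-- SPLIT turns the string into an array; IN UNNEST tests for an exact element match.
WHERE 'google-bigquery' IN UNNEST(SPLIT(tags, '|'))
LIMIT 10
```

Splitting before matching avoids the partial hits that a substring test such as LIKE '%java%' would return (it would also match 'javascript').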