I have a table in BigTable, with a single column family, containing some lead data. I was following the Google Cloud guide to querying BigTable data from BigQuery (https://cloud.google.com/bigquery/external-data-bigtable) and so far so good.
I've created the table definition file, as the docs require:
{
    "sourceFormat": "BIGTABLE",
    "sourceUris": [
        "https://googleapis.com/bigtable/projects/{project_id}/instances/{instance_id}/tables/{table_id}"
    ],
    "bigtableOptions": {
        "readRowkeyAsString": "true",
        "columnFamilies": [
            {
                "familyId": "leads",
                "columns": [
                    {
                        "qualifierString": "Id",
                        "type": "STRING"
                    },
                    {
                        "qualifierString": "IsDeleted",
                        "type": "STRING"
                    },
                    ...
                ]
            }
        ]
    }
}
But then, things started to go south...
This is how the BigQuery "table" ended up looking:
Each row is a rowkey, and inside each column there's a nested cell structure, where the only value I need (in this case) is the value from leads.Id.cell.
After a bit of searching I found a solution to this: https://stackoverflow.com/a/70728545/4183597
So in my case it would be something like this:
SELECT
    ARRAY_TO_STRING(ARRAY(SELECT value FROM UNNEST(leads.Id.cell)), "") AS Id,
    ...
FROM xxx
The problem is that I'm dealing with a dataset with more than 600 columns per row. It is infeasible (and impossible, given BigQuery's subquery limits) to repeat this process more than 600 times per row/query.
I couldn't figure out a way to automate this query, or think of other methods to unnest this many cells (my SQL knowledge stops here).
Is there any way to do unnesting like this for 600+ columns with a SQL/BigQuery query, preferably in a more efficient way? If not, I'm thinking of doing a daily batch process, using a simple Python connector from BigTable to BigQuery, but I'm afraid of the costs this would incur.
Any documentation, reference or idea will be greatly appreciated.
Thank you.
In general, you're setting yourself up for a world of pain when you try to query a NoSQL database (like BigTable) using SQL. Unnesting data is a very expensive operation in SQL because each UNNEST effectively performs a cross join between the row and its array elements, so trying to do that 600+ times will give you either a query timeout or a huge bill.
The BigTable API will be way more efficient than SQL since it's designed to query NoSQL structures. A common pattern is to have a script that runs daily (such as a Python script in a Cloud Function) and uses the API to get that day's data, parse it, and then output that to a file in Cloud Storage. Then you can query those files via BigQuery as needed. A daily script that loops through all the columns of your data without requiring extensive data transforms is usually cheap and definitely less expensive than trying to force it through SQL.
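For illustration, here's a minimal sketch of that pattern in Python, assuming a Bigtable instance named my-instance, a table and column family both named leads, and a Cloud Storage bucket my-bucket. Every one of those names (and the project id) is a placeholder for your own setup:

import json
from datetime import date

from google.cloud import bigtable, storage
from google.cloud.bigtable import row_filters

PROJECT_ID = "my-project"  # placeholder

# Read the Bigtable rows, keeping only the latest cell per column.
# Add a timestamp or rowkey filter here if you only want that day's data.
bigtable_client = bigtable.Client(project=PROJECT_ID)
table = bigtable_client.instance("my-instance").table("leads")
rows = table.read_rows(filter_=row_filters.CellsColumnLimitFilter(1))

# Flatten each row into a plain dict: one key per column qualifier.
lines = []
for row in rows:
    record = {"rowkey": row.row_key.decode("utf-8")}
    for qualifier, cells in row.cells["leads"].items():
        record[qualifier.decode("utf-8")] = cells[0].value.decode("utf-8")
    lines.append(json.dumps(record))

# Write newline-delimited JSON to Cloud Storage; BigQuery can load it or
# query it in place as an external table with flat columns.
blob = storage.Client(project=PROJECT_ID).bucket("my-bucket").blob(
    f"leads/{date.today().isoformat()}.json"
)
blob.upload_from_string("\n".join(lines), content_type="application/json")

From there the files in Cloud Storage give you flat columns, so the 600+ UNNESTs disappear entirely.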
That being said, if you're really set on using SQL, you might be able to use BigQuery's JSON functions to extract the nested data you need. It's hard to visualize what your data structure is without sample data, but you may be able to read the whole row in as a single column of JSON or a string. Then if you have a predictable path for the values you are looking to extract, you could use a function like JSON_EXTRACT_STRING_ARRAY to extract all of those values into an array. A Regex function could be used similarly as well. But if you need to do this kind of parsing on the whole table in order to query it, a batch job to transform the data first will still be much more efficient.
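As a rough sketch of that JSON/regex idea (assuming the external table is my_dataset.leads_external and the rowkey is read as a string; both names are placeholders), you could serialize each row's leads struct to a JSON string and pull every cell value out with a regex. Note that this gives you the values but not which qualifier each one came from:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

# TO_JSON_STRING turns the whole `leads` struct into one JSON string per row;
# REGEXP_EXTRACT_ALL then returns every "value" field as an ARRAY<STRING>.
query = """
SELECT
    rowkey,
    REGEXP_EXTRACT_ALL(TO_JSON_STRING(leads), r'"value":"([^"]*)"') AS cell_values
FROM `my-project.my_dataset.leads_external`
"""

for row in client.query(query).result():
    print(row.rowkey, row.cell_values)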