google-bigquery, http-status-code-404, google-cloud-storage

Reading data from GCS with BigQuery fails with "Not Found", but the data (files) exist


I have a service that is constantly updating files in a GCS bucket using a Hive-style partition layout:

bucket
    device_id=aaaa
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ....
    device_id=bbbb
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ....

If today we are in month=02 and I run the following in BigQuery:

SELECT DISTINCT event_id
FROM `project_id.dataset.table` 
WHERE month = '02';

I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet

I checked, and the file was there when the query ran.

If I run

SELECT DISTINCT event_id
FROM `project_id.dataset.table` 
WHERE month = '01';

I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it. But as I understand it, this should not happen with GCS; this is from their docs:

Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.

I saw some posts suggesting this could be related to my bucket being Multi-region.

Any other insights?


Solution

  • There are a few reasons why you could get this error:

    • When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
    • Regarding consistency: while metadata updates are strongly consistent for read-after-metadata-update operations, the changes can still take time to finish for buckets.
    • Using a Multi-region bucket is not recommended for this workload.

    In this case, it is likely a consistency/timing issue: your service is updating the files in GCS at the same time you are executing the query. When you ran the first query, the Parquet file was available to read, so you got no error; the next time, the file was being replaced by the service at the moment BigQuery tried to read it, so you got the error.
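One plausible mechanism can be sketched with a toy model (plain Python, not the real GCS or BigQuery APIs; the names `FakeBucket` and `NotFound` are illustrative): overwriting a GCS object replaces its live generation, so a query that pinned the old generation when it planned the read can no longer find it, even though the path still "exists".

```python
class NotFound(Exception):
    """Stand-in for the 404 the query engine surfaces."""

class FakeBucket:
    """Toy model: each overwrite replaces the live object generation."""
    def __init__(self):
        self._next_gen = 1
        self.live = {}    # object name -> current generation
        self.blobs = {}   # (name, generation) -> contents

    def upload(self, name, data):
        old = self.live.get(name)
        if old is not None:
            self.blobs.pop((name, old))   # the old generation disappears
        gen = self._next_gen
        self._next_gen += 1
        self.live[name] = gen
        self.blobs[(name, gen)] = data

    def read(self, name, gen):
        if (name, gen) not in self.blobs:
            raise NotFound(f"Not found: Files /{name}")
        return self.blobs[(name, gen)]

bucket = FakeBucket()
path = "bucket_name/device_id=aaaa/month=02/part-0.parquet"
bucket.upload(path, "rows v1")

# "Planning" phase: the engine pins the generations that are live now.
plan = list(bucket.live.items())

# The writer replaces the file between planning and reading...
bucket.upload(path, "rows v2")

# ...so reading the pinned generation fails, even though `path` still exists.
try:
    for name, gen in plan:
        bucket.read(name, gen)
except NotFound as err:
    print(err)   # Not found: Files /bucket_name/device_id=aaaa/month=02/part-0.parquet
```

Checking the bucket afterwards shows the file present, which matches the symptom described above: the object is there, but not the generation the query was reading.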

    Unfortunately, there is no simple way to solve this problem, but here are some options:

    • Add a Pub/Sub notification to the bucket and/or files and kick off your query after the service has finished updating them.
    • Make a workflow that blocks updates to the files in the bucket until the query finishes.
    • If the query fails with "not found" for file ABCD and you have verified that ABCD exists in GCS, retry the query X times.
    • Back up your data to another location that you update only once a day, rather than constantly.
    • Move the data into managed storage, where you won't have this problem because you can use snapshotting.
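For the retry option, here is a minimal sketch. The names `run_with_retry` and `TransientNotFound` are illustrative, not a real API; with the actual BigQuery client you would catch the not-found error the library raises when a listed file has disappeared.

```python
import time

class TransientNotFound(Exception):
    """Stand-in for the not-found error raised by the BigQuery client."""

def run_with_retry(run_query, max_attempts=3, delay_s=5.0):
    """Call `run_query` (a zero-argument callable that executes the query
    and returns its results) and retry on a not-found error, assuming the
    writer is mid-replace and will finish shortly."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except TransientNotFound:
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)   # give the writer time to finish replacing the file
```

Keep `max_attempts` small and the delay longer than a typical file rewrite; if the query still fails after the last attempt, the error is re-raised so the caller can see it.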