Search code examples
google-bigqueryjobs

Load order of entires in big query tables


I have some sample data that I've been loading into Google BigQueries. I have been importing the data in ndjson format. If I load the data all in one file, I see them show up in a different order in the table's preview tab than when I sequentially import them one ndjson line at a time.

When importing sequentially I wait till I see the following output:

Waiting on bqjob_XXXX ... (2s) Current status: RUNNING
Waiting on bqjob_XXXX ... (2s) Current status: DONE 

The order the rows show up seems to match the order I append them as the job importing them seem to finish before I move on to the next. But when loading them all in one file, they show up in a different order than they exist in my data file.

So why do the data entries show up in a different order when loading in bulk? How are the data entries queued to be loaded and also how are they indexed into the table?


Solution

  • BigQuery has no notion of indexes. Data in BigQuery tables have no particular order that you can rely on. If you need to get ordered data out of BigQuery you will need to use explicit ORDER BY in your query - which btw quite not recommended for large results as it increases resource cost and can end up with Resources Exceeded error.

    BigQuery internal storage can "shuffle" your data rows internally for the best / most optimal performance of querying. So again - there is no such things as physical order of data in BigQuery tables

    Oficial language in docs is like this - line ordering is not guaranteed for compressed or uncompressed files.