
Unexpected behavior from 'table.extract()' in Google datalab jupyter BigQuery notebook


I'm studying the datalab jupyter notebook tutorial ~/datalab/tutorials/BigQuery/'Importing and Exporting Data.ipynb'. I'm having trouble understanding the behavior of:

table.extract(destination = sample_bucket_object)

The CSV produced by this extraction contains:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000
1999,Chevy,Venture Extended Edition,,4900
1999,Chevy,Venture Extended Edition,Very Large,5000
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799

This looks incomplete: it contains only the 4 rows that were loaded into the table from cars.csv when the table was first populated by:

sample_table.load('gs://cloud-datalab-samples/cars.csv', mode='append',
                  source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))

It ignores the additional 2 rows from cars2.csv added by the commands:

cars2 = storage.Item('cloud-datalab-samples', 'cars2.csv').read_from()
df2 = pd.read_csv(StringIO(cars2))
df2.fillna(value='', inplace=True)
sample_table.insert_data(df2)

which did make it into the table:

%%sql
SELECT * FROM sample.cars

gives:

Year  Make   Model                     Description                        Price
1997  Ford   E350                      ac, abs, moon                      3000
1999  Chevy  Venture Extended Edition                                     4900
1999  Chevy  Venture Extended Edition  Very Large                         5000
1996  Jeep   Grand Cherokee            MUST SELL! air, moon roof, loaded  4799
2015  Tesla  Model S                                                      64900
2010  Honda  Civic                                                        15000

As a test I swapped cars.csv and cars2.csv in the notebook and re-ran all the commands. table.extract() then exported only the cars2.csv rows:

Year,Make,Model,Description,Price
2010,Honda,Civic,,15000
2015,Tesla,Model S,,64900

What am I missing here?


Solution

  • My guess is that sample_table.insert_data(df2) is using the streaming API to insert data into the table. Data inserted in this way can take some time to become available via copy and export operations, but is immediately available for querying (see the sketch after the quoted docs below).

    From BigQuery's documentation on streaming:

    Data can take up to 90 minutes to become available for copy and export operations. To see whether data is available for copy and export, check the tables.get response for a section named streamingBuffer. If that section is absent, your data should be available for copy or export.
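
    If you want to verify this before exporting, you can check whether the table still has a streaming buffer. Below is a minimal sketch using the standalone google-cloud-bigquery client rather than the datalab bq module from the notebook (an assumption about your environment); the project, dataset, and bucket names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table('my-project.sample.cars')  # placeholder table ID

    if table.streaming_buffer is None:
        # No streamingBuffer section in the tables.get response: streamed
        # rows have been committed, so a copy or export job should see them.
        job = client.extract_table(table, 'gs://my-bucket/cars.csv')  # placeholder bucket
        job.result()  # block until the export job finishes
    else:
        print('Rows are still in the streaming buffer; an extract would miss them.')

    The 90 minutes in the docs is an upper bound, so polling streaming_buffer until it is None and only then calling extract is one possible workaround.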