google-cloud-datalab

Getting large datasets into Cloud Datalab


Is it possible to get large datasets into a pandas DataFrame?

My dataset is approx. 1.5 GB uncompressed (input for clustering), but when I try to select the contents of the table using bq.Query(...) it throws an exception:

RequestException: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors

Looking at https://cloud.google.com/bigquery/querying-data?hl=en, which states:

You must specify a destination table.

It feels like the only place to send large query results is another table (and then export to GCS and download).

There will also be a (possibly large) write-back as the classified rows are written back to the database.

The same dataset runs fine on my 16 GB laptop (in a matter of minutes), but I am looking at migrating to Datalab as our data moves to the cloud.

Thank you very much; any help is appreciated.


Solution

  • If you already have your results in a Table, you can just use Table.to_dataframe().

    Otherwise you will need to run a Query using execute() with a destination table name specified (as you noted) and the allow_large_results=True parameter, after which you can call to_dataframe() as above. A minimal sketch of both paths follows this list.

    Note that you may have issues with this; the default VM that runs the Python kernel is pretty basic. In the meantime, you can deploy Datalab to a more capable VM via URL parameters. For example:

    http://datalab.cloud.google.com?cpu=2&memorygb=16
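
For reference, here is a minimal sketch of both paths in a Datalab notebook cell. The table names are placeholders, and the exact execute() keyword arguments (e.g. table_name) may differ between Datalab releases, so treat this as an outline rather than copy-paste code:

```python
# Runs inside a Datalab notebook; on older Datalab builds the import may be
# `import gcp.bigquery as bq` instead.
import datalab.bigquery as bq

# Path 1: the results already live in a BigQuery table -- pull them into pandas.
# 'mydataset.cluster_input' is an illustrative table name, not from the question.
df = bq.Table('mydataset.cluster_input').to_dataframe()

# Path 2: run the query with a destination table and large results enabled,
# then read that destination table into pandas as above.
query = bq.Query('SELECT * FROM mydataset.source_table')  # your clustering query
query.execute(table_name='mydataset.cluster_results',     # destination table (assumed kwarg name)
              allow_large_results=True)                    # avoids "Response too large"
df = bq.Table('mydataset.cluster_results').to_dataframe()

df.head()
```

Materializing ~1.5 GB into a DataFrame also needs a VM with enough memory, which is where the cpu/memorygb URL parameters above come in.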