Is it possible to get large datasets into a pandas DataFrame?
My dataset is approx. 1.5 GB uncompressed (input for clustering), but when I try to select the contents of the table using bq.Query(...)
it throws an exception:
RequestException: Response too large to return. Consider setting allowLargeResults to true in your job configuration. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors
Looking at https://cloud.google.com/bigquery/querying-data?hl=en, which states:
You must specify a destination table.
It feels like the only place to send large query results is another table (and then export to GCS and download).
There will also be a (possibly large) write back as the classified rows are written back to the database.
The same dataset runs fine on my 16 GB laptop (a matter of minutes), but I am looking at migrating to Datalab as our data moves to the cloud.
Thank you very much; any help is appreciated.
If you already have your results in a table, you can just use Table.to_dataframe().
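A minimal sketch of that first case, assuming the Datalab `bq` module from your question; the import path and the table name `mydataset.mytable` are placeholders for your setup:

```python
import gcp.bigquery as bq  # placeholder import; use whatever module provides bq.Query in your Datalab

# Load an existing BigQuery table straight into a pandas DataFrame
df = bq.Table('mydataset.mytable').to_dataframe()
print(df.shape)
```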
Otherwise, you will need to run the query with execute(), specifying a destination table name (as you noted) and the allow_large_results=True parameter; after that you can make the to_dataframe() call as above.
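A hedged sketch of that second case; the `table_name` keyword and the destination table `mydataset.cluster_input` are assumptions, so check the execute() signature in your Datalab version:

```python
import gcp.bigquery as bq  # placeholder import, as above

# Materialize the (large) query result into a destination table,
# which avoids the "Response too large to return" error
query = bq.Query('SELECT * FROM mydataset.source_table')
query.execute(table_name='mydataset.cluster_input',  # keyword name is an assumption
              allow_large_results=True)

# Then pull the destination table into pandas as in the first case
df = bq.Table('mydataset.cluster_input').to_dataframe()
```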
Note that you may have issues with this; the default VM that runs the Python kernel is pretty basic. In the meantime, you can deploy Datalab to a more capable VM using URL parameters; for example:
http://datalab.cloud.google.com?cpu=2&memorygb=16