python, google-app-engine, bulkloader

Long (and failing) bulk data loads to Google App Engine datastore


I'm developing an application on Google App Engine using the current django-nonrel and the now-default High Replication datastore. I'm currently trying to bulk load a 180MB CSV file locally on a dev instance with the following command:

appcfg.py upload_data --config_file=bulkloader.yaml --filename=../my_data.csv --kind=Place --num_threads=4 --url=http://localhost:8000/_ah/remote_api --rps_limit=500
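For context, the --url above points at the remote_api endpoint on the dev instance. Assuming an app.yaml-based setup (rather than a custom Django URL handler), that endpoint is enabled with the builtin:

builtins:
- remote_api: on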

bulkloader.yaml

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:

- kind: Place
  connector: csv 
  connector_options:
      encoding: utf-8
      columns: from_header

  property_map:
    - property: __key__
      external_name: appengine_key
      export_transform: transform.key_id_or_name_as_string

    - property: name
      external_name: name

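For reference, the property_map above only maps the key and a name column; a minimal sketch of the Place model it assumes (the real model presumably defines more properties) would be:

from google.appengine.ext import db

class Place(db.Model):
    # Only the property referenced in the bulkloader property_map above;
    # the actual model likely has additional fields.
    name = db.StringProperty()
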
The bulk load actually succeeds for a truncated, 1,000-record version of the CSV, but the full set eventually bogs down and starts erroring, "backing off" and waiting longer and longer between retries. The bulkloader log that I tail doesn't reveal anything helpful, and neither does the server's stderr.

Any help in understanding this bulk load process would be appreciated. My plan is to eventually load large datasets into the Google datastore, but this isn't promising.


Solution

  • 180MB is a lot of data to load into the dev_appserver - it's not designed for large (or even medium) datasets; it's built entirely for small-scale local testing. Your best bet would be to reduce the size of your test dataset; if you can't do that, try the --use_sqlite command line flag to use the new sqlite-based local datastore, which is more scalable.
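    For example, a sketch of both suggestions (the app path and row count here are placeholders):

    # run the dev server with the sqlite-backed local datastore
    dev_appserver.py --use_sqlite --port=8000 /path/to/your/app

    # or load a smaller sample first (header + first 10,000 rows)
    head -n 10001 ../my_data.csv > ../my_data_sample.csv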