I'm developing an application on Google App Engine using the current django-nonrel and the now-default High Replication Datastore. I'm currently trying to bulk-load a 180MB CSV file into a local dev instance with the following command:
appcfg.py upload_data --config_file=bulkloader.yaml --filename=../my_data.csv --kind=Place --num_threads=4 --url=http://localhost:8000/_ah/remote_api --rps_limit=500
bulkloader.yaml
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users
transformers:
- kind: Place
  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header

  property_map:
    - property: __key__
      external_name: appengine_key
      export_transform: transform.key_id_or_name_as_string

    - property: name
      external_name: name
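For reference, a rough sketch of the datastore kind this config maps to; the kind name Place comes from the config above, but the property type and the use of a plain google.appengine.ext.db model (rather than my actual django-nonrel model) are assumptions for illustration:

# Hypothetical sketch only: the real django-nonrel model defines more properties.
from google.appengine.ext import db

class Place(db.Model):
    # Mapped from the CSV "name" column via the property_map entry above.
    name = db.StringProperty()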
The bulk load actually succeeds for a truncated, 1,000-record version of the CSV, but the full set eventually bogs down and starts erroring out, "backing off" and waiting longer and longer. The bulkloader log that I tail doesn't reveal anything helpful, and neither does the server's stderr.
Any help in understanding this bulk load process would be appreciated. My plan is to eventually be able to load large data sets into the Google datastore, but this isn't promising.
180MB is a lot of data to load into the dev_appserver - it's not designed for large (or even medium) datasets; it's built entirely for small-scale local testing. Your best bet would be to reduce the size of your test dataset; if you can't do that, try the --use_sqlite command line flag to use the new SQLite-based local datastore, which is more scalable.
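For example, something like the following when starting the dev server (assuming the older Python SDK's dev_appserver.py, your app directory as the argument, and port 8000 to match the --url in your upload command):

dev_appserver.py --use_sqlite --port=8000 path/to/your/app

With that flag the local datastore stub is backed by SQLite instead of being held in memory and pickled to disk, which copes much better with larger datasets.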