Tags: java, google-app-engine, bulkloader

Java Google App Engine bulk loader upload error “OverflowError: date value out of range”


I'm using the bulk loader to download my Review kind from one GAE server and upload it to another. The Review kind contains about 30,000 entities.

I use this command to upload:

appcfg.py upload_data --kind=Review --filename=review.csv --application=<MY_APP_ID> --url=http://<MY_APP_ID>.appspot.com/remote_api --rps_limit=500 --bandwidth_limit=2500000 --batch_size=100

NOTE: The entire error is pretty long. So I've included it at the bottom of this question.

The upload doesn't start at all; it stops at this line: [INFO ] Opening database: bulkloader-progress-20110611.222101.sql3

I see at the bottom of the entire trace: lambda val: _EPOCH + datetime.timedelta(microseconds=val), OverflowError: date value out of range

I suspect that this is caused by badly formatted dates in my publisheddate column (see below for the entire Review kind).


What I've already tried to fix it

I ran the GQL query SELECT * FROM Review WHERE publisheddate = NULL in the GAE admin portal and found around 15 entities with null values for the publisheddate column. I deleted those, re-downloaded, and tried re-uploading the Review kind, but I still get the same error.

I'm thinking: since I'm running the upload command without a config file parameter, e.g. --config_file=bulkloader.yaml (note that the upload command above doesn't specify --config_file), I'm relying on the datastore statistics to determine what gets downloaded to my review.csv file.

And because the datastore statistics haven't been updated since I deleted the entities with a null publisheddate, re-downloading and then re-uploading immediately isn't doing anything different from my initial download and upload.


My Questions

  1. As there are 30,000 entities, I'm unsure how I can locate the ones with badly formatted dates. Is there a quick way to locate entities with bad publisheddate values? (See the sketch just after this list for the kind of query I have in mind.)

  2. Does anyone know how to fix this issue? I'm not sure from the error trace what the real problem is.
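
For reference, the kind of quick check I have in mind is sketched below. It is only a rough sketch: it assumes Objectify 3's query/filter API, a getId() getter that the Review class shown further down omits, and that any publisheddate before the Unix epoch counts as "bad". It would also need to run somewhere with datastore access, e.g. a throwaway admin servlet or via the remote API.

import java.util.Date;
import java.util.List;

import com.googlecode.objectify.Objectify;
import com.googlecode.objectify.ObjectifyService;

public class ReviewDateAudit {

    static {
        // Objectify needs every entity class registered before it is used.
        ObjectifyService.register(Review.class);
    }

    /** Prints the ids of Review entities whose publisheddate looks bogus. */
    public static void findSuspiciousDates() {
        Objectify ofy = ObjectifyService.begin();

        // A published date before 1970-01-01 is almost certainly a parsing
        // artifact; tighten this cutoff if older dates are legitimate.
        Date cutoff = new Date(0L);

        List<Review> suspects = ofy.query(Review.class)
                                   .filter("publisheddate <", cutoff)
                                   .list();
        for (Review review : suspects) {
            System.out.println("Suspicious Review id=" + review.getId());
        }
    }
}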


Review kind

I'm using the objectify3.0.jar to manipulate my GAE datastore. So my Review kind looks like this:

import java.util.Date;
import javax.persistence.Id; // Objectify 3.x re-uses the JPA @Id annotation

public class Review {
    @Id private Long id;        // key property, auto-generated by Objectify
    private String reviewlink;
    private Date publisheddate; // the property the bulk loader chokes on
    private double rating;
    private Long gameid;        // to make querying easier

    // omitting getters and setters
}

Entire error

10:20:48>appcfg.py upload_data --kind=Review --filename=review.csv --application=<MY_APP_ID> --url=http://<MY_APP_ID>.appspot.com/remote_api --rps_limit=500 --bandwidth_limit=2500000 --batch_size=100
Uploading data records.
[INFO    ] Logging to bulkloader-log-20110611.222101
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 2500000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 500/second
[INFO    ] Batch Size: 100
[INFO    ] Opening database: bulkloader-progress-20110611.222101.sql3
Traceback (most recent call last):
  File "/usr/local/bin/appcfg.py", line 76, in <module>
    run_file(__file__, globals())
  File "/usr/local/bin/appcfg.py", line 72, in run_file
    execfile(script_path, globals_)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 3592, in <module>
    main(sys.argv)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 3583, in main
    result = AppCfgApp(argv).Run()
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 2229, in Run
    self.action(self)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 3368, in __call__
    return method()
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 3195, in PerformUpload
    run_fn(args)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/appcfg.py", line 3087, in RunBulkloader
    sys.exit(bulkloader.Run(arg_dict))
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 4360, in Run
    return _PerformBulkload(arg_dict)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 4204, in _PerformBulkload
    loader.initialize(filename, loader_opts)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 2879, in initialize
    self.high_id_table = self._find_high_id(self.generate_records(filename))
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 2896, in _find_high_id
    for values in record_generator:
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/bulkloader.py", line 2920, in generate_records
    yield datastore.Entity._FromPb(fixed_entity_proto)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/datastore.py", line 958, in _FromPb
    value = datastore_types.FromPropertyPb(prop)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/datastore_types.py", line 1792, in FromPropertyPb
    value = conversion(value)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/datastore_types.py", line 1722, in <lambda>
    lambda val: _EPOCH + datetime.timedelta(microseconds=val),
OverflowError: date value out of range

Solution

  • Method to find the problematic row

    The method I used to find the problematic row was to run the bulk loader WITH a config file, downloading row by row until it hit a row it couldn't download. Once I corrected the problematic row(s), the download/upload worked fine.

    NOTE: In fact it was multiple rows, but once I found the first one it was just a matter of finding/replacing the others that contained the offending data.

    The command I used to find the row was:

    appcfg.py upload_data --config_file=bulkloader.yaml --kind=Review --filename=review.csv --application=<MY_APP_ID> --url=http://<MY_APP_ID>.appspot.com/remote_api --rps_limit=500 --bandwidth_limit=2500000 --batch_size=100
    

    NOTE: In my question above, I was NOT using a config file. In the command just above, I AM using one. The config file is generated from the datastore statistics using this command:

    appcfg.py create_bulkloader_config --filename=bulkloader.yaml --application=<appname> --url=http://<appname>.appspot.com/remote_api
    

    The primary reason for using the config file is that it downloads my datastore data formatted as a proper CSV file.
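
    For illustration, the part of the generated bulkloader.yaml that covers the date property looks roughly like this. The real file is generated per kind from the datastore statistics, so treat this as an approximate excerpt rather than the exact contents of my file:

    python_preamble:
    - import: google.appengine.ext.bulkload.transform
    - import: google.appengine.api.datastore

    transformers:
    - kind: Review
      connector: csv
      property_map:
        - property: __key__
          external_name: key
          export_transform: transform.key_id_or_name_as_string
        - property: publisheddate
          external_name: publisheddate
          import_transform: transform.import_date_time('%Y-%m-%dT%H:%M:%S')
          export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')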


    Question?

    For some reason, when you don't specify a config file with the bulk loader, the downloaded CSV is all messy, i.e. it contains a lot of random characters. However, there are no problems uploading these files back to your datastore.

    Why do both properly formatted CSV files and these messy CSV files work when uploading to your datastore?


    The problematic row(s)

    The actual problem was that some of the data for the publisheddate property had values set to 00/00/0000 (MM/DD/YYYY). Java converted these to a Date of roughly Sunday, 2nd January 0001 (don't quote me on the exact value, but I recall the year was 0001).

    These values were inserted into the GAE datastore fine, but when the bulk loader tries to download or upload them, it complains with the error posted above (lambda val: _EPOCH + datetime.timedelta(microseconds=val), OverflowError: date value out of range), because a date that far before the Unix epoch falls outside the range Python's datetime can represent (years 1 through 9999).
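
    How such a value can arise: the original parsing code isn't part of this question, so this is only a guess, but Java's SimpleDateFormat in its default lenient mode happily turns a string like 00/00/0000 into a Date in roughly that era instead of rejecting it:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class LenientDateDemo {
        public static void main(String[] args) throws ParseException {
            SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy");

            // Lenient parsing (the default) doesn't reject month 00 / day 00;
            // it silently rolls them backwards and yields a date far before
            // the Unix epoch, around (or before) year 1.
            Date bogus = fmt.parse("00/00/0000");
            System.out.println(bogus);

            // With leniency switched off, the same input fails loudly,
            // which would catch bad rows before they reach the datastore.
            fmt.setLenient(false);
            fmt.parse("00/00/0000"); // throws ParseException
        }
    }

    If the dates do come from parsed strings, turning leniency off (or validating the year) before saving the entity would stop rows like these from being written in the first place.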