Tags: google-app-engine, google-cloud-datastore, database-backups

Recommended strategies for backing up the App Engine datastore


Right now I use remote_api and appcfg.py download_data to take a snapshot of my database every night. It takes a long time (6 hours) and is expensive. Without rolling my own change-based backup (I'd be too scared to do something like that), what's the best option for making sure my data is safe from failure?
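
For reference, the nightly job boils down to a single appcfg.py invocation along these lines (the app ID and filename here are placeholders):

    appcfg.py download_data \
      --application=your-app-id \
      --url=http://your-app-id.appspot.com/_ah/remote_api \
      --filename=nightly-backup.sqlite3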

PS: I recognize that Google's data is probably way safer than mine. But what if one day I accidentally write a program that deletes it all?


Solution

  • I think you've pretty much identified all of your choices.

    1. Trust Google not to lose your data, and hope you don't accidentally instruct them to destroy it.
    2. Perform full backups with download_data, perhaps less frequently than once per night if it is prohibitively expensive.
    3. Roll your own incremental backup solution.

    Option 3 is actually an interesting idea. You'd need a modification timestamp on all entities, and you wouldn't catch deleted entities, but otherwise it's very doable with remote_api and cursors.
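
    For example, with the db API you can have that timestamp maintained automatically by putting it on a shared base class. A minimal sketch (BaseModel is just an illustrative name; the updated_at property is what the script below expects):

    from google.appengine.ext import db

    class BaseModel(db.Model):
      # auto_now=True refreshes the timestamp on every put()
      updated_at = db.DateTimeProperty(auto_now=True)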

    Edit:

    Here's a simple incremental downloader for use with remote_api. Again, the caveats are that it won't notice deleted entities, and it assumes all entities store the last modification time in a property named updated_at. Use it at your own peril.

    import os
    import hashlib
    import gzip
    from google.appengine.api import app_identity
    from google.appengine.ext.db.metadata import Kind
    from google.appengine.api.datastore import Query
    from google.appengine.datastore.datastore_query import Cursor
    
    INDEX = 'updated_at'  # property holding each entity's last-modified time
    BATCH = 50            # entities fetched per query
    DEPTH = 3             # levels of hash directories used to fan out files
    
    path = ['backups', app_identity.get_application_id()]
    for kind in Kind.all():
      kind = kind.kind_name
      if kind.startswith('__'):
        continue  # skip built-in metadata kinds
      while True:
        print 'Fetching %d %s entities' % (BATCH, kind)
        # Resume from the cursor saved by the previous run, if there is one.
        path.extend([kind, 'cursor.txt'])
        try:
          cursor = open(os.path.join(*path)).read()
          cursor = Cursor.from_websafe_string(cursor)
        except IOError:
          cursor = None
        path.pop()
        query = Query(kind, cursor=cursor)
        query.Order(INDEX)  # oldest modifications first, so the cursor only moves forward
        entities = query.Get(BATCH)
        for entity in entities:
          # Fan files out across DEPTH levels of directories keyed on the
          # SHA-1 of the key, so no single directory collects millions of files.
          digest = hashlib.sha1(str(entity.key())).hexdigest()
          for i in range(DEPTH):
            path.append(digest[i])
          try:
            os.makedirs(os.path.join(*path))
          except OSError:
            pass  # directory already exists
          path.append('%s.xml.gz' % entity.key())
          print 'Writing', os.path.join(*path)
          f = gzip.open(os.path.join(*path), 'wb')
          f.write(entity.ToXml())
          f.close()
          path = path[:-1 - DEPTH]  # drop the filename and hash directories
        if entities:
          # Persist the cursor so the next run resumes where this one stopped.
          path.append('cursor.txt')
          f = open(os.path.join(*path), 'w')
          f.write(query.GetCursor().to_websafe_string())
          f.close()
          path.pop()
        path.pop()  # drop the kind component
        if len(entities) < BATCH:
          break  # a short batch means this kind is exhausted
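
    One way to run it is from an interactive remote_api shell against your app (app ID is a placeholder):

    remote_api_shell.py -s your-app-id.appspot.com

    Because the query is ordered on updated_at and the cursor is saved after every batch, a rerun should pick up roughly where the previous one left off and only fetch entities created or modified since then.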