
Insert thousands of entities in a reasonable time into BigTable


I'm having some issues when I try to insert the 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:

import csv
from databaseModel import *
from google.appengine.ext import db  # used by db.put() below
from google.appengine.ext.db import GqlQuery

def add_cities():
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
    db.put(mylist)

It's taking around 5 minutes (!!!) to do this on the local dev server, and even 10 minutes to delete them with the db.delete() function. When I try it online by calling a test.py page containing add_cities(), the 30s timeout is reached. I'm coming from the MySQL world and I think it's a real shame not to be able to add 36k entities in less than a second. I may be going about it the wrong way, so I'm turning to you:

  • Why is it so slow?
  • Is there any way to do it in a reasonable time?

Thanks :)


Solution

  • First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.

    The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:

    • Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case (see the sketch after this list).
    • Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing the list of regions is both small and infrequently changing, so there may be no need to store it in the datastore in the first place.
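
    For illustration, here's a minimal sketch of the first approach, assuming the Region entities are (re)stored with their code as their key_name so the reference key can be built locally, and that InseeCity and strip_accents come from the asker's databaseModel module:

    import csv

    from google.appengine.ext import db
    from databaseModel import InseeCity, strip_accents

    def add_cities():
        spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
        cities = []
        for row in spamReader:
            code = row[2].decode('utf-8')
            name = row[11].decode('utf-8')
            cities.append(InseeCity(
                # Build the Region key locally from its code: no query, no get,
                # assuming Region was stored with key_name=code.
                region=db.Key.from_path('Region', code),
                name=name,
                name_f=strip_accents(name).lower()))
        # Batch the writes; batch put calls have an entity-count cap
        # (500 entities per call at the time of writing).
        for start in range(0, len(cities), 500):
            db.put(cities[start:start + 500])

    This removes the per-row RPCs entirely; the only datastore traffic left is the batched puts themselves.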

    In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
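
    If you go the mapreduce route, the mapper itself can stay very small. The sketch below is only illustrative: it assumes the CSV has been uploaded to the blobstore, that the job is configured (for example in mapreduce.yaml) to use the blobstore line input reader, which hands each mapper call a (byte_offset, line) pair, and that Region entities use their code as key_name as suggested above; exact module paths and reader names depend on the version of the mapreduce library you bundle.

    from google.appengine.ext import db
    from mapreduce import operation as op

    from databaseModel import InseeCity, strip_accents

    def import_city(data):
        # The line input reader yields (byte_offset, line) pairs; a plain split
        # stands in for full CSV parsing in this sketch.
        _, line = data
        fields = line.split('\t')
        code = fields[2].decode('utf-8')
        name = fields[11].decode('utf-8')
        # Yielding a Put operation lets the framework batch the datastore writes
        # and spread the work across tasks, avoiding the request deadline.
        yield op.db.Put(InseeCity(
            region=db.Key.from_path('Region', code),
            name=name,
            name_f=strip_accents(name).lower()))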