I am using the following code to query a DynamoDB table and paginate through the results:
import decimal
import json

import boto3
from boto3.dynamodb.conditions import Key
from botocore.config import Config


class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super(DecimalEncoder, self).default(o)


def run(date: int, start_epoch: int, end_epoch: int):
    dynamodb = boto3.resource('dynamodb',
                              region_name='REGION',
                              config=Config(proxies={'https': 'PROXYIP'}))
    table = dynamodb.Table('XYZ')
    response = table.query(
        # ProjectionExpression="#yr, title, info.genres, info.actors[0]",  # acts like a SELECT column list
        # ExpressionAttributeNames={"#yr": "year"},  # rename for the SELECT
        KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch)
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))
    while 'LastEvaluatedKey' in response:
        response = table.scan(  # IS THIS INEFFICIENT CODE?
            # ProjectionExpression=pe,
            # FilterExpression=fe,
            # ExpressionAttributeNames=ean,
            ExclusiveStartKey=response['LastEvaluatedKey']
        )
        for i in response['Items']:
            print(json.dumps(i, cls=DecimalEncoder))
Although this code works, it is incredibly slow, and I suspect the 'response = table.scan' call is the cause. I am under the impression that queries are much faster than scans (since a scan has to iterate over the entire table). Is this code causing a complete iteration of the table?
This might be a separate question, but is there a more efficient way (with code examples) of doing this? I've attempted using Boto3's pagination but I could not get that working with queries either.
Unfortunately, yes, a "Scan" operation reads the entire table. You didn't say what your table's partition key is, but if it is a date, then what you really want here is to read a single partition, and that is exactly what a "Query" operation does much more efficiently, because it can jump directly to the required partition instead of scanning the entire table looking for it.
Even with Query, you still need to do paging exactly as you did, because the partition may still hold a lot of items. The loop stays the same (LastEvaluatedKey fed back in as ExclusiveStartKey), but each page should be fetched with query rather than scan, so at least you won't be scanning the entire table.
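Concretely, a minimal sketch of that paging loop, reusing the table, the Key import, and the key names from your question, with query used on every iteration instead of scan:

response = table.query(
    KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch)
)
for i in response['Items']:
    print(json.dumps(i, cls=DecimalEncoder))
while 'LastEvaluatedKey' in response:
    # Continue the same query from where the previous page stopped.
    response = table.query(
        KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch),
        ExclusiveStartKey=response['LastEvaluatedKey']
    )
    for i in response['Items']:
        print(json.dumps(i, cls=DecimalEncoder))

As for boto3's built-in pagination: the paginators live on the low-level client, not on the Table resource, which may be why it didn't work for you with the resource's query. A sketch, assuming 'date' and 'uid' are numeric attributes (note the low-level client uses typed attribute values):

client = boto3.client('dynamodb', region_name='REGION')
paginator = client.get_paginator('query')
pages = paginator.paginate(
    TableName='XYZ',
    KeyConditionExpression='#d = :d AND uid BETWEEN :lo AND :hi',
    ExpressionAttributeNames={'#d': 'date'},  # alias "date", which is a DynamoDB reserved word
    ExpressionAttributeValues={
        ':d': {'N': str(date)},
        ':lo': {'N': str(start_epoch)},
        ':hi': {'N': str(end_epoch)},
    },
)
for page in pages:
    for item in page['Items']:
        print(item)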
By the way, scanning the entire table also costs you a lot of read operations. You can ask AWS how many read units were consumed, and this can help you catch cases where you're reading too much, beyond the obvious slowness you noticed.
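If you want to see that per request, you can ask DynamoDB to return the consumed capacity on each call; ReturnConsumedCapacity is a standard parameter of both query and scan. A small sketch:

response = table.query(
    KeyConditionExpression=Key('date').eq(date) & Key('uid').between(start_epoch, end_epoch),
    ReturnConsumedCapacity='TOTAL'
)
# Read capacity units consumed by this single request (one page).
print(response['ConsumedCapacity']['CapacityUnits'])

Comparing this number between the scan-based and the query-based version will show you directly how much extra reading the scan was doing.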