Search code examples
pythontwitteramazon-dynamodbboto3

DynamoDB Querying in Python (Count with GroupBy)


This may be trivial, but I loaded a local DynamoDB instance with 30GB worth of Twitter data that I aggregated.

The primary key is id (tweet_id from the Tweet JSON), and I also store the date/text/username/geocode.

I basically am interested in mentions of two topics (let's say "Bees" and "Booze"). I want to get a count of each of those by state by day.

So by the end, I should know for each state, how many times each was mentioned on a given day. And I guess it'd be nice to export that as a CSV or something for later analysis.

Some issues I had with doing this...

First, the geocode info is a tuple of [latitude, longitude] so for each entry, I need to map that to a state. That I can do.

Second, is the most efficient way to do this to go through each entry and manually check if it contains a mention of either keyword and then have a dictionary for each that maps the date/location/count?

EDIT:

Since it took me 20 hours to load all the data into my table, I don't want to delete and re-create it. Perhaps I should create a global secondary index (?) and use that to search other fields in a query? That way I don't have to scan everything. Is that the right track?

EDIT 2:

Well, since the table is on my computer locally I should be OK with just using expensive operations like a Scan right?

So if I did something like this:

query = table.scan(
     FilterExpression=Attr('text').contains("Booze"),
     ProjectionExpression='id, text, date, geo', 
     Limit=100)

And did one scan for each keyword, then I would be able to go through the resulting filtered list and get a count of mentions of each topic for each state on a given day, right?

EDIT3:

response = table.scan(
   FilterExpression=Attr('text').contains("Booze"),
   Limit=100)
//do something with this set
while 'LastEvaluatedKey' in response:
   response = table.scan(
      FilterExpression=Attr('text').contains("Booze"),
      Limit=100, 
      ExclusiveStartKey=response['LastEvaluatedKey']
   )
   //do something with each batch of 100 entries

So something like that, for both keywords. That way I'll be able to go through the resulting filtered set and do what I want (in this case, figure out the location and day and create a final dataset with that info). Right?

EDIT 4

If I add:

ProjectionExpression='date, location, user, text' 

into the scan request, I get an error saying "botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the Scan operation: Invalid ProjectionExpression: Attribute name is a reserved keyword; reserved keyword: location". How do I fix that?

NVM I got it. Answer is to look into ExpressionAttributeNames (see: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ExpressionPlaceholders.html)


Solution

  • Yes, scanning the table for "Booze" and counting the items in the result should give you the total count. Please note that you need to do recursive scan until LastEvaluatedKey is null.

    Refer exclusive start key as well.

    Scan

    EDIT:-

    Yes, the code looks good. One thing to note, the result set wouldn't always contain 100 items. Please refer the LIMIT definition below (not same as SQL database).

    Limit — (Integer) The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed data set size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Query and Scan in the Amazon DynamoDB Developer Guide.