Search code examples
amazon-s3lineboto

Is there a fast way of accessing line in AWS S3 file?


I have a collection of JSON messages in a file stored on S3 (one message per line). Each message has a unique key as part of the message. I also have a simple DynamoDB table where this key is used as the primary key. The table contains the name of the S3 file where the corresponding JSON message is located.

My goal is to extract a JSON message from the file given the key. Of course, the worst case scenario is when the message is the very last line in the file.

What is the fastest way of extracting the message from the file using the boto library? In particular, is it possible to somehow read the file line by line directly? Of course, I can read the entire contents to a local file using boto.s3.key.get_file() then open the file and read it line by line and check for the id to match. But is there a more efficient way?

Thanks much!


Solution

  • S3 cannot do this. That said, you have some other options:

    1. Store the record's length and position (byte offset) instead of the line number in DynamoDB. This would allow you to retrieve just that record using the Range: header.
    2. Use caching layer to store { S3 object key, line number } => { position, length } tuples. When you want to look up a record by { S3 object key, line number }, reference the cache. If you don't already have this data, you have to fetch the whole file like you do now -- but having fetched the file, you can calculate offsets for every line within it, and save yourself work down the line.
    3. Store the JSON record in DynamoDB directly. This may or may not be practical, given the 64 KB item limit.
    4. Store each JSON record in S3 separately. You could then eliminate the DynamoDB key lookup, and go straight to S3 for a given record.

    Which is most appropriate for you depends on your application architecture, the way in which this data is accessed, concurrency issues (probably not significant given your current solution), and your sensitivities for latency and cost.