amazon-web-services amazon-s3 aws-lambda amazon-athena pyathena

AWS athena query result file fetching from s3 bucket

Currently I am working on AWS Athena. We have a webpage which will be displaying the query results. The data stored in the s3 bucket is ingested as part of the data lake, AWS Glue.

From our webpage multiple requests/query will be thrown to the AWS Athena. Since this is realtime querying I am not saving the query during the process, with a lambda function I am executing the query passed from the webpage and then displaying the results.

Now I can see the result in json format on the lambda output screen, we are planning to fetch the output file stored in s3 and the push it to the webpage. The limitations that we are facing is that each query result is stored in s3 with a unique UUID filename, when multiple request hits how do we get that data from the s3 bucket.

New file creation idea was on my mind but for multiple request from multiple places at a time we cant use that concept, i dont find any athena api that gets the output file also.

The second thing that came to my mind was with the request id, but i cannot find any reference to the request id in the output file. Please suggest.

I am new to Athena service. Thanks in advance :)

Solution

One thing you can do is to use a "client request token" when you start queries. This will tell Athena that if the same SQL has been run with the same token, it should not run the query again but return the previous query execution ID. See StartQueryExecution/ClientRequestToken.

Here's how it works: say you run the query SELECT foo FROM bar. If you run this query twice the first StartQueryExecution API call will return a different query execution ID from the second. However, if you pass a client request token with both API calls, both calls will return the same query execution ID. This works regardless of whether the query has finished or not, all that happens is that you get the same query execution ID. When you then do GetQueryExecution you will get the status of the query, and if it's finished you will get the output location – which will be the same.

The client request token feature is a kind of cache mechanism, and as with all caching, it's important to think about invalidation. Athena won't know if your underlying data has changed, so you should take care to use tokens that will change when your data changes, for example by including a time stamp. What I do when I don't have anything better is to hash the SQL and append a timestamp rounded off to the nearest hour (or five minutes, or some time period close to how often the underlying data refreshes).