Search code examples
amazon-s3amazon-athenadata-lake

Querying across S3 buckets using Athena


I am trying to understand Data Lakes, and most examples show only simple use cases. What I want to understand is effectively 'join queries'.

For example, I have files with product data (uploaded to S3-Product-Data) and a database with product annual sales (uploaded to S3-Product-Sales). How does AWS Lakes / Athena make a query that runs across these two environments?

Of course something will need to link them. I am just not sure what the query would look like, nor what Athena does under the covers to merge the data (and be performant).


Solution

  • What you do is create a table in Athena that references the files with product data, and another table that references the files with annual sales. After that you can run SQL that combines the tables.

    Exactly how the SQL would look depends on your data, what columns it has, etc. If your product data has a product_id column, and your sales data does too, you can join them like this (the column names are of course all made up):

    SELECT product_name, SUM(sales.sold_for) AS total_revenue
    FROM products
    LEFT JOIN sales USING (product_id)