amazon-web-services amazon-s3 amazon-rds amazon-athena

how to analyze data in aws athena from postgresql pg dump in s3?

I have taken pg dump of large tables (>50 GB) in AWS S3 in csv.gz format from postgresql RDS. I want to store this data into AWS athena in the form of actual tables as in the RDS, so that when requirement is there i can directly use the data without restoring those tables in the RDS.

Any suggestions will be much appreciated.

Solution

If each file contains data from one table in CSV format, you can create a table to query that data.

I find the easiest way is to use AWS Glue:

Put each file relating to a table in a separate subdirectory in S3
Create a crawler in AWS Glue, pointing it to the directory
It will automatically create a table in Amazon Athena that lets you query the data

If you wish to improve performance of the queries, I would recommend that you convert the data into Snappy-compressed Parquet format.

From Examples of CTAS queries - Amazon Athena:

CREATE TABLE new_table
WITH (
      format = 'Parquet',
      write_compression = 'SNAPPY')
AS SELECT *
FROM old_table;