Search code examples
amazon-web-servicesamazon-s3amazon-rdsamazon-athena

how to analyze data in aws athena from postgresql pg dump in s3?


I have taken pg dump of large tables (>50 GB) in AWS S3 in csv.gz format from postgresql RDS. I want to store this data into AWS athena in the form of actual tables as in the RDS, so that when requirement is there i can directly use the data without restoring those tables in the RDS.

Any suggestions will be much appreciated.


Solution

  • If each file contains data from one table in CSV format, you can create a table to query that data.

    I find the easiest way is to use AWS Glue:

    • Put each file relating to a table in a separate subdirectory in S3
    • Create a crawler in AWS Glue, pointing it to the directory
    • It will automatically create a table in Amazon Athena that lets you query the data

    If you wish to improve performance of the queries, I would recommend that you convert the data into Snappy-compressed Parquet format.

    From Examples of CTAS queries - Amazon Athena:

    CREATE TABLE new_table
    WITH (
          format = 'Parquet',
          write_compression = 'SNAPPY')
    AS SELECT *
    FROM old_table;