Search code examples
amazon-web-servicesamazon-s3amazon-redshiftamazon-athenaamazon-redshift-spectrum

AWS Redshift Spectrum - how to get the s3 filenames in the external table


I have external tables created in AWS spectrum to query the s3 data however i am not able to identify the filenames which the record belongs to(i have thousands of files under a bucket)

In AWS Athena we have a pseudo column "$PATH" which will display the s3 filenames is there any similar ways available while using spectrum?


Solution

  • Since recently, you can use specific pseudo-columns to access the path and the size of the object in S3 for lineage information.

    http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html#c-spectrum-external-tables-pseudocolumns

    An example for such a query would be:

    >> select distinct "$path", "$size" from spectrum.sales_part;
    
     $path                                 | $size
    ---------------------------------------+-------
    s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/ |  1616
    s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ |  1444
    s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ |  1444