amazon-web-services, apache-spark, amazon-emr, aws-glue

How to get S3 key (path) of a table registered in AWS Glue Data Catalog from Spark job


What is the best way to find out the full path (S3 key) of data stored via the AWS Glue Data Catalog, using Spark (or PySpark)?

For example, if I saved the data in the following way:

my_spark_dataframe \
    .write.mode("overwrite") \
    .format("parquet") \
    .saveAsTable("database_name.table_name")

Solution

  • One way is to query the table's metadata with DESCRIBE FORMATTED and then extract the Location row:

    from pyspark.sql.functions import col

    full_s3_path = spark_session \
        .sql("describe formatted database_name.table_name") \
        .filter(col("col_name") == "Location") \
        .select("data_type").head()[0]
    

    This will return the table's S3 location, for example:

    # full_s3_path=s3://some_s3_bucket/key_to_table_name
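  • Another way, if the job's IAM role is allowed to call the Glue API directly, is to read the Location from the table's StorageDescriptor with boto3. A minimal sketch, assuming the database_name/table_name pair from the question and glue:GetTable permission:

    import boto3

    # Assumes the job runs with credentials that allow glue:GetTable
    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName="database_name", Name="table_name")

    # The S3 location lives in the table's storage descriptor
    full_s3_path = response["Table"]["StorageDescriptor"]["Location"]

    This avoids running a DESCRIBE query through Spark and also works outside of a Spark session, e.g. in a plain Python script.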