amazon-web-services, apache-spark, amazon-emr, aws-glue

How to get S3 key (path) of a table registered in AWS Glue Data Catalog from Spark job


What is the best way to find out the full path (S3 key) of data stored via the AWS Glue Data Catalog, using Spark (or PySpark)?

For example, if I saved the data in the following way:

my_spark_dataframe \
    .write.mode("overwrite") \
    .format("parquet") \
    .saveAsTable("database_name.table_name")

Solution

  • One way is to query the table's metadata with DESCRIBE FORMATTED and then extract the Location row:

    from pyspark.sql.functions import col

    full_s3_path = spark_session \
        .sql("describe formatted database_name.table_name") \
        .filter(col("col_name") == "Location") \
        .select("data_type").head()[0]
    

    This will return the table's S3 location, for example:

    # full_s3_path=s3://some_s3_bucket/key_to_table_name
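  • Another way, if the job's IAM role is allowed to call the Glue API directly, is to read the Location from the table's StorageDescriptor with boto3. A minimal sketch, assuming the database_name/table_name pair from the question and glue:GetTable permission:

    import boto3

    # Assumes the job runs with credentials that allow glue:GetTable
    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName="database_name", Name="table_name")

    # The S3 location lives in the table's storage descriptor
    full_s3_path = response["Table"]["StorageDescriptor"]["Location"]

    This avoids running a DESCRIBE query through Spark and also works outside of a Spark session, e.g. in a plain Python script.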