Tags: apache-spark-sql, aws-glue, delta-lake

How to set up Spark SQL to work with Delta Lake tables using the Glue metastore?


I followed these instructions to set up a Delta Lake table, and I can query it with Athena but not with Spark SQL. It is a Delta Lake table whose metastore is defined in Glue.

If I execute the following query, spark.sql("SELECT * FROM database_test.my_table where date='200904'"), I get the error:

An error was encountered:
An error occurred while calling o723.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 139.0 failed 4 times, most recent failure: Lost task 0.3 in stage 139.0 (TID 1816) (ip-172-30-114-101.ec2.internal executor 2):
org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://my-bucket/users/deltalake-test/_symlink_format_manifest/date=200904/manifest, range: 0-177, partition values: [200904], isDataPresent: false, eTag: c6706a23e634cef2b86f8a829cb6645c

Is there another way to use Glue as a metastore and run queries with Spark?


Solution

  • It looks like you have defined the Glue table using the manifest approach, which works for Athena, but that table definition will not work for Spark SQL.

    See https://docs.delta.io/latest/presto-integration.html#step-2-configure-presto-trino-or-athena-to-read-the-generated-manifests

    You can have one type of table definition that works with Spark and another type that works with Athena, but not a single definition that works with both. For Spark, define the table as you would with a Hive metastore, as shown in the sketch below.
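
    As a concrete illustration, here is a minimal PySpark sketch of that approach. It assumes Delta Lake is on the Spark classpath, the cluster uses Glue as its Hive metastore (as on EMR), and the Delta data lives at the root of the S3 path from the error message; the database and table names are taken from the question.

        from pyspark.sql import SparkSession

        # Enable Delta Lake support; these are the standard Delta Lake
        # session settings from the Delta documentation.
        spark = (
            SparkSession.builder
            .config("spark.sql.extensions",
                    "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .enableHiveSupport()  # Glue-backed Hive metastore on the cluster
            .getOrCreate()
        )

        # Register the table against the Delta data location itself
        # (assumed path), not the _symlink_format_manifest directory.
        spark.sql("""
            CREATE TABLE IF NOT EXISTS database_test.my_table
            USING DELTA
            LOCATION 's3://my-bucket/users/deltalake-test/'
        """)

        # The original query should now read the Delta transaction log directly.
        spark.sql("SELECT * FROM database_test.my_table where date='200904'").show()

    Because the table is defined with USING DELTA and a LOCATION, Spark reads the Delta transaction log directly, so the symlink manifests (which exist only for Presto/Trino/Athena) are never touched.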