Tags: hive, apache-spark-sql, amazon-emr, aws-glue, delta-lake

Spark SQL queries against Delta Lake Tables using Symlink Format Manifest


I'm running Spark 3.1.1 on an AWS emr-6.3.0 cluster with the following Hive/metastore configurations:

spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
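
For reference, here is a minimal sketch of setting the same options programmatically when building the SparkSession; on EMR these would normally come from the cluster configuration instead, and the app name is just a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-glue-example")  // placeholder name, not from the original setup
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
  .enableHiveSupport()
  .getOrCreate()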

I have a Delta Lake table defined in AWS Glue that points to an s3 location, specifically to a symlink manifest file (for Redshift Spectrum integration):

location=s3://<bucket>/<database>/<table>/_symlink_format_manifest
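
For context, a manifest in that location is typically produced with Delta Lake's generate() API so that Redshift Spectrum can read the table; a rough sketch (paths are placeholders):

import io.delta.tables.DeltaTable

// Regenerating the symlink manifest for an existing Delta table on S3
val deltaTable = DeltaTable.forPath(spark, "s3://<bucket>/<database>/<table>/")
deltaTable.generate("symlink_format_manifest")  // writes the _symlink_format_manifest/ directory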

When I run Spark SQL queries (either from a Spark application or spark-shell) against the table (e.g. "select * from database.table limit 10"), I get the following exception:

Caused by: java.lang.RuntimeException: s3://<bucket>/<database>/<table>/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]

It seems that Spark SQL / Hive is not set up to read the manifest file that the external table defined in Glue points to for Redshift integration. I'm wondering if anyone else has encountered and troubleshot this problem, and whether there's a way to configure Spark or Hive to correctly read the Delta Lake parquet files without changing the external table definition in Glue (which points to a manifest).


Solution

  • I ended up figuring this out myself.

    You can save yourself a lot of pain and misunderstanding by grasping the distinction between querying a Delta Lake external table (via Glue) and querying a Delta Lake table directly; see: https://docs.delta.io/latest/delta-batch.html#read-a-table

    To query the Delta Lake table directly, without going through the external table, simply change the table reference in your Spark SQL query to the following format:

    delta.`<table-path>`
    

    For example,

    spark.sql("""select * from delta.`s3://<bucket>/<key>/<table-name>/` limit 10""")
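
    Equivalently, if you prefer a named reference, the same path can be loaded through the DataFrame reader and registered as a temporary view; a small sketch (the view name is just an example):

    // Load the Delta table by path and expose it under a temporary name
    val df = spark.read.format("delta").load("s3://<bucket>/<key>/<table-name>/")
    df.createOrReplaceTempView("my_delta_table")  // example view name
    spark.sql("select * from my_delta_table limit 10").show()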