Tags: amazon-web-services, apache-spark, pyspark, aws-glue, apache-iceberg

Unable to query Iceberg table from PySpark script in AWS Glue


I'm trying to read data from an Iceberg table. The data is in ORC format and partitioned by a column. I'm getting this error:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table temp_tag_thrshld_iceberg. StorageDescriptor#InputFormat cannot be null for table: temp_tag_thrshld_iceberg (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

This is my code :

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "25g").appName(app_name).getOrCreate()
temp_tag_thrshld_data = spark.sql("SELECT * FROM dev_db.temp_tag_thrshld_iceberg")

If I replace the query with spark.sql("SELECT * FROM a_normal_athena_table"), the code runs fine. I'm also unable to read the data directly from S3: it's in ORC format with Snappy compression, and I don't get any results (I'm probably missing the correct framework to read ORC from S3 directly, but that's an issue for another day).

I've tried validating my table using

aws glue get-table --database-name dev_db --name temp_tag_thrshld_iceberg

and this is the output I got -

{
  "Table": {
    "Name": "temp_tag_thrshld_iceberg",
    "DatabaseName": "dev_db",
    "CreateTime": 1658864256.0,
    "UpdateTime": 1658864347.0,
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "tag",
          "Type": "int",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "1",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "zipcode",
          "Type": "int",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "2",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "threshold_max",
          "Type": "double",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "3",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "level",
          "Type": "string",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "4",
            "iceberg.field.optional": "true"
          }
        }
      ],
      "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
      "previous_metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00000-3a8f33f0-fbef-48c3-b289-6021f62b8b8c.metadata.json",
      "table_type": "ICEBERG"
    },
    "CreatedBy": "IAM Details",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "571708111280",
    "VersionId": "1"
  }
}
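
Two details in this output line up with the first error: StorageDescriptor has no InputFormat entry, and Parameters.table_type is ICEBERG. The sketch below shows one way to check both from the get-table JSON. The helper name describe_glue_table is hypothetical, and the sample string is a trimmed copy of the output above, not a separate result.

```python
import json


def describe_glue_table(get_table_output: str) -> dict:
    """Summarize the fields of `aws glue get-table` output that matter here.

    Hypothetical helper for illustration; it only inspects the JSON text.
    """
    table = json.loads(get_table_output)["Table"]
    descriptor = table.get("StorageDescriptor", {})
    params = table.get("Parameters", {})
    return {
        # Iceberg tables registered in Glue carry table_type=ICEBERG.
        "is_iceberg": params.get("table_type", "").upper() == "ICEBERG",
        # The Hive-style reader fails when this key is absent.
        "has_input_format": bool(descriptor.get("InputFormat")),
        "metadata_location": params.get("metadata_location"),
    }


# Trimmed copy of the get-table output shown above.
sample = """
{
  "Table": {
    "Name": "temp_tag_thrshld_iceberg",
    "StorageDescriptor": {
      "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg"
    },
    "Parameters": {
      "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
      "table_type": "ICEBERG"
    }
  }
}
"""

summary = describe_glue_table(sample)
print(summary)
```

A table that reports is_iceberg as true and has_input_format as false has to be read through an Iceberg catalog, not through the default Hive-style table path.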

I then updated the config to this (based on the Iceberg table configuration):

spark = (
    SparkSession.builder.config("spark.driver.memory", "25g")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .appName(app_name).getOrCreate()
)

Now I'm getting this new error:

An error occurred while calling o87.sql. Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.iceberg.spark.SparkSessionCatalog


Solution

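  • The "Cannot find catalog plugin class" error means the Iceberg classes are not on the job's classpath. To read Iceberg tables in Glue, you have to use the Apache Iceberg Connector for AWS Glue: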

    https://aws.amazon.com/marketplace/pp/prodview-iicxofvpqvsio

    The blog post below explains in detail how to read data from Iceberg tables with AWS Glue:

    https://aws.amazon.com/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/
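
Once the connector is attached to the job, the Spark session still needs an Iceberg catalog that points at the Glue Data Catalog. The following is a minimal sketch of that configuration, not a definitive setup. The class names and config keys come from Iceberg's AWS integration. The catalog name glue_catalog, the warehouse path s3://dev_db/athena-tables/, and the helpers iceberg_glue_conf and build_session are assumptions made for illustration.

```python
def iceberg_glue_conf(catalog_name: str, warehouse: str) -> dict:
    """Spark settings that register an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def build_session(app_name: str, conf: dict):
    """Create a SparkSession with the given settings applied."""
    # Imported here so the config helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumed catalog name and warehouse path; adjust both to your environment.
conf = iceberg_glue_conf("glue_catalog", "s3://dev_db/athena-tables/")

# Inside the Glue job (requires the Iceberg jars on the classpath):
# spark = build_session(app_name, conf)
# df = spark.sql("SELECT * FROM glue_catalog.dev_db.temp_tag_thrshld_iceberg")
```

With this approach the table is addressed as catalog.database.table, so the query names glue_catalog explicitly and does not rely on spark_catalog.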