apache-spark, pyspark, aws-glue, apache-iceberg

Write to Iceberg/Glue table from local PySpark session


I want to be able to read from and write to an Iceberg table hosted on AWS Glue, from my local machine, using Python.

I have already:

  • Created an Iceberg table and registered it on AWS Glue
  • Populated the Iceberg table with limited data using Athena

I can access (read-only) the remote Iceberg table from my local laptop using PyIceberg, and now I want to write data to it. The problem is that Athena imposes some strict limits on write operations, and at the end of the day I’d like to write to the Iceberg table using a dataframe-like interface from Python, and the only option seems to be PySpark for now.
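
For context, the read path that already works with PyIceberg looks roughly like this (a minimal sketch; the catalog name and table identifier are placeholders, and the Glue catalog is assumed to be configured in ~/.pyiceberg.yaml):

from pyiceberg.catalog import load_catalog

# "glue" and "db.iceberg_table" are placeholder names
catalog = load_catalog("glue")
table = catalog.load_table("db.iceberg_table")
print(table.scan().to_pandas().head(10))  # read-only scan into pandas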

So I’m trying to do that, running a PySpark cluster on my local laptop, using the configurations I found in these refs:

The setup code seems to run fine, with prints very similar to those in the reference video:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, FloatType, LongType, StructType, StructField, StringType
import pyspark
import os

conf = (
    pyspark.SparkConf()
        .setAppName('luiz-session')
        # packages
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.18,software.amazon.awssdk:url-connection-client:2.20.18,org.apache.spark:spark-hadoop-cloud_2.12:3.2.0')
        # SQL extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        # configuring the Glue catalog
        .set('spark.sql.catalog.glue', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.glue.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog')
        .set('spark.sql.catalog.glue.warehouse', 's3://my-bucket/iceberg-data')
        .set('spark.sql.catalog.glue.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        # AWS credentials
        .set('spark.hadoop.fs.s3a.access.key', os.environ.get("AWS_ACCESS_KEY_ID"))
        .set('spark.hadoop.fs.s3a.secret.key', os.environ.get("AWS_SECRET_ACCESS_KEY"))
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

Now, when I try to run a query using this:

spark.sql("SELECT * FROM glue.iceberg_table LIMIT 10;").show()

I get the following error:

IllegalArgumentException: Cannot initialize Catalog implementation org.apache.iceberg.aws.glue.GlueCatalog: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
    Missing org.apache.iceberg.aws.glue.GlueCatalog [java.lang.NoClassDefFoundError: software/amazon/awssdk/services/glue/model/InvalidInputException]

I’ve been trying to fix this by changing the conf and by copying the Iceberg jar releases into the Spark home folder, but no luck so far.


Solution

  • The only way I found, in the end, to develop locally with Glue and Iceberg was to use the amazon/aws-glue-libs:glue_libs_4.0.0_image_01 Docker image with DATALAKE_FORMATS=iceberg, and to remove the spark.jars.packages setting from the Spark configuration (see the launch sketch after the refs).

    Refs:
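
For completeness, here is a sketch of how that image can be launched locally; the volume mount, profile name, and ports follow the standard AWS local-development setup for this image and are assumptions to adapt to your environment:

docker run -it --rm \
  -v ~/.aws:/home/glue_user/.aws \
  -e AWS_PROFILE=default \
  -e DATALAKE_FORMATS=iceberg \
  -e DISABLE_SSL=true \
  -p 4040:4040 -p 18080:18080 \
  --name glue_pyspark \
  amazon/aws-glue-libs:glue_libs_4.0.0_image_01 \
  pyspark

Inside the container the Iceberg and AWS Glue jars are already on the classpath, so the SparkConf from the question works with the spark.jars.packages line simply removed:

conf = (
    pyspark.SparkConf()
        .setAppName('luiz-session')
        # no spark.jars.packages here -- the Glue image ships the jars
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        .set('spark.sql.catalog.glue', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.glue.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog')
        .set('spark.sql.catalog.glue.warehouse', 's3://my-bucket/iceberg-data')
        .set('spark.sql.catalog.glue.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)

Writes can then go through Spark's DataFrameWriterV2 API, e.g. df.writeTo("glue.db.iceberg_table").append() (the table identifier here is a placeholder).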