Hello, I have built an Apache Iceberg database in S3 and added it to the Glue catalog so that I can query it from Athena.
Now I am trying to perform some ETL from Glue notebooks, but it keeps returning the following error:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_table. StorageDescriptor#InputFormat cannot be null for table: my_table (Service: null; Status Code: 0; ErrorCode: null; Request ID: null; Proxy: null)
I have tried two ways of setting up the session, but they both throw the same error.
Script 1:
%connections my-glue-connector
%glue_version 3.0
spark.stop()
sc.stop()
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
conf = SparkConf()
conf.set('spark.sql.catalog.mycatalog','org.apache.iceberg.spark.SparkCatalog')
conf.set('spark.sql.catalog.mycatalog.warehouse','s3://my_bucket/')
conf.set('spark.sql.catalog.glue_catalog.catalog-impl','org.apache.iceberg.aws.glue.GlueCatalog')
conf.set('spark.sql.catalog.glue_catalog.io-impl','org.apache.iceberg.aws.s3.S3FileIO')
conf.set('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
Script 2:
%connections my-glue-connector
%glue_version 3.0
spark.stop()
sc.stop()
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder\
.config('spark.sql.catalog.mycatalog','org.apache.iceberg.spark.SparkCatalog')\
.config('spark.sql.catalog.mycatalog.warehouse','s3://my_bucket/')\
.config('spark.sql.catalog.glue_catalog.catalog-impl','org.apache.iceberg.aws.glue.GlueCatalog')\
.config('spark.sql.catalog.glue_catalog.io-impl','org.apache.iceberg.aws.s3.S3FileIO')\
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
.getOrCreate()
sc = spark.sparkContext
gc = GlueContext(sc)
I can run magic commands to create tables, like:
%%sql
CREATE TABLE AwsDataCatalog.mydatabase.mytable
USING iceberg
AS SELECT col1, col2 FROM (
VALUES
(1240, 4.3)
)
AS t (col1, col2)
But I cannot even retrieve that table, even though I can query it in Athena, so it was indeed created.
SELECT * FROM mytable
won't work, and neither does
SELECT * FROM my_catalog.mydatabase.mytable
I have used this link as a guide.
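The problem is with the catalog name used in the Spark initialization config (mycatalog in the config, my_catalog in the query). In AWS Glue, the catalog that holds all the tables is registered here as glue_catalog (the catalog-impl and io-impl settings already use that name), while the SparkCatalog class and the warehouse are set under a different name. Replace that name with glue_catalog in the config for it to work: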
.config('spark.sql.catalog.glue_catalog','org.apache.iceberg.spark.SparkCatalog')\
.config('spark.sql.catalog.glue_catalog.warehouse','s3://my_bucket/')\
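For reference, the full set of settings with a single consistent catalog name can be sketched as below. This is a minimal sketch, not the only way to do it: the helper names iceberg_glue_conf and build_session are hypothetical, the class names and the s3://my_bucket/ warehouse are taken from the question, and build_session assumes pyspark and the Iceberg jars are available in the notebook.

```python
def iceberg_glue_conf(catalog_name, warehouse):
    """Return Spark settings that register one Iceberg catalog backed by Glue.

    Hypothetical helper: every catalog-scoped key shares the same prefix,
    so the catalog class, warehouse, catalog-impl and io-impl cannot drift apart.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


def build_session(settings):
    """Apply the settings to a SparkSession builder (assumes pyspark is installed)."""
    # Imported lazily so the settings can be inspected without Spark available.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Bucket name is the placeholder used in the question.
settings = iceberg_glue_conf("glue_catalog", "s3://my_bucket/")
```

In a Glue notebook, something like spark = build_session(settings) would then replace the chained .config(...) calls from Script 2.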
To query the table, simply run:
SELECT * FROM glue_catalog.mydatabase.mytable
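If it helps to see the mismatch in the original config, a small check like the following can list catalog names that have options set but were never registered with a catalog class. It is only a sketch in plain Python: the function name unregistered_catalogs is hypothetical, and the dictionary simply mirrors the settings from the question.

```python
from collections import defaultdict

PREFIX = "spark.sql.catalog."


def unregistered_catalogs(settings):
    """Return catalog names that have options but no catalog class entry.

    Keys look like spark.sql.catalog.<name> (the class) or
    spark.sql.catalog.<name>.<option>. A name that only appears with
    options was never registered as a catalog.
    """
    options_by_name = defaultdict(set)
    for key in settings:
        if not key.startswith(PREFIX):
            continue
        name, _, option = key[len(PREFIX):].partition(".")
        options_by_name[name].add(option)  # option is "" for the class entry
    return sorted(name for name, opts in options_by_name.items() if "" not in opts)


# Settings as written in the question: two different catalog names are mixed.
question_settings = {
    "spark.sql.catalog.mycatalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.mycatalog.warehouse": "s3://my_bucket/",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
}

mismatched = unregistered_catalogs(question_settings)
print(mismatched)
```

With the question's settings this reports glue_catalog, because its catalog-impl and io-impl are set while the SparkCatalog class sits under mycatalog. Once every key uses glue_catalog, the list comes back empty.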