Tags: pyspark, jupyter-notebook, aws-glue, apache-iceberg

Cannot read my Glue Catalog table from a Glue notebook with Spark DataFrames


Hello, I have built an Apache Iceberg database in S3 and added it to the Glue Catalog so that I can query it from Athena.

Now I am trying to perform some ETL from Glue notebooks, but it keeps returning the following error:

AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_table. StorageDescriptor#InputFormat cannot be null for table: my_table (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

I have tried two ways of doing this, but they both throw the same error.

Script 1:

%connections my-glue-connector
%glue_version 3.0
spark.stop()
sc.stop()

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

conf = SparkConf()
conf.set('spark.sql.catalog.mycatalog','org.apache.iceberg.spark.SparkCatalog')
conf.set('spark.sql.catalog.mycatalog.warehouse','s3://my_bucket/')
conf.set('spark.sql.catalog.glue_catalog.catalog-impl','org.apache.iceberg.aws.glue.GlueCatalog')
conf.set('spark.sql.catalog.glue_catalog.io-impl','org.apache.iceberg.aws.s3.S3FileIO')
conf.set('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

Script 2:

%connections my-glue-connector
%glue_version 3.0
spark.stop()
sc.stop()

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder\
.config('spark.sql.catalog.mycatalog','org.apache.iceberg.spark.SparkCatalog')\
.config('spark.sql.catalog.mycatalog.warehouse','s3://my_bucket/')\
.config('spark.sql.catalog.glue_catalog.catalog-impl','org.apache.iceberg.aws.glue.GlueCatalog')\
.config('spark.sql.catalog.glue_catalog.io-impl','org.apache.iceberg.aws.s3.S3FileIO')\
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')\
.getOrCreate()

sc = spark.sparkContext
gc = GlueContext(sc)

I can run magic commands to create tables, like:

%%sql
CREATE TABLE AwsDataCatalog.mydatabase.mytable
USING iceberg
AS SELECT col1, col2 FROM (
  VALUES (1240, 4.3)
) AS t (col1, col2)

But I cannot even retrieve that table, which I can query in Athena, so it was indeed created.

SELECT * FROM mytable

does not work, and neither does

SELECT * FROM my_catalog.mydatabase.mytable

I have used this link as a guide.


Solution

  • The problem is with the keyword my_catalog in the Spark initialization config. In AWS, the default catalog where all tables exist is glue_catalog. Replace the configs that use the my_catalog keyword with the actual Glue catalog name for it to work.

    .config('spark.sql.catalog.glue_catalog','org.apache.iceberg.spark.SparkCatalog')\
    .config('spark.sql.catalog.glue_catalog.warehouse','s3://my_bucket/')\
    

    To query the table, you can then simply run:

    SELECT * FROM glue_catalog.mydatabase.mytable
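
    Putting the fix together, the key point is that every catalog-related config must use one consistent catalog name, unlike the mixed mycatalog/glue_catalog keys in the failing scripts. A minimal sketch of the consistent settings (the warehouse path s3://my_bucket/ is a placeholder; the settings are shown as a plain dict so the sketch stands alone without pyspark installed):

    ```python
    # All catalog keys share the same name; queries must use this same name
    # as the table prefix (e.g. glue_catalog.mydatabase.mytable).
    CATALOG = "glue_catalog"

    iceberg_glue_conf = {
        f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{CATALOG}.warehouse": "s3://my_bucket/",  # placeholder path
        f"spark.sql.catalog.{CATALOG}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{CATALOG}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }

    # Usage with pyspark (not executed here):
    #   conf = SparkConf().setAll(iceberg_glue_conf.items())
    #   spark = SparkSession.builder.config(conf=conf).getOrCreate()
    #   spark.sql(f"SELECT * FROM {CATALOG}.mydatabase.mytable")
    ```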