Search code examples
pysparkamazon-emraws-glue

Join tables from different Glue catalogs with PySpark on EMR


To query a Glue Catalog from PySpark on EMR, I set the parameter hive.metastore.glue.catalogid in my cluster configuration.

Is it possible to join tables from different Glue catalogs (on different AWS accounts) ?

I tried to create a view with Athena from one AWS tenant to the other, but apparently PySpark is not able to query SQL views.


Solution

  • This is possible in Pyspark by setting the catalog separator config.

    pyspark --conf spark.hadoop.aws.glue.catalog.separator="/"
    

    The desired catalogs can then be selected directly from your Pyspark sql query. Note the catalog id (account id) is delimited by the separator /:

    spark.sql(select * from `111122223333/demodb.tab1` t1 inner join  `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2).show()
    

    Source AWS Doc