Join tables from different Glue catalogs with PySpark on EMR

To query a Glue Catalog from PySpark on EMR, I set the parameter hive.metastore.glue.catalogid in my cluster configuration.

Is it possible to join tables from different Glue catalogs (on different AWS accounts) ?

I tried to create a view with Athena from one AWS tenant to the other, but apparently PySpark is not able to query SQL views.

Solution

This is possible in Pyspark by setting the catalog separator config.

pyspark --conf spark.hadoop.aws.glue.catalog.separator="/"

The desired catalogs can then be selected directly from your Pyspark sql query. Note the catalog id (account id) is delimited by the separator /:

spark.sql(select * from `111122223333/demodb.tab1` t1 inner join  `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2).show()