I am having trouble accessing a table in the Glue Data Catalog using PySpark in Hue/Zeppelin on EMR. I have tried both emr-5.13.0 and emr-5.12.1.
I tried following https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.md
but when I try to import GlueContext, it fails with "No module named awsglue.context".
Also, running spark.sql("SHOW TABLES").show()
returns an empty result in Hue/Zeppelin, but in the pyspark shell on the master node I can see and query the table from the Glue Data Catalog.
Any help is much appreciated, thanks!
OK, I spent some time reproducing the issue. I spun up an EMR cluster with "Use AWS Glue Data Catalog for table metadata" enabled. After enabling web connections, I ran a show databases command in Zeppelin, and it worked fine. Here are the command and output from Zeppelin:
%spark
spark.sql("show databases").show
+-------------------+
|airlines-historical|
|            default|
|      glue-poc-tpch|
|     legislator-new|
|        legislators|
|      nursinghomedb|
| nycitytaxianalysis|
| ohare-airport-2006|
|           payments|
|              s100g|
|                s1g|
|           sampledb|
|             testdb|
|               tpch|
|           tpch_orc|
|       tpch_parquet|
+-------------------+
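If the same query comes back empty in your notebook, it may be worth checking which catalog the notebook's Spark session is actually using. The following is only a sketch of how I would diagnose it, not something I ran on your cluster: `describe_catalog` is a hypothetical helper name, and it assumes the `spark` session object that Zeppelin's `%pyspark` interpreter and the pyspark shell already define.

```python
def describe_catalog(spark):
    """Summarise which catalog an existing SparkSession is talking to.

    `spark` is assumed to be the session predefined by the notebook or shell.
    """
    # "hive" means Spark uses an external metastore (on EMR with the Glue
    # option enabled, that should be the Glue Data Catalog); "in-memory"
    # means a session-local catalog, which would explain an empty SHOW TABLES.
    implementation = spark.conf.get("spark.sql.catalogImplementation", "unknown")

    # SHOW DATABASES returns one row per database; the first column is the name.
    databases = [row[0] for row in spark.sql("SHOW DATABASES").collect()]

    return {"catalog_implementation": implementation, "databases": databases}


# In a notebook paragraph or the shell: print(describe_catalog(spark))
```

Running it in both Zeppelin and the pyspark shell on the master node and comparing the results should show whether the two sessions see different catalogs. If the notebook reports "in-memory", the Zeppelin Spark interpreter property `zeppelin.spark.useHiveContext` may be worth a look, though I have not confirmed that this is the cause in your case.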
As for your other issue, "No module named awsglue.context", I don't think it can be resolved on a Zeppelin instance provisioned by EMR. As far as I know, the only way to use awsglue.context is through a Glue development endpoint that you set up in AWS Glue, together with either a Glue Jupyter notebook or a locally installed Zeppelin notebook connected to that development endpoint.
I am not sure whether GlueContext can be accessed directly from an EMR-provisioned Zeppelin notebook; I may be wrong.
You can still access the Glue catalog, since EMR provides an option for that, so you can query the databases and run your ETL jobs.
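For reference, the console checkbox corresponds, as far as I know, to a `spark-hive-site` configuration classification that points the Hive metastore client at Glue. The sketch below just builds that JSON so it can be supplied when creating a cluster without the console; `GLUE_FACTORY` and `glue_catalog_configuration` are names I made up for illustration, and you should verify the property against the EMR documentation for your release.

```python
import json

# Factory class that the EMR documentation gives for pointing Spark's
# Hive metastore client at the Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configuration():
    """Build the EMR configuration list that enables Glue for Spark SQL."""
    return [
        {
            "Classification": "spark-hive-site",
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
    ]


# Print the JSON so it can be pasted into the cluster's software settings.
print(json.dumps(glue_catalog_configuration(), indent=2))
```

This only covers Spark SQL access to the catalog; it does not make the awsglue Python module available on the cluster.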
Thanks.