I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.
In my Glue jobs, I'm used to loading my tables like this:
my_table_df = sparkSession.table("myTable")
However, this time I want to access a table from another database in the same data source (AwsDataCatalog). The following works fine:
my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")
I am just looking for a better way to write the same thing, without a SQL query, in one line, just by specifying the database for this operation. Something that would look like:
sparkSession.database("anotherDatabase").table("myOtherTable")
Any suggestion would be welcome.
You can use the DynamicFrameReader for that. It returns a DynamicFrame, and you can call .toDF() on that DynamicFrame to convert it into a native Spark DataFrame.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)

# Read the table from the other database in the Data Catalog,
# then convert the resulting DynamicFrame into a Spark DataFrame
data_source = glue_context.create_dynamic_frame.from_catalog(
    database="anotherDatabase",
    table_name="myOtherTable"
).toDF()
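The resulting data_source is an ordinary Spark DataFrame, so the usual DataFrame API applies. A quick sanity check might look like this (purely illustrative, not part of the original job):

# Inspect the schema and a few rows of the DataFrame read above
data_source.printSchema()
data_source.show(5)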