pyspark, apache-spark-sql, aws-glue, amazon-athena

Load a table from another database in PySpark


I am currently working with AWS and PySpark. My tables are stored in S3 and queryable from Athena.

In my Glue jobs, I'm used to loading my tables like this:

my_table_df = sparkSession.table("myTable")

However, this time I want to access a table from another database in the same data source (AwsDataCatalog). So I do the following, which works well:

my_other_table_df = sparkSession.sql("SELECT * FROM anotherDatabase.myOtherTable")

I am just looking for a better way to write the same thing, without using a SQL query, in one line, just by specifying the database for this operation. Something that would look like:

sparkSession.database("anotherDatabase").table("myOtherTable")

Any suggestion would be welcome.


Solution

  • You can use the DynamicFrameReader for that. It returns a DynamicFrame, but you can simply call .toDF() on that DynamicFrame to convert it into a native Spark DataFrame:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)

    # Read the table from the Glue Data Catalog, naming the database
    # explicitly, then convert the DynamicFrame to a Spark DataFrame
    data_source = glue_context.create_dynamic_frame.from_catalog(
        database="database",
        table_name="table_name"
    ).toDF()
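
  • As an aside, if the Glue job is configured to use the AWS Glue Data Catalog as its Hive metastore (the --enable-glue-datacatalog job option), Spark can resolve database-qualified table names directly, which gives the one-liner the question asks for. This is a minimal sketch under that assumption, reusing the names from the question:

    # Assumes the Glue Data Catalog is wired up as the Hive metastore;
    # anotherDatabase and myOtherTable are the names from the question
    my_other_table_df = sparkSession.table("anotherDatabase.myOtherTable")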