pyspark, aws-glue

Running a SQL script on the source side against multiple tables in a single Glue job, writing each result to S3 with the corresponding table name


    sql_list = ['(select * from table1 where rownum <= 100) alias1',
                '(select * from table2 where rownum <= 100) alias2']

    for sql_statement in sql_list:
        df = spark.read.format("jdbc").option("driver", jdbc_driver_name)\
            .option("url", db_url)\
            .option("dbtable", sql_statement)\
            .option("user", db_username)\
            .option("password", db_password)\
            .option("fetchSize", 100000).load()

        df.write.format("parquet").mode("overwrite").save("s3://s3-location/" + sql_statement)

The source was an Oracle database.

I was able to run the array of queries and store the results in S3 as Parquet, but each output path was named after the full query string from sql_list. I would like to write the data to S3 under the names alias1 and alias2 respectively.
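
For context, the connection placeholders in the snippet would look roughly like this for an Oracle source (the host, port, service name, and credentials below are illustrative only, not the real values):

    # Illustrative Oracle JDBC settings; host, port, service name, and
    # credentials are placeholders.
    jdbc_driver_name = "oracle.jdbc.OracleDriver"
    db_url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"
    db_username = "glue_user"
    db_password = "********"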


Solution

  • Consider using a dictionary instead of a list, since this is neater and more flexible:

        sql_list = {'alias1': '(select * from table1 where rownum <= 100) alias1',
                    'alias2': '(select * from table2 where rownum <= 100) alias2'}

        for table, sql_statement in sql_list.items():
            # Read each query through JDBC; the dict key carries the S3 name.
            df = spark.read.format("jdbc").option("driver", jdbc_driver_name)\
                .option("url", db_url)\
                .option("dbtable", sql_statement)\
                .option("user", db_username)\
                .option("password", db_password)\
                .option("fetchSize", 100000).load()

            # Write each result under its alias instead of the raw SQL string.
            df.write.format("parquet").mode("overwrite").save("s3://s3-location/" + table)


    Otherwise, you will need to do a somewhat dirty split on the query string:

        df.write.format("parquet").mode("overwrite").save("s3://s3-location/" + sql_statement.split(' ')[-1])
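
    If the queries don't always end in a single clean token, a small regex helper can be a bit safer than splitting on spaces. This is only a sketch on top of the loop above; extract_alias is an illustrative helper, not part of the original answer, and it assumes the alias is the last word of each statement.

        import re

        def extract_alias(sql_statement):
            # Pull the trailing word, e.g. 'alias1' from
            # '(select * from table1 where rownum <= 100) alias1'
            match = re.search(r'([A-Za-z_][A-Za-z0-9_]*)\s*$', sql_statement)
            return match.group(1) if match else sql_statement

        df.write.format("parquet").mode("overwrite").save("s3://s3-location/" + extract_alias(sql_statement))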