I'm fairly new to AWS Glue and would like to understand how to do the following:
So far I have this:
tableA_DF = pandas.read_cv("https://example.com/file.csv")
tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
tableB_DF = TableB.toDF()
However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.
You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options
.
Lookupdata = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})
Then read the data from catalog using as shown below :
datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")
Then you can join two frames using join and then write this back to catalogue.
[joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')
Now write this back to Catalogue as table refer to this for more information:
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)