amazon-web-services amazon-s3 aws-glue aws-glue-data-catalog

Read CSV into AWS Glue and join with data from Data Catalogue

I'm fairly new to AWS Glue and would like to understand how to do the following:

Pull a CSV file from a URL in AWS Glue
Join the dataset with a column from a table I have in the Data Catalogue.
Write this back to the Data Catalogue as a new table.

So far I have this:

  tableA_DF = pandas.read_cv("https://example.com/file.csv")
  tableB = glueContext.create_dynamic_frame.from_catalog(database=Z, table_name = Y)
  tableB_DF = TableB.toDF()

However, I'm not sure on how to join the two. Would like to simply add one column from tableB to tableA, and then store the result in the Data catalogue.

Solution

You can read directly into Dynamicframe from S3 using create_dynamic_frame_from_options.

Lookupdata  = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})

Then read the data from catalog using as shown below :

datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")

Then you can join two frames using join and then write this back to catalogue.

[joined_dyf][1] = Join.apply(Lookupdata,Join.apply(datasource0, Lookupdata,'someidfrom_datasource0','someidfrom_Lookupdata')

Now write this back to Catalogue as table refer to this for more information:

sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)